Data – Page 4 – Curated SQL

An Overview of DataDiluvium

Published 2025-03-20 by Kevin Feasel

Adron Hall has a new tool and a new blog series. The first post is a product overview:

DataDiluvium is a web-based tool available at datadiluvium.com that helps developers, database administrators, and data engineers generate realistic test data from SQL schema definitions. Whether you’re setting up a development environment, creating test scenarios, or preparing data for demonstrations, DataDiluvium streamlines the process of data generation.

The second covers some of the development precepts Adron used:

DataDiluvium is a web-based tool I’ve built designed to help developers, database administrators, and data engineers generate realistic test data based on SQL schema definitions. The tool takes SQL table definitions as input and produces sample data in various formats, making it easier to populate development and testing environments with meaningful data.

The tool is free, so if you’re looking for a sample data generator, check it out.

Comments closed

2025 Data Professional Salary Survey Results

Published 2025-01-15 by Kevin Feasel

Brent Ozar shares this year’s survey results:

We’ve been running our annual Data Professional Salary Survey for almost a decade, and I was really curious to see what the results would hold this year. How would inflation and layoffs impact the database world? Download the raw data here and slice & dice it to see what’s important to you. Here’s what I found.

Read on for Brent’s analysis and grab the data for yourself to try things out. I’ve used this dataset in the past for presentations and it usually goes over pretty well, especially because it includes quite a few real-life data quality challenges.

Comments closed

Parquet File Customization and SQL Server

Published 2025-01-13 by Kevin Feasel

Ed Pollack writes some files:

Previously, we introduced and discussed the Parquet file format and SQL Server and why it is an ideal format for storing analytic data when it does not already reside in a native analytic data store, such as a data lake, data warehouse, or an Azure managed service.

Both Python and the Parquet file format are quite flexible, allowing for significant customization to ensure that file-related tasks are as optimal as possible. Compatibility with other processes, as well as keeping file sizes and properties under control will also be introduced here.

Click through for some examples.

Comments closed

Dealing with Duplicate Data via ROW_NUMBER()

Published 2025-01-10 by Kevin Feasel

Andy Brownsword removes the duplicates:

Data quality and consistency is key to the services we support and solutions we deliver. A gremlin which can undermine that is duplicate data. Let’s start the new year dealing with duplicate data and having a good clear-out.

For our example we’ll consider an Order Product table which contains an OrderID and ProductID, and the combination of these should be unique. Other fields for the duplicate records may differ so we may want to be selective about which records are removed.

This is where I get on my high horse and complain about laziness in data modeling, a very common problem. This takes nothing away from Andy’s post, which is a good method for solving a problem that has gotten out of hand. But if you know that some combination of attributes is unique, add a unique key constraint or a unique non-clustered index right then and there. Doing so will prevent improper duplicate data from ever being an issue. If you don’t know that some combination of attributes must be unique, discuss this with the business side in a way that makes sense for them. Yes, there’s always the risk that you’ll have a conversation later like, “Oh, it turns out that this really should be unique,” but in most cases, you can easily sort this kind of thing out up-front and save a lot of time and effort later on.

Comments closed

Entity Framework and Default Data Lengths

Published 2025-01-09 by Kevin Feasel

Brent Ozar points out one issue you might run into when using Entity Framework:

Most of the time, I love Entity Framework, and ORMs in general. These tools make it easier for companies to ship applications. Are the apps perfect? Of course not – but they’re good enough to get to market, bring in revenue to pay salaries, and move a company forwards.

However, just like any tool, if you don’t know how to use it, you’re gonna get hurt.

One classic example popped up again last month with a client who’d used EF Core to design their database for them. The developers just had to say which columns were numbers, dates, or strings, and EF Core handled the rest.

Read on for the scenario.

Comments closed

Reading Data from Deleted Columns

Published 2025-01-07 by Kevin Feasel

Michael J. Swart checks the typewriter ribbon:

It’s hard to destroy data. Even when a column is dropped, the data is still physically there on the data page. We can use the undocumented/unsupported command DBCC PAGE to look at it.

This is tied in with how we can drop a column in SQL Server and have it not take a very long time: because when we drop the column, we’re just modifying the metadata associated with the table and telling SQL Server to ignore this bit here. Do read the whole thing, and also check out a fun comment from Paul White.

Comments closed

Data Professional Annual Survey

Published 2024-12-18 by Kevin Feasel

Brent Ozar is canvassing for survey participants:

Every year, I run a salary survey to help folks have better discussions with their managers about salaries, benefits, and career progression.

Take the survey now here.

The anonymous survey closes Sunday, January 12th. On Tuesday the 14th, I’ll publish the overall responses on the blog in Excel format so you can do slicing & dicing.

Please do fill out the survey. There are enough years of data at this point that we can do some interesting historical trending with it.

Comments closed

Thoughts on Data Document Formats

Published 2024-12-16 by Kevin Feasel

Phil Factor shares some musings:

What can be so difficult in creating a sensible standard for Structured Data Documents? To understand why they tend to get improved into unusable complexity, I’ll need to explain a bit of background.

Structured Data Documents come in three different flavors. There are the text files that represent object data, text files that represent tabular data (rows and columns) and text data for the values of the settings, initialization or configuration of applications.

Read on for Phil’s take on the matter.

Comments closed

Prod Data in Dev

Published 2024-10-24 by Kevin Feasel

Brent Ozar looks at survey results:

No matter which way you slice it, about half are letting developers work with data straight outta production. We’re not masking personally identifiable data before the developers get access to it.

It was the same story about 5 years ago when I asked the same question, and back then, about 2/3 of the time, developers were using production data as-is:

Brent covers some of the challenges involved, and I can add one more: the idea of environments gets really squishy when talking about data science. My development model still needs production data (unless the dev data has the same structural attributes and data distributions as prod), and I don’t really want to train different models in dev/test/prod because, even with the same default data, many algorithms are stochastic in nature: if I run it multiple times, I can end up with different results. And even if I can get the same results by re-running and using a consistent seed, that also introduces a structural instability because I’m relying on a specific seed.

In short, I agree with Brent: this is a tough nut to crack.

Comments closed

An Overview of Differential Privacy

Published 2024-10-04 by Kevin Feasel

Zachary Amos covers a topic of note:

Data analytics tools allow users to quickly and thoroughly analyze large quantities of material, accelerating important processes. However, individuals must ensure to maintain privacy while doing so, especially when working with personally identifiable information (PII).

One possibility is to perform de-identification methods that remove pertinent details. However, evidence has suggested such options are not as effective as once believed. People may still be able to extract enough information from what remains to identify particular parties.

Read on to learn a bit more about the impetus behind differential privacy and a few of the techniques you can use to get there. The real trick with differential privacy is adding the right kind of noise not to distort the distribution of the data, while still not allowing an end user to unearth enough information to identify a specific individual.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Category: Data