Data – Page 3 – Curated SQL

Entity Framework and Default Data Lengths

Most of the time, I love Entity Framework, and ORMs in general. These tools make it easier for companies to ship applications. Are the apps perfect? Of course not – but they’re good enough to get to market, bring in revenue to pay salaries, and move a company forwards.

However, just like any tool, if you don’t know how to use it, you’re gonna get hurt.

One classic example popped up again last month with a client who’d used EF Core to design their database for them. The developers just had to say which columns were numbers, dates, or strings, and EF Core handled the rest.

Read on for the scenario.

Reading Data from Deleted Columns

Published 2025-01-07 by Kevin Feasel

Michael J. Swart checks the typewriter ribbon:

It’s hard to destroy data. Even when a column is dropped, the data is still physically there on the data page. We can use the undocumented/unsupported command DBCC PAGE to look at it.

This ties in with why dropping a column in SQL Server doesn't take very long: when we drop the column, we're just modifying the metadata associated with the table and telling SQL Server to ignore this bit here. Do read the whole thing, and also check out a fun comment from Paul White.
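
As a rough mental model only, here is a toy Python sketch of a metadata-only drop. This is not SQL Server's storage format, and the `ToyTable` class and its methods are hypothetical names made up for illustration. Rows keep the values they were stored with, and dropping a column just flags it in the table's metadata so that normal reads skip it.

```python
class ToyTable:
    """Toy model: rows keep every stored value; metadata decides what is visible."""

    def __init__(self, columns):
        self.columns = list(columns)  # column order as stored in each row
        self.dropped = set()          # metadata: names of dropped columns
        self.rows = []                # stored values, one list per row

    def insert(self, **values):
        # Dropped columns no longer accept values; missing columns store None.
        self.rows.append([
            None if col in self.dropped else values.get(col)
            for col in self.columns
        ])

    def drop_column(self, name):
        # Metadata-only change: no stored row is touched or rewritten.
        if name not in self.columns:
            raise KeyError(name)
        self.dropped.add(name)

    def select_all(self):
        # Normal reads consult the metadata and skip dropped columns.
        return [
            {col: val for col, val in zip(self.columns, row)
             if col not in self.dropped}
            for row in self.rows
        ]

    def raw_row(self, index):
        # Stand-in for looking at the stored bytes directly.
        return list(self.rows[index])


table = ToyTable(["id", "name", "secret"])
table.insert(id=1, name="Ada", secret="hunter2")
table.drop_column("secret")

visible = table.select_all()  # the dropped column is hidden from queries
raw = table.raw_row(0)        # but the old value is still in the stored row
```

In this toy, `drop_column` does the same amount of work no matter how many rows exist, which is the property that makes the real operation quick.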

Data Professional Annual Survey

Published 2024-12-18 by Kevin Feasel

Brent Ozar is canvassing for survey participants:

Every year, I run a salary survey to help folks have better discussions with their managers about salaries, benefits, and career progression.

Take the survey now here.

The anonymous survey closes Sunday, January 12th. On Tuesday the 14th, I’ll publish the overall responses on the blog in Excel format so you can do slicing & dicing.

Please do fill out the survey. There are enough years of data at this point that we can do some interesting historical trending with it.

Thoughts on Data Document Formats

Published 2024-12-16 by Kevin Feasel

Phil Factor shares some musings:

What can be so difficult in creating a sensible standard for Structured Data Documents? To understand why they tend to get improved into unusable complexity, I’ll need to explain a bit of background.

Structured Data Documents come in three different flavors. There are the text files that represent object data, text files that represent tabular data (rows and columns) and text data for the values of the settings, initialization or configuration of applications.

Read on for Phil’s take on the matter.
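
To make the three flavors concrete, here is a minimal Python sketch using only the standard library. The choice of JSON, CSV, and INI as representatives is an assumption for illustration (the excerpt does not name specific formats), and the sample documents are made up.

```python
import configparser
import csv
import io
import json

# Object data: a nested structure, represented here as JSON.
object_doc = '{"name": "Widget", "tags": ["a", "b"], "size": {"w": 2, "h": 3}}'
obj = json.loads(object_doc)

# Tabular data: rows and columns, represented here as CSV.
tabular_doc = "id,name\n1,Widget\n2,Gadget\n"
rows = list(csv.DictReader(io.StringIO(tabular_doc)))

# Configuration data: application settings, represented here as INI.
config_doc = "[server]\nport = 8080\ndebug = false\n"
config = configparser.ConfigParser()
config.read_string(config_doc)
port = config.getint("server", "port")
debug = config.getboolean("server", "debug")
```

Note how differently each parser treats types: JSON carries numbers and nesting natively, CSV hands back every value as a string, and the INI reader converts only when asked.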

Prod Data in Dev

Published 2024-10-24 by Kevin Feasel

Brent Ozar looks at survey results:

No matter which way you slice it, about half are letting developers work with data straight outta production. We’re not masking personally identifiable data before the developers get access to it.

It was the same story about 5 years ago when I asked the same question, and back then, about 2/3 of the time, developers were using production data as-is.

Brent covers some of the challenges involved, and I can add one more: the idea of environments gets really squishy when talking about data science. My development model still needs production data (unless the dev data has the same structural attributes and data distributions as prod), and I don’t really want to train different models in dev/test/prod because, even with the same default data, many algorithms are stochastic in nature: if I run it multiple times, I can end up with different results. And even if I can get the same results by re-running and using a consistent seed, that also introduces a structural instability because I’m relying on a specific seed.
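
A small Python sketch of the seed problem, using a random subsample as a stand-in for a stochastic training step. The function name `subsample_mean` and the data are hypothetical; the point is only that the same data gives the same answer when the seed is pinned, and can give a different answer otherwise.

```python
import random


def subsample_mean(values, sample_size, seed=None):
    """Estimate the mean from a random subsample (stand-in for stochastic training)."""
    rng = random.Random(seed)  # seed=None means a fresh, unpredictable state per call
    sample = rng.sample(values, sample_size)
    return sum(sample) / sample_size


data = list(range(1000))

# Same data, same seed: the "model" is reproducible.
same_seed_a = subsample_mean(data, 50, seed=42)
same_seed_b = subsample_mean(data, 50, seed=42)

# Same data, different seed: the result will usually differ.
other_seed = subsample_mean(data, 50, seed=7)
```

Pinning `seed=42` everywhere gets reproducibility, but the result then depends on that one arbitrary number, which is the structural instability described above.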

In short, I agree with Brent: this is a tough nut to crack.

An Overview of Differential Privacy

Published 2024-10-04 by Kevin Feasel

Zachary Amos covers a topic of note:

Data analytics tools allow users to quickly and thoroughly analyze large quantities of material, accelerating important processes. However, individuals must take care to maintain privacy while doing so, especially when working with personally identifiable information (PII).

One possibility is to perform de-identification methods that remove pertinent details. However, evidence has suggested such options are not as effective as once believed. People may still be able to extract enough information from what remains to identify particular parties.

Read on to learn a bit more about the impetus behind differential privacy and a few of the techniques you can use to get there. The real trick with differential privacy is adding the right kind of noise: enough that an end user cannot unearth the information needed to identify a specific individual, but not so much that it distorts the distribution of the data.
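
One common way to add that kind of noise is the Laplace mechanism. The sketch below is a minimal standard-library illustration, not something taken from the linked article: the function names and the example count are hypothetical, and it assumes a simple counting query, where adding or removing one person changes the answer by at most 1.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with Laplace noise scaled to sensitivity / epsilon."""
    sensitivity = 1.0  # assumption: one individual changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
true_count = 120

# Each release is noisy, so a single answer reveals little about any one person.
releases = [private_count(true_count, epsilon=0.5, rng=rng) for _ in range(10_000)]

# The noise is centered on zero, so aggregate behavior stays close to the truth.
average_release = sum(releases) / len(releases)
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means less noise and more accurate individual answers.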

Schema Validation in MongoDB

Published 2024-09-16 by Kevin Feasel

Robert Sheldon makes me bite my tongue to prevent making schema quality jokes:

In the previous article in this series, I introduced you to schema validation in MongoDB. I described how you can define validation rules on a collection and how those rules validate document inserts and updates. In this article, I continue the discussion by explaining how to apply validation rules to documents that already exist in a collection. Before you start in on this article, however, I recommend that you first review the previous article for an introduction into schema validation.

The examples in this article demonstrate various concepts for working with schema validation, as it applies to existing documents. I show you how to find documents that conform and don’t conform to the validation rules, as well as how to bypass schema validation when inserting or updating a document. I also show you how to update and delete invalid documents in a collection. Finally, I explain how you can use validation options to override the default schema validation behavior when inserting and updating documents.

Read on to learn more about how you can perform some after-the-fact schema validation.
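
As a hedged sketch of what that after-the-fact checking can look like, the Python below builds the MongoDB filter and command documents as plain dictionaries. The helper names, the `films` collection, and the schema are hypothetical, and nothing here connects to a server. The operators themselves (`$jsonSchema`, `$nor`, `collMod`, `validationLevel`, `validationAction`) are standard MongoDB.

```python
def conforming_filter(schema):
    """Filter that matches documents satisfying the JSON Schema."""
    return {"$jsonSchema": schema}


def nonconforming_filter(schema):
    """Filter that matches documents that do NOT satisfy the JSON Schema."""
    return {"$nor": [{"$jsonSchema": schema}]}


def relax_validation_command(collection, schema):
    """collMod command that softens the default validation behavior.

    "moderate" skips validation for updates to documents that are already
    invalid; "warn" logs violations instead of rejecting the write.
    """
    return {
        "collMod": collection,
        "validator": {"$jsonSchema": schema},
        "validationLevel": "moderate",
        "validationAction": "warn",
    }


# Hypothetical validation rules for a hypothetical collection.
schema = {
    "bsonType": "object",
    "required": ["name", "year"],
    "properties": {
        "name": {"bsonType": "string"},
        "year": {"bsonType": "int", "minimum": 1900},
    },
}

good_docs = conforming_filter(schema)
bad_docs = nonconforming_filter(schema)
command = relax_validation_command("films", schema)
```

With a driver such as PyMongo, the two filters would be passed to `find` on the collection and the command to `db.command`; that wiring is an assumption about your setup rather than something shown in the excerpt.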

Enumerating Causes of Dirty and Incomplete Data

Published 2024-08-28 by Kevin Feasel

Joe Celko builds a list and checks it twice:

Many years ago, my wife and I wrote an article for Datamation, a major trade publication at the time, under the title, “Don’t Warehouse Dirty Data!” It’s been referenced quite a few times over the decades but is nowhere to be found using Google these days. The point is, if you have written a report using data, you have no doubt felt the pain of dirty data and it is nothing new.

However, what we never got around to defining was exactly how data gets dirty. Let’s look at some of the ways data get messed up.

I am very slowly working up the nerve to build a longer talk (and YouTube series) on data engineering, and part of that involves understanding why our data tends to be such a mess. Joe has several examples and stories, and I’m sure we could come up with dozens of other reasons.

Contoso Data Generator v2

Published 2024-07-26 by Kevin Feasel

Marco Russo announces an updated product:

I am proud to announce the second version of the Contoso Data Generator!

In January 2022, we released the first version of an open-source project to create a sample relational database for semantic models in Power BI and Analysis Services. That version focused on creating a SQL Server database as a starting point for the semantic model.

We invested in a new version to support more scenarios and products! Yes, Power BI is our primary focus, but 90% of our work could have been helpful for other platforms and architectures, so… why not?

Read on to see how you can use this and generate as much data as you want.

Automate the Power BI Incremental Refresh Policy via Semantic Link Labs

Published 2024-07-10 by Kevin Feasel

Gilbert Quevauvilliers needs to get rid of some data fast:

The scenario here is that quite often there is a requirement to only keep data from a specific start date, or to keep data for the last N years (starting from the first day of January).

Currently, this is not possible in Power BI using the default incremental refresh settings. Typically, you must keep more data than is required.

It is best illustrated by using a working example.

Check out that scenario and how you can use the Semantic Link Labs Python library to resolve it.
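
The date arithmetic behind a "last N years, starting January 1" rule can be sketched in plain Python. This is not the Semantic Link Labs API; the function name is hypothetical, and whether the current partial year counts as one of the N years is an assumption made here for illustration.

```python
from datetime import date


def retention_start(today, years_to_keep):
    """Return January 1 of the earliest year to keep.

    Assumption: the current, partial year counts as one of the kept years.
    """
    if years_to_keep < 1:
        raise ValueError("years_to_keep must be at least 1")
    return date(today.year - (years_to_keep - 1), 1, 1)


# Keeping 3 years as of 2024-07-10 means keeping 2022, 2023, and 2024 so far.
start = retention_start(date(2024, 7, 10), years_to_keep=3)
```

Anything dated before `start` falls outside the window and would be a candidate for removal under this rule.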
