
Category: Data

Prod Data in Dev

Brent Ozar looks at survey results:

No matter which way you slice it, about half are letting developers work with data straight outta production. We’re not masking personally identifiable data before the developers get access to it.

It was the same story about 5 years ago when I asked the same question, and back then, about 2/3 of the time, developers were using production data as-is.

Brent covers some of the challenges involved, and I can add one more: the idea of environments gets really squishy when we're talking about data science. My development model still needs production data (unless the dev data has the same structural attributes and data distributions as prod), and I don't really want to train separate models in dev/test/prod. Even with the same underlying data, many algorithms are stochastic in nature: run them multiple times and you can end up with different results. And even if I can get consistent results by re-running with a fixed seed, that introduces a structural instability of its own, because now I'm relying on a specific seed.
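
To make that concrete, here's a minimal, purely illustrative sketch (NumPy only, nothing from Brent's post): a mini-batch SGD fit whose coefficients drift from run to run unless you pin the random seed.

```python
import numpy as np

def fit_sgd(X, y, epochs=20, lr=0.01, rng=None):
    """Fit a linear model with mini-batch SGD; the result depends on the RNG."""
    rng = rng if rng is not None else np.random.default_rng()
    w = rng.normal(size=X.shape[1])          # random initialization
    for _ in range(epochs):
        for i in rng.permutation(len(X)):    # random sample order each epoch
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

# Toy data standing in for "production" data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Two unseeded runs on the same data: slightly different models
print(fit_sgd(X, y))
print(fit_sgd(X, y))

# Seeded runs are reproducible, but now you depend on that specific seed
print(fit_sgd(X, y, rng=np.random.default_rng(42)))
print(fit_sgd(X, y, rng=np.random.default_rng(42)))
```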

In short, I agree with Brent: this is a tough nut to crack.


An Overview of Differential Privacy

Zachary Amos covers a topic of note:

Data analytics tools allow users to quickly and thoroughly analyze large quantities of material, accelerating important processes. However, individuals must ensure to maintain privacy while doing so, especially when working with personally identifiable information (PII). 

One possibility is to perform de-identification methods that remove pertinent details. However, evidence has suggested such options are not as effective as once believed. People may still be able to extract enough information from what remains to identify particular parties. 

Read on to learn a bit more about the impetus behind differential privacy and a few of the techniques you can use to get there. The real trick with differential privacy is adding the right kind of noise: enough that an end user cannot unearth information identifying a specific individual, but not so much that it distorts the distribution of the data.
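
As a minimal sketch of that trade-off (not tied to any particular tool, and with made-up numbers), the classic Laplace mechanism adds noise whose scale is the query's sensitivity divided by the privacy budget epsilon:

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with Laplace noise calibrated to sensitivity/epsilon."""
    rng = rng if rng is not None else np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

true_count = 1_342          # e.g., number of people matching some criterion
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon:>4}: noisy count ~ {laplace_count(true_count, epsilon):,.1f}")

# Smaller epsilon -> more noise -> stronger privacy but less accuracy;
# larger epsilon -> less noise -> better accuracy but weaker privacy.
```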


Schema Validation in MongoDB

Robert Sheldon makes me bite my tongue to prevent making schema quality jokes:

In the previous article in this series, I introduced you to schema validation in MongoDB. I described how you can define validation rules on a collection and how those rules validate document inserts and updates. In this article, I continue the discussion by explaining how to apply validation rules to documents that already exist in a collection. Before you start in on this article, however, I recommend that you first review the previous article for an introduction into schema validation.

The examples in this article demonstrate various concepts for working with schema validation, as it applies to existing documents. I show you how to find documents that conform and don’t conform to the validation rules, as well as how to bypass schema validation when inserting or updating a document. I also show you how to update and delete invalid documents in a collection. Finally, I explain how you can use validation options to override the default schema validation behavior when inserting and updating documents.

Read on to learn more about how you can perform some after-the-fact schema validation.
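
If you want to poke at the same ideas from Python rather than the mongo shell, here's a rough pymongo sketch of the moving parts Robert covers: attaching a $jsonSchema validator to an existing collection, finding documents that don't conform, and bypassing validation on an insert. The database, collection, and field names here are made up.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo"]

# A simple $jsonSchema: documents need a string 'name' and a non-negative int 'age'
schema = {
    "bsonType": "object",
    "required": ["name", "age"],
    "properties": {
        "name": {"bsonType": "string"},
        "age": {"bsonType": "int", "minimum": 0},
    },
}

# Attach the validator to an existing collection; 'moderate' skips updates
# to existing documents that don't already conform
db.command("collMod", "people",
           validator={"$jsonSchema": schema},
           validationLevel="moderate")

# Find existing documents that do NOT conform to the schema
invalid = list(db.people.find({"$nor": [{"$jsonSchema": schema}]}))
print(f"{len(invalid)} non-conforming documents")

# Bypass validation for a one-off insert (requires the appropriate privileges)
db.people.insert_one({"name": "placeholder"},
                     bypass_document_validation=True)
```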


Enumerating Causes of Dirty and Incomplete Data

Joe Celko builds a list and checks it twice:

Many years ago, my wife and I wrote an article for Datamation, a major trade publication at the time, under the title, “Don’t Warehouse Dirty Data!” It’s been referenced quite a few times over the decades but is nowhere to be found using Google these days. The point is, if you have written a report using data, you have no doubt felt the pain of dirty data and it is nothing new.

However, what we never got around to defining was exactly how data gets dirty. Let’s look at some of the ways data get messed up.

I am very slowly working up the nerve to build a longer talk (and YouTube series) on data engineering, and part of that involves understanding why our data tends to be such a mess. Joe has several examples and stories, and I’m sure we could come up with dozens of other reasons.


Contoso Data Generator v2

Marco Russo announces an updated product:

I am proud to announce the second version of the Contoso Data Generator!

In January 2022, we released the first version of an open-source project to create a sample relational database for semantic models in Power BI and Analysis Services. That version focused on creating a SQL Server database as a starting point for the semantic model.

We invested in a new version to support more scenarios and products! Yes, Power BI is our primary focus, but 90% of our work could have been helpful for other platforms and architectures, so… why not?

Read on to see how you can use this and generate as much data as you want.


Automate the Power BI Incremental Refresh Policy via Semantic Link Labs

Gilbert Quevauvilliers needs to get rid of some data fast:

The scenario here is that quite often there is a requirement to only keep data from a specific start date, or where it should be keeping data for the last N number of years (which is the first day in January).

Currently in Power BI using the default Incremental refresh settings this is not possible. Typically, you must keep more data than is required.

It is best illustrated by using a working example.

Check out that scenario and how you can use the Semantic Link Labs Python library to resolve it.
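
Gilbert's actual fix goes through Semantic Link Labs, so I won't try to reproduce that here, but the date arithmetic behind "keep the last N years, starting on the first of January" is simple enough to sketch on its own (purely illustrative):

```python
from datetime import date
from typing import Optional

def rolling_window_start(n_years: int, today: Optional[date] = None) -> date:
    """First of January for a window covering the last n_years, counting the current year."""
    today = today or date.today()
    return date(today.year - n_years + 1, 1, 1)

# "Keep the last 3 years" as of September 2024: everything from 1 January 2022
print(rolling_window_start(3, today=date(2024, 9, 15)))  # 2022-01-01
```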


Data Quality Issues in Python-Based Time Series Analysis

Hadi Fadlallah checks out the data:

Time-series data analysis is one of the most important analysis methods because it provides great insights into how situations change with time, which helps in understanding trends and making the right choices. However, there is a high dependence on its quality.

Data quality mistakes in time series data sets have implications that extend over a large area, such as the accuracy and trustworthiness of analyses, as well as their interpretation. For instance, mistakes can be caused by modes of data collection, storage, and processing. Specialists working on these data sets must acknowledge these data quality obstacles.

Read on for several examples of data quality issues you might run into in a time series dataset, as well as their fixes.
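
As a quick illustration of the kinds of checks involved (this is not Hadi's code; the column names and values are invented), here's a pandas sketch that surfaces duplicate timestamps, missing values, and gaps in the expected frequency, then patches the gaps by interpolation:

```python
import pandas as pd

# Hypothetical readings with three common problems: a duplicated timestamp,
# a missing value, and a gap in the expected daily frequency
df = pd.DataFrame({
    "ts":    ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-05"],
    "value": [10.0, None, 11.0, 14.0],
})
df["ts"] = pd.to_datetime(df["ts"])

# 1) Duplicate timestamps
print(df[df["ts"].duplicated(keep=False)])

# 2) Missing values and gaps, once reindexed to a daily frequency
s = df.drop_duplicates(subset="ts").set_index("ts")["value"].asfreq("D")
print("Missing points:", list(s[s.isna()].index))

# 3) One possible fix: time-weighted interpolation across the gaps
print(s.interpolate(method="time"))
```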


Checking for Duplicate Rows with TidyDensity

Steven Sanderson looks for dupes:

Today, we’re diving into a useful new function from the TidyDensity R package: check_duplicate_rows(). This function is designed to efficiently identify duplicate rows within a data frame, providing a logical vector that flags each row as either a duplicate or unique. Let’s explore how this function works and see it in action with some illustrative examples.

Read on to see how it works. I am curious, though, whether there's an option to ignore certain columns, such as row IDs or other "non-essential" columns you don't want to include in the comparison. It would also be interesting to see how it handles NA or NULL values.
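
I don't know whether check_duplicate_rows() supports that, but for comparison, here's how the column-exclusion question looks in pandas, where duplicated() takes a subset argument for exactly this purpose (data made up):

```python
import pandas as pd

df = pd.DataFrame({
    "row_id": [1, 2, 3, 4],             # non-essential identity column
    "name":   ["a", "b", "a", "b"],
    "score":  [10, 20, 10, None],
})

# Considering every column, nothing is flagged, because row_id always differs
print(df.duplicated(keep=False))

# Ignoring row_id, rows 0 and 2 are flagged as duplicates of one another
print(df.duplicated(subset=["name", "score"], keep=False))
```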


Database Subsetting and Data Generation

Phil Factor tells us about two possibilities for loading a lower environment:

When dealing with the development, testing and releasing of new versions of an existing production database, developers like to use their existing production data. In doing so, the development team will be hit with the difficulties of managing and accommodating the large amount of storage used by a typical production database. It’s not a new problem because the practical storage capacity has grown over the years in line with our ingenuity in finding ways of using it.

To deal with using production data for testing, we generally want to reduce its size by extracting a subset of the entities from a ‘production’ database, anonymized and with referential integrity intact. We then deliver this subset to the various development environments.

Phil gets into some detail on the process behind subsetting and then covers data generation as an alternative.
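
As a toy illustration of the subsetting idea (table and column names are made up, and this is nothing like Phil's actual tooling): sample the parent entities, keep only the child rows that reference them so referential integrity holds, and mask identifying columns before the data leaves production.

```python
import pandas as pd

# Toy "production" tables
customers = pd.DataFrame({
    "customer_id": range(1, 1001),
    "email": [f"user{i}@example.com" for i in range(1, 1001)],
})
orders = pd.DataFrame({
    "order_id": range(1, 5001),
    "customer_id": [(i % 1000) + 1 for i in range(5000)],
})

# 1) Sample a subset of the parent entities
subset_customers = customers.sample(n=50, random_state=1)

# 2) Keep only child rows referencing sampled parents (referential integrity intact)
subset_orders = orders[orders["customer_id"].isin(subset_customers["customer_id"])]

# 3) Anonymize identifying columns before handing the subset to development
subset_customers = subset_customers.assign(
    email=[f"masked{i}@example.invalid" for i in range(len(subset_customers))]
)

print(len(subset_customers), len(subset_orders))
```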
