Category: Data

“Production” in Data Analytics

Published 2022-03-22 by Kevin Feasel

Joey Jablonski brings up an important point:

Data-driven environments have a fundamentally different set of needs around testing, deployment, and visibility than traditional business applications. Data-driven environments need access to fresh data on a high level of update frequency to ensure that data engineers and data scientists are able to effect outputs and recommendations on a timeline that has a positive impact on business decisions and customer experiences.

My day job involves running a predictive analytics team. We train models on production data—there’s very little value in training models on artificial dev data (outside of understanding the parameters of the modeling process), so even our development data generally comes from production. I don’t know that I’m sold on data mesh as a solution to this but it’s worth investigation.

Defining Data Quality

Published 2022-03-16 by Kevin Feasel

Ust Oldfield notes the importance of data quality:

We can safely assume this because a lot of organisations do not have data quality at the top of the priority lists. Why might this be the case? Because monitoring data quality and correcting poor quality data is hard. If it was easy, every organisation would have a strategy and method for tracking and improving data quality.
Often, inertia – driven by an overwhelming amount of information to the risks – sets in as it’s difficult to know where to start. So, where do you start?

I’d say there’s an incentive alignment problem with data quality: organizations want it but not enough that they’d trade anything else for it. And agents within the organization consider data quality a chore, so they’re looking for the minimum viable path. Then, for end users, we consider it even more of a chore (or a nuisance). Furthermore, I’m one of those end users who will put in fake data if I can get away with it on the principle that I don’t want you to have my personal information because you’re probably going to sell it or lose it.

The Importance of Data Dictionaries

Published 2021-12-16 by Kevin Feasel

John Morehouse takes us through data dictionaries:

Data professionals—whether they’re database administrators (DBAs), developers, or data scientists—work in a wide and varied landscape usually in flux and filled with challenges. These challenges could range from changing business requirements to keeping up with the sheer velocity at which technology evolves.
It’s also critical for these professionals to understand their organization’s data and how it applies to a given application or business unit. Better outcomes usually come from employing data dictionaries throughout the organization. Through many years of experience in IT professions, I’ve seen the utilization of data dictionaries range from “not at all” to “I’m documenting every possible data attribute known to humankind.” In my experience, data-related projects with data dictionaries as part of the process are far more likely to be successful than projects without them (even extremely populated data dictionaries are more useful than nothing at all). Trust me on this.

Click through to understand why you should trust John on this one.

Determining the Right Batch Size for Deletes

Published 2021-12-09 by Kevin Feasel

Jess Pomfret breaks out the lab coat and safety goggles:

I found myself needing to clear out a large amount of data from a table this week as part of a clean up job. In order to avoid the transaction log catching fire from a long running, massive delete, I wrote the following T-SQL to chunk through the rows that needed to be deleted in batches. The question is though, what’s the optimal batch size?

I usually go with a rule of thumb: 1K for wide tables (in terms of columns and row size) or when there are foreign key constraints, 10K for medium-width tables, and about 25K for narrow tables. But if this is an operation you run frequently, it’s worth experimenting a bit.
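As a rough illustration of the batching pattern, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for SQL Server. The table name `cleanup_demo`, the `expired` flag, and both helper functions are hypothetical, and this is not Jess's T-SQL; in T-SQL the equivalent loop would typically use `DELETE TOP (N)`. Each batch is committed separately so that no single transaction has to cover the whole delete.

```python
import sqlite3
import time


def build_demo(rows):
    """Create an in-memory table where every odd id is flagged for deletion."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cleanup_demo (id INTEGER PRIMARY KEY, expired INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO cleanup_demo (id, expired) VALUES (?, ?)",
        ((i, i % 2) for i in range(rows)),
    )
    conn.commit()
    return conn


def delete_in_batches(conn, batch_size):
    """Delete flagged rows batch_size at a time; return how many batches deleted rows."""
    batches = 0
    while True:
        cur = conn.execute(
            "DELETE FROM cleanup_demo WHERE rowid IN ("
            "SELECT rowid FROM cleanup_demo WHERE expired = 1 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # commit per batch so each transaction stays small
        if cur.rowcount == 0:
            break  # nothing left to delete
        batches += 1
    return batches


if __name__ == "__main__":
    # Try a few candidate batch sizes against identical data and time each run.
    for size in (1_000, 10_000, 25_000):
        conn = build_demo(50_000)
        start = time.perf_counter()
        batches = delete_in_batches(conn, size)
        elapsed = time.perf_counter() - start
        print(f"batch size {size}: {batches} batches in {elapsed:.4f}s")
        conn.close()
```

Timings from an in-memory SQLite database say nothing about SQL Server's transaction log, locking, or foreign key checks, so treat this only as the shape of the experiment: same data, several batch sizes, measure each.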

What Good Data Governance Means

Published 2021-12-07 by Kevin Feasel

Paul Andrew shares some thoughts on the true meaning of data governance:

… Someone asked me what I thought ‘good’ Data Governance might look like for a given data platform – warehouse or analytics solution. This included all aspects of data governance, so it was a very broad question.
To add some actual context, not related to Star Wars, this was during the time when Microsoft started talking about version 2 of the Azure Data Catalogue offering and what else could/should be included in a suite of governance tools. Certainly, long before the days of what we now call Azure Purview. It was also a time when GDPR had a lot of focus for the data community as we battled with the principles and technical implications.
Anyway, with a recent fresh perspective on things, I’ve decided to dust off my original notes and attempt to distill the answer of good data governance into the following areas. What I also find with data governance is that the more I experience in the industry the deeper my perspective on the subject goes.

Paul has put together a lengthy answer on the topic, well worth the read.

Dynamic Data Masking and Granular Unmasking

Published 2021-11-26 by Kevin Feasel

Dennes Torres points out a change to dynamic data masking in Azure SQL DB:

Dynamic data mask is a very interesting security feature allowing us to mask critical fields such as e-mail, phone number, credit card and so on. We can decide what users will be able to see the value of these features or not.
This feature faced many flaws when it was released, but I believe it’s stable now. Although it’s not the main security feature you should care about, it can still be very useful.
However, until very recently, this feature was not very useful. If you mask many fields in many different tables, the fields may require different permission levels in order to be unmasked.

I agree that this is definitely not a security feature. But hey, at least it’s a bit more useful than it was before.

T-SQL Tuesday 144 Roundup

Published 2021-11-19 by Kevin Feasel

Victoria Holt recaps T-SQL Tuesday #144:

This month’s T-SQL Tuesday attracted some great responses! Thank you to everyone who participated!
My invitation for this month’s #tsql2sday was 3 fold on sharing your experiences on data governance
– The current cost of data governance versus its benefits
– The amazing things data governance has enabled you to achieve or will enable you to achieve in the future
– The potential uses for Azure Purview within your estates and the automated deployment options for that

Read on for the recap.

The Importance of Data Governance

Published 2021-11-09 by Kevin Feasel

Rob Farley riffs on another T-SQL Tuesday topic:

But the checks that we do are more about things that the database can allow, but are business scenarios that should never happen.
Plenty of businesses seem to recognise these scenarios all too well, and can point them out when they come across them. You hear phrases like “Oh, we know that’s not right, it should be XYZ instead”. And they become reasons why they don’t really trust their data. It’s a data quality issue, and every time someone comes across a data quality issue, they trust the data a little less.

Click through for Rob’s thoughts.

Generating Artificial Data with Databricks Generator

Published 2021-08-13 by Kevin Feasel

Ust Oldfield shows off a new tool:

Databricks Labs is a relatively new offering from Databricks which showcases what their teams have been creating in the field to help their customers. As a Consultant, this makes my life a lot easier as I don’t have to re-invent the wheel and I can use it to demonstrate value in partnering with Databricks. There’s plenty of use cases that I’ll be using, and extending, with my client but the one I want to focus on in this post is the Data Generator.

Read on for an example of how this works. Something not in Ust’s post but worth mentioning is that you can control the distribution of random numeric features. That’s a piece of functionality you often don’t see in data generators.

Generating Mock Data for SQL Server

Published 2021-08-09 by Kevin Feasel

Chad Callihan has a few options for creating fake data:

It’s easy enough to create a handful of records for testing in SQL Server. What if you want 100 rows or 1000 rows? What if you want data that looks more legitimate compared to gibberish? In this post, we’ll look at different ways to generate mock data.

One of the trickiest things about creating mock data is getting the distributions right. For example, ABS(CHECKSUM(NEWID())) is great (as is RAND(CHECKSUM(NEWID()))), but the results follow a uniform distribution because of the nature of checksums and random number generators. This makes charting numeric values look unnatural. Here’s an example I put together of generating data off of a normal distribution. It does take more effort, but if you’re generating this fake data to show it to users in tools like Power BI or Tableau, having data follow reasonable distributions is a good thing. That is, use whatever distribution makes sense for the particular data element: uniform, normal, Pareto (power law), gamma, etc.
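To make the difference concrete, here is a minimal sketch in Python rather than T-SQL, using only the standard-library random module. The helper names and the parameter values (a mean of 50, a standard deviation of 10, a 0 to 100 range) are assumptions chosen for illustration. It draws the same number of values from a uniform, a normal, and a Pareto distribution and reports what share of each lands in the 40 to 60 band.

```python
import random


def uniform_values(n, low, high, rng):
    """Every value in [low, high] is equally likely: a flat histogram."""
    return [rng.uniform(low, high) for _ in range(n)]


def normal_values(n, mean, stdev, rng):
    """Values cluster around the mean: a bell-shaped histogram."""
    return [rng.gauss(mean, stdev) for _ in range(n)]


def pareto_values(n, alpha, scale, rng):
    """Power-law values: many near the minimum (scale), a few very large."""
    return [scale * rng.paretovariate(alpha) for _ in range(n)]


def share_within(values, low, high):
    """Fraction of values that fall inside [low, high]."""
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same data
    samples = {
        "uniform": uniform_values(10_000, 0, 100, rng),
        "normal": normal_values(10_000, 50, 10, rng),
        "pareto": pareto_values(10_000, 2.0, 10, rng),
    }
    for name, values in samples.items():
        print(f"{name}: {share_within(values, 40, 60):.1%} of values fall in 40-60")
```

With these parameters, roughly a fifth of the uniform values and roughly two thirds of the normal values should fall inside that band, which is the kind of clustering that makes a chart look like real measurements instead of noise.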
