Data – Page 14 – Curated SQL

It’s a very common scenario when we have a database (project) and require some data in selected tables even in the freshly deployed, empty database.
Nomenclature for these data/tables that you can meet are:
– reference data
– master data
– dictionary
Which term have you heard and is closest to you?
By default, SSDT does not support scripting or deploying the data.
Hence: the question is: how we can cope with that?

Read on for the answer.

Comments closed

Generating Anonymous Data

Published 2019-10-03 by Kevin Feasel

Daniel Hutmacher has a nice web API to generate fake customer data:

I’ve been working on a little gadget for a while now, and today I finally got around to completing it and so now I’ve published it for everyone to try out. It’s a web API (wait, wait, don’t go away – it’s for database people!) that creates a randomized list of names, addresses, etc.
In this post, I’ll show you how easy it is to use this service to anonymize a development or test database so you don’t have all that personally identifiable information floating around.

Read the whole thing and check out his service. Also, Daniel was the one who spurred me on to update the theme here to get rid of some problems, so you can thank him for that too.

Comments closed

Predicting Application Problems from the Database

Published 2019-10-01 by Kevin Feasel

Ed Pollack has a pattern for rooting out application problems based on database activity:

We can approach I/O file stats very similarly to how we handled row counts above: Regularly collect data, store it in a reporting table, and then run analytics against it as needed. Since these database metrics are reset when SQL Server services restart, we need to collect a bit more often. We’ll also want to collect often enough to be able to correlate changes to ongoing application activity. Hourly is typically an acceptable collection frequency, but your environment may lend itself to the more frequent or less frequent collection.

What’s nice is that you can get a long way with heuristics and domain knowledge, even before applying data science techniques.

Comments closed

Delta Lake Schema Enforcement

Published 2019-09-25 by Kevin Feasel

Burak Yavuz, et al, explain the concept of schema enforcement with Databricks Delta Lake:

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks to see whether each column in data inserted into the table is on its list of expected columns (in other words, whether each one has a “reservation”), and rejects any writes with columns that aren’t on the list.

Something something “relational database” something something. They also walk us through some examples in a Databricks notebook, so check that out.

Comments closed

Null Checks in Spark DataFrames

Published 2019-09-05 by Kevin Feasel

Bipin Patwardhan gives us four techniques for validating whether data in Spark exists:

The task at hand was pretty simple — we wanted to create a flexible and reusable library of classes that would make the task of data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I am using the null check (are the contents of a column ‘null’). In order to keep things simple, I will be assuming that the data to be validated has been loaded into a Spark DataFrame named “df.”

Click through for those techniques.

Comments closed

“Big” Data

Published 2019-08-27 by Kevin Feasel

Buck Woody explains that “Big Data” is just data:

A few years ago it was all the rage to talk about “Big Data”. Lots of descriptions of “Big Data” popped up, including the “V’s” (Variety, Velocity, Volume, etc.) that proved very helpful. I even have my own definition:
Big Data is any data you can’t process
in the time you want
with the systems you have

This post is quite reasonable in its depiction of the problem. I extend it a bit further than that and talk about difficulty of processing the data. Nonetheless, read Buck’s full thoughts and check out the Big Data Clusters workshop.

Comments closed

Azure Data Share

Published 2019-08-22 by Kevin Feasel

James Serra takes us through a new product announcement:

A brand new product by Microsoft called Azure Data Share was recently announced. It is in public preview. To explain the product in short, any data which resides in Azure storage can be securely shared between a data provider and a data consumer. It does this by copying a snapshot of the data to the consumer’s subscription (called snapshot-based copying, and in the future there will be in-place sharing). It currently supports ADLS Gen1, ADLS Gen2, and Blob storage, and eventually will support Azure Data Explorer, SQL DB, and SQL DW. Check out the Documentation and a video, and then go try it out.
You can share the data with a few clicks as long as the user you are trying to share with has access to an Azure Subscription and storage account. The copying and updating of the data is handled for you using the Microsoft backbone for best performance, and is encrypted during transit. You can specify the frequency at which the data consumers receive updates. It also is a simple way to control, manage, and monitor all of your data sharing.

This is a smart idea. Sharing data between companies is a key requirement in a lot of B2B solutions, yet methods for sharing range from high-development and medium-friction (create your own API) to low-development and high-friction (FTP, e-mail).

Comments closed

Containers and Data

Published 2019-06-13 by Kevin Feasel

Randolph West argues that you should keep data and containers separated:

Where it gets interesting is that the SQL Server container is also where the database files are stored by default. I raised a point (which Grant and others have already noted in the past) that persisted storage volumes allow us to throw away a SQL Server container without throwing away the database files, provided that the container is set up to use that persisted storage.
For example, I can map the SQL Server container to a local directory on my Ubuntu or Windows Server, or — as is the case with Kubernetes — a second container can serve as the storage volume. SQL Server is then just a compute engine, or a “service” as Anthony points out in the Twitter thread.

Because every rule has a counter-example (even this one), there are cases when you do want the data to live with the container. For example, a test database for a unit test runner should probably live inside the container rather than being a persisted volume. The reason is that you can blow away the database after a test run and start over for the next run. Putting together a database for a hackathon or user group can be another exception for the same reason. But for pretty much every other purpose, I’d rather have a persisted volume.

Comments closed

Azure Open Datasets

Published 2019-05-08 by Kevin Feasel

Jen Stirrup is pleased that Microsoft is bringing back open datasets:

Nearly three years ago, I complained bitterly about the demise of Windows Datamarket, which aimed to provide free, stock datasets for any and every purpose. I was a huge fan of the date dimension and the geography dimension, since they really helped me to get started with data warehousing.
So I’m glad to say that the concept is back, revamped and rebuilt for the data scientists today. Azure Open Datasets will be useful to anyone who wants data for any reason: perhaps for learning, for demos, for improving machine learning accuracy, perhaps.

Go check it out.

Comments closed

A Forensic Accounting Case Study

Published 2019-04-05 by Kevin Feasel

I have a new series I’ve started on applying forensic accounting techniques as a data platform specialist:

Before I dig into my case study, I want to make it absolutely clear that these techniques will help you do a lot more than uncover fraud in your environment. My hope is that there is no fraud going on in your environment and you never need to use these tools for that purpose.
Even with no fraud, there is an excellent reason to learn and use these tools: they help you better understand your data. A common refrain from data platform presenters is “Know your data.” I say it myself. Then we do some hand-waving stuff, give a few examples of what that entails, and go on to the main point of whatever talks we’re giving. Well, this series is dedicated to knowing your data and giving you the right tools to learn and know your data.

This first post sets the scene, with subsequent posts getting into detail on the technical aspects.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Data

Scripting and Deploying Data with SSDT

Generating Anonymous Data

Predicting Application Problems from the Database

Delta Lake Schema Enforcement

Null Checks in Spark DataFrames

“Big” Data

Azure Data Share

Containers and Data

Azure Open Datasets

A Forensic Accounting Case Study