The task at hand was pretty simple — we wanted to create a flexible and reusable library of classes that would make the task of data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I am using the null check (are the contents of a column ‘null’). In order to keep things simple, I will be assuming that the data to be validated has been loaded into a Spark DataFrame named “df.”
Click through for those techniques.
A few years ago it was all the rage to talk about “Big Data”. Lots of descriptions of “Big Data” popped up, including the “V’s” (Variety, Velocity, Volume, etc.) that proved very helpful. I even have my own definition:
Big Data is any data you can’t process
in the time you want
with the systems you have
This post is quite reasonable in its depiction of the problem. I extend it a bit further than that and talk about difficulty of processing the data. Nonetheless, read Buck’s full thoughts and check out the Big Data Clusters workshop.
A brand new product by Microsoft called Azure Data Share was recently announced. It is in public preview. To explain the product in short, any data which resides in Azure storage can be securely shared between a data provider and a data consumer. It does this by copying a snapshot of the data to the consumer’s subscription (called snapshot-based copying, and in the future there will be in-place sharing). It currently supports ADLS Gen1, ADLS Gen2, and Blob storage, and eventually will support Azure Data Explorer, SQL DB, and SQL DW. Check out the Documentation and a video, and then go try it out.
You can share the data with a few clicks as long as the user you are trying to share with has access to an Azure Subscription and storage account. The copying and updating of the data is handled for you using the Microsoft backbone for best performance, and is encrypted during transit. You can specify the frequency at which the data consumers receive updates. It also is a simple way to control, manage, and monitor all of your data sharing.
This is a smart idea. Sharing data between companies is a key requirement in a lot of B2B solutions, yet methods for sharing range from high-development and medium-friction (create your own API) to low-development and high-friction (FTP, e-mail).
Where it gets interesting is that the SQL Server container is also where the database files are stored by default. I raised a point (which Grant and others have already noted in the past) that persisted storage volumes allow us to throw away a SQL Server container without throwing away the database files, provided that the container is set up to use that persisted storage.
For example, I can map the SQL Server container to a local directory on my Ubuntu or Windows Server, or — as is the case with Kubernetes — a second container can serve as the storage volume. SQL Server is then just a compute engine, or a “service” as Anthony points out in the Twitter thread.
Because every rule has a counter-example (even this one), there are cases when you do want the data to live with the container. For example, a test database for a unit test runner should probably live inside the container rather than being a persisted volume. The reason is that you can blow away the database after a test run and start over for the next run. Putting together a database for a hackathon or user group can be another exception for the same reason. But for pretty much every other purpose, I’d rather have a persisted volume.
Nearly three years ago, I complained bitterly about the demise of Windows Datamarket, which aimed to provide free, stock datasets for any and every purpose. I was a huge fan of the date dimension and the geography dimension, since they really helped me to get started with data warehousing.
So I’m glad to say that the concept is back, revamped and rebuilt for the data scientists today. Azure Open Datasets will be useful to anyone who wants data for any reason: perhaps for learning, for demos, for improving machine learning accuracy, perhaps.
Go check it out.
Before I dig into my case study, I want to make it absolutely clear that these techniques will help you do a lot more than uncover fraud in your environment. My hope is that there is no fraud going on in your environment and you never need to use these tools for that purpose.
Even with no fraud, there is an excellent reason to learn and use these tools: they help you better understand your data. A common refrain from data platform presenters is “Know your data.” I say it myself. Then we do some hand-waving stuff, give a few examples of what that entails, and go on to the main point of whatever talks we’re giving. Well, this series is dedicated to knowing your data and giving you the right tools to learn and know your data.
This first post sets the scene, with subsequent posts getting into detail on the technical aspects.
There are problems with K-anonymous datasets, namely the homogeneous pattern attack, and the background knowledge attack, details of which are in my original post. A slightly different approach to anonymising public datasets comes in the form of ℓ -diversity, a way of introducing further entropy/diversity into a dataset.
A sensitive data record is made of the following microdata types: the ID; any Key Attributes; and the confidential outcome attribute(s). ℓ -diversity seeks to extend the equivalence classes that we created using K-anonymity by generalisation and masking of the quasi-identifiers (the QI groups) to the confidential attributes in the record as well. The ℓ -diversity principle demands that, in each QI-group, at most 1/ ℓ of its tuples can have an identical sensitive attribute value.
L-diversity is not perfect either, but Duncan gives a good explanation of the topic.
One gets some information about the size of the data files and the used code files. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without the need to download the whole data set.
The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master Thesis in form of an interactive analysis with RTutor. You could also generate a topic list for a seminar, in which students shall replicate some key findings of a resarch article.
I like this idea, particularly because it promotes the notion that if you’re going to write a paper based on a data set, you ought to provide the data set. There are too many cases of typos or accidental miscodings which take an interesting result and render it mundane (or sometimes even the exact opposite of what the paper reads). H/T R-Bloggers
In order to show you the solution, I want to build up a reasonable sized sample. Any solution looks great when reading five records, but let’s kick that up a notch. Or, more specifically, a million notches: I’m going to use a CTE tally table and load 5 million rows.
I want some realistic looking data, so I’ve adapted Dallas Snider’s strategy to build a data set which approximates a normal distribution.
Because this is a little complicated, I wanted to take the time and explain the data load process in detail in its own post, and then apply it in the follow-up post. We’ll start with a relatively small number of records for this demonstration: 50,000. The reason is that you can generate 50K records almost instantaneously but once you start getting a couple orders of magnitude larger, things slow down some.
If you do custom data generation for lower environments, I’d recommend checking this out. Your production data probably doesn’t follow a normal distribution exactly, but a normal distribution is probably closer to reality than the uniform distribution you get with functions like RAND().
The “Maintained by Kaggle” badge means that Kaggle is now and will continue to actively maintain that dataset. This includes regular updates to descriptions and metadata, quicker response rates in discussion, and accurate current data from the source. Our goal is to create seamless workflows that allow everyone to do data science on Kaggle and be confident in the data they work with.
They have several data sets available from different open data projects for several cities, as well as NOAA and the World Bank. If you’re looking for data sets to play with, this is a good option.