In order to show you the solution, I want to build up a reasonable sized sample. Any solution looks great when reading five records, but let’s kick that up a notch. Or, more specifically, a million notches: I’m going to use a CTE tally table and load 5 million rows.
I want some realistic looking data, so I’ve adapted Dallas Snider’s strategy to build a data set which approximates a normal distribution.
Because this is a little complicated, I wanted to take the time and explain the data load process in detail in its own post, and then apply it in the follow-up post. We’ll start with a relatively small number of records for this demonstration: 50,000. The reason is that you can generate 50K records almost instantaneously but once you start getting a couple orders of magnitude larger, things slow down some.
If you do custom data generation for lower environments, I’d recommend checking this out. Your production data probably doesn’t follow a normal distribution exactly, but a normal distribution is probably closer to reality than the uniform distribution you get with functions like RAND().
The “Maintained by Kaggle” badge means that Kaggle is now and will continue to actively maintain that dataset. This includes regular updates to descriptions and metadata, quicker response rates in discussion, and accurate current data from the source. Our goal is to create seamless workflows that allow everyone to do data science on Kaggle and be confident in the data they work with.
They have several data sets available from different open data projects for several cities, as well as NOAA and the World Bank. If you’re looking for data sets to play with, this is a good option.
The idea of k-anonymity is that every database record appears at least k times. If you have a lot of records and few fields, your value of k could be high. But as you get more fields, it becomes more likely that a combination of fields is unique. If k = 1, then k-anonymity offers no anonymity.
Another problem with k-anonymity is that it doesn’t offer group privacy. A database could be k-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a homogeneity attack.
This is intended to be a “get you thinking” type of post, and John does have links to related posts which flesh things out a bit more.
By large databases I’m roughly meaning databases with several hundred tables, and I usually see a lot of these tables with several hundred GB’s of data in them.
When I generally ask about the reason for no foreign key, I’m told
- they add overhead
- they give no benefit
- we can’t enter our data properly when we have them
The last one in the above list is generally down to poor modelling – an example being entering a later part of a financial transaction before the first part of the financial transaction has taken place. Once both parts of the financial transaction have taken place then the database is left in a consistent state – BUT, that generally being handled by the app NOT the database – OUCH!
There are times where key constraints are too much—often-updating fact tables might be one such scenario. And some of “too much” comes down to hardware quality. But for the most part, those key constraints are one of the clearest forms of database documentation available, not to mention their normal benefits.
The method that I liked the most and finally settled on was to use a public dataset. I wanted data which was skewed in real ways and did not require a lot of work to massage. Microsoft has a great listing of public datasets here.
I decided to go with the NYC Taxi and Limousine Commission (TLC) Trip Record Data. Data is available for most taxi and limousine fares with pickup/drop-off and distance information between January 2009 and June 2018. This includes data for Yellow cab, Green cab, and for hire vehicles. Just the Yellow cab data from 01/2016 – 06/2018 is over 112,000,000 records (24 GBs) and they download into easy to import comma separated values (CSV) files.
Read on to see how you can set it up yourself. As Derik points out at the end, though, this is still one big table, but there are a few columns which can lead to dimensions, things like rate code, location, and payment type.
A few weeks ago I came across this blog post by Matt Chapman. Matt filed FOIA requests with the City of Chicago and, after multiple attempts, was able to get access to over 36 million parking tickets written between 2003 and 2016. Matt goes on to explain Chicago’s parking ticket database, how he processed the data, analyzed it, and in one location got Chicago to put up additional “No Parking” signs to reduce parking tickets in that spot by 50%. That is most definitely using analytics for a great cause!
But let’s get back to that data for a second, that’s what really interests me. Matt shared his raw data for others to analyze, but it was formatted as a PostgreSQL dump. Now PostgreSQL is a great tool with an even greater price, but it’s not always the easiest to use. After spinning up a Linux VM and spending hours setting everything up as best I could, I still couldn’t get the dump to restore properly. Apparently I didn’t have all the exact versions of certain extensions installed, and because of that the tables couldn’t be loaded. Grrrr.
Bob has our backs, though, and has a properly-formatting, normalized parking ticket data set that weighs in at about 500MB.
The greatest stumbling block our respondents identified as hindering their attempts at better utilizing data is one that has existed for some time but seems to have worsened as data volumes have grown – data silos. Only 2 percent of our respondents considered their business to be completely effective at data sharing – for the rest, data silos are a real problem.
The causes for this are numerous, and span inconsistency of systems being used (42 percent), different data formats (38 percent), and a lack of coordinated data strategy (37 percent). On top of this, over a third highlight a lack of technology integration (36 percent) and/or legacy technology barriers (36 percent) as blocking attempts to effectively share data.
My first response is to say that this is in part due to the growth of microservices architecture, which seems to push data siloing. But at the same time, this has been the case for a long time, so I don’t think it’s either a necessary or a sufficient explanation.
Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because it was “easy,” which it was, if you ignore all the work needed to make it go fast. It seemed pretty clear to me that it could have been written in Java to run on a single machine with a much smaller staff.
One definition of “big data” is “Data that is too big to fit on one machine.” By that definition what is “big data” for one language is plain-old “data” for another. Java, with it’s efficient memory management, high performance, and multi-threading can get a lot done on one machine. To do data science in Java, however, you need data science tools: Tablesaw is an open-source (Apache 2) Java data science platform that lets users work with data on a single machine. It’s a dataframe and visualization framework. Most data science currently done in clusters could be done on a single machine using Tablesaw paired with a Java machine learning library like Smile.
But you don’t have to take my word for that.
There are some interesting thoughts in this post, but there are limits to what a single machine can do.
The other day, I had a problem with some data that I never dreamed I would ever see. In a case insensitive database, in a table’s column that was case insensitive, the customer was using the data as case sensitive. Firstly, let’s just go ahead and say it. “This was a sucky implementation.” But as is common, in my typical role as a data architect in the data warehousing team, I get to learn all sorts of interesting techniques for finding and dealing with “data” that has been used in “interesting” ways.
What is kind of interesting is actually figuring out what that duplicated data was. The case that I was dealing with wasn’t a kind of useful packed surrogate value, where you may use a base 62 number, with a-z, A-Z and 0-9 as characters. So 1, 2, … , 9, 0, a, b, c, … x, y, z, A, B.. etc. 1A1 is a different value in that sequence than 1a1, and is greater . Neat technique, and one that I have been threatening to develop using a SEQUENCE object, where you can pack in a lot of sequential data in a small number of bytes. No, this wasn’t a useful case such as this, in this case, one value was lower case, another had leading capitals. So perhaps “active customer” and “Active Customer”. Yeah, seriously, they meant different things.
Louis shows some of the nuance required in making this work.
I’ve written extensively about the tremendous potential for big data in healthcare to drive enormous changes in how we keep people healthy for longer. It goes without saying however that all data is not created equal, and just having a large sample is not always sufficient to get the best insights.
If we needed reminding, a reminder comes via a recent study from the University of California, Berkeley. It suggests that things like emotion, behavior, and physiology vary hugely between individuals, therefore having an average over a large dataset can still produce a ‘norm’ that is wide of the mark for individuals.
“If you want to know what individuals feel or how they become sick, you have to conduct research on individuals, not on groups,” the researchers say. “Diseases, mental disorders, emotions, and behaviors are expressed within individual people, over time. A snapshot of many people at one moment in time can’t capture these phenomena.”
Variance is important.