There are problems with K-anonymous datasets, namely the homogeneous pattern attack, and the background knowledge attack, details of which are in my original post. A slightly different approach to anonymising public datasets comes in the form of ℓ -diversity, a way of introducing further entropy/diversity into a dataset.
A sensitive data record is made of the following microdata types: the ID; any Key Attributes; and the confidential outcome attribute(s). ℓ -diversity seeks to extend the equivalence classes that we created using K-anonymity by generalisation and masking of the quasi-identifiers (the QI groups) to the confidential attributes in the record as well. The ℓ -diversity principle demands that, in each QI-group, at most 1/ ℓ of its tuples can have an identical sensitive attribute value.
L-diversity is not perfect either, but Duncan gives a good explanation of the topic.
One gets some information about the size of the data files and the used code files. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without the need to download the whole data set.
The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master Thesis in form of an interactive analysis with RTutor. You could also generate a topic list for a seminar, in which students shall replicate some key findings of a resarch article.
I like this idea, particularly because it promotes the notion that if you’re going to write a paper based on a data set, you ought to provide the data set. There are too many cases of typos or accidental miscodings which take an interesting result and render it mundane (or sometimes even the exact opposite of what the paper reads). H/T R-Bloggers
In order to show you the solution, I want to build up a reasonable sized sample. Any solution looks great when reading five records, but let’s kick that up a notch. Or, more specifically, a million notches: I’m going to use a CTE tally table and load 5 million rows.
I want some realistic looking data, so I’ve adapted Dallas Snider’s strategy to build a data set which approximates a normal distribution.
Because this is a little complicated, I wanted to take the time and explain the data load process in detail in its own post, and then apply it in the follow-up post. We’ll start with a relatively small number of records for this demonstration: 50,000. The reason is that you can generate 50K records almost instantaneously but once you start getting a couple orders of magnitude larger, things slow down some.
If you do custom data generation for lower environments, I’d recommend checking this out. Your production data probably doesn’t follow a normal distribution exactly, but a normal distribution is probably closer to reality than the uniform distribution you get with functions like RAND().
The “Maintained by Kaggle” badge means that Kaggle is now and will continue to actively maintain that dataset. This includes regular updates to descriptions and metadata, quicker response rates in discussion, and accurate current data from the source. Our goal is to create seamless workflows that allow everyone to do data science on Kaggle and be confident in the data they work with.
They have several data sets available from different open data projects for several cities, as well as NOAA and the World Bank. If you’re looking for data sets to play with, this is a good option.
The idea of k-anonymity is that every database record appears at least k times. If you have a lot of records and few fields, your value of k could be high. But as you get more fields, it becomes more likely that a combination of fields is unique. If k = 1, then k-anonymity offers no anonymity.
Another problem with k-anonymity is that it doesn’t offer group privacy. A database could be k-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a homogeneity attack.
This is intended to be a “get you thinking” type of post, and John does have links to related posts which flesh things out a bit more.
By large databases I’m roughly meaning databases with several hundred tables, and I usually see a lot of these tables with several hundred GB’s of data in them.
When I generally ask about the reason for no foreign key, I’m told
- they add overhead
- they give no benefit
- we can’t enter our data properly when we have them
The last one in the above list is generally down to poor modelling – an example being entering a later part of a financial transaction before the first part of the financial transaction has taken place. Once both parts of the financial transaction have taken place then the database is left in a consistent state – BUT, that generally being handled by the app NOT the database – OUCH!
There are times where key constraints are too much—often-updating fact tables might be one such scenario. And some of “too much” comes down to hardware quality. But for the most part, those key constraints are one of the clearest forms of database documentation available, not to mention their normal benefits.
The method that I liked the most and finally settled on was to use a public dataset. I wanted data which was skewed in real ways and did not require a lot of work to massage. Microsoft has a great listing of public datasets here.
I decided to go with the NYC Taxi and Limousine Commission (TLC) Trip Record Data. Data is available for most taxi and limousine fares with pickup/drop-off and distance information between January 2009 and June 2018. This includes data for Yellow cab, Green cab, and for hire vehicles. Just the Yellow cab data from 01/2016 – 06/2018 is over 112,000,000 records (24 GBs) and they download into easy to import comma separated values (CSV) files.
Read on to see how you can set it up yourself. As Derik points out at the end, though, this is still one big table, but there are a few columns which can lead to dimensions, things like rate code, location, and payment type.
A few weeks ago I came across this blog post by Matt Chapman. Matt filed FOIA requests with the City of Chicago and, after multiple attempts, was able to get access to over 36 million parking tickets written between 2003 and 2016. Matt goes on to explain Chicago’s parking ticket database, how he processed the data, analyzed it, and in one location got Chicago to put up additional “No Parking” signs to reduce parking tickets in that spot by 50%. That is most definitely using analytics for a great cause!
But let’s get back to that data for a second, that’s what really interests me. Matt shared his raw data for others to analyze, but it was formatted as a PostgreSQL dump. Now PostgreSQL is a great tool with an even greater price, but it’s not always the easiest to use. After spinning up a Linux VM and spending hours setting everything up as best I could, I still couldn’t get the dump to restore properly. Apparently I didn’t have all the exact versions of certain extensions installed, and because of that the tables couldn’t be loaded. Grrrr.
Bob has our backs, though, and has a properly-formatting, normalized parking ticket data set that weighs in at about 500MB.
The greatest stumbling block our respondents identified as hindering their attempts at better utilizing data is one that has existed for some time but seems to have worsened as data volumes have grown – data silos. Only 2 percent of our respondents considered their business to be completely effective at data sharing – for the rest, data silos are a real problem.
The causes for this are numerous, and span inconsistency of systems being used (42 percent), different data formats (38 percent), and a lack of coordinated data strategy (37 percent). On top of this, over a third highlight a lack of technology integration (36 percent) and/or legacy technology barriers (36 percent) as blocking attempts to effectively share data.
My first response is to say that this is in part due to the growth of microservices architecture, which seems to push data siloing. But at the same time, this has been the case for a long time, so I don’t think it’s either a necessary or a sufficient explanation.
Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because it was “easy,” which it was, if you ignore all the work needed to make it go fast. It seemed pretty clear to me that it could have been written in Java to run on a single machine with a much smaller staff.
One definition of “big data” is “Data that is too big to fit on one machine.” By that definition what is “big data” for one language is plain-old “data” for another. Java, with it’s efficient memory management, high performance, and multi-threading can get a lot done on one machine. To do data science in Java, however, you need data science tools: Tablesaw is an open-source (Apache 2) Java data science platform that lets users work with data on a single machine. It’s a dataframe and visualization framework. Most data science currently done in clusters could be done on a single machine using Tablesaw paired with a Java machine learning library like Smile.
But you don’t have to take my word for that.
There are some interesting thoughts in this post, but there are limits to what a single machine can do.