112 Million Cab Rides In Azure SQL Data Warehouse

Kevin Feasel

2018-10-23

Cloud, Data

Derik Hammer wants a real test of Azure SQL Data Warehouse:

The method that I liked the most and finally settled on was to use a public dataset. I wanted data which was skewed in real ways and did not require a lot of work to massage. Microsoft has a great listing of public datasets here.

I decided to go with the NYC Taxi and Limousine Commission (TLC) Trip Record Data. Data is available for most taxi and limousine fares with pickup/drop-off and distance information between January 2009 and June 2018. This includes data for Yellow cab, Green cab, and for hire vehicles. Just the Yellow cab data from 01/2016 – 06/2018 is over 112,000,000 records (24 GBs) and they download into easy to import comma separated values (CSV) files.

Read on to see how you can set it up yourself.  As Derik points out at the end, though, this is still one big table, but there are a few columns which can lead to dimensions, things like rate code, location, and payment type.

Related Posts

Economic Articles With Data Included

Sebastian Kranz has a Shiny app to help you find economic papers with included data: One gets some information about the size of the data files and the used code files. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with […]

Read More

Azure Data Lake Store Gen2

James Serra gives us the low-down on Azure Data Lake Store Gen2 now that it is generally available: When to use Blob vs ADLS Gen2New analytics projects should use ADLS Gen2, and current Blob storage should be converted to ADLS Gen2, unless these are non-analytical use cases that only need object storage rather than hierarchical storage […]

Read More

Categories

October 2018
MTWTFSS
« Sep Nov »
1234567
891011121314
15161718192021
22232425262728
293031