112 Million Cab Rides In Azure SQL Data Warehouse

Kevin Feasel

2018-10-23

Cloud, Data

Derik Hammer wants a real test of Azure SQL Data Warehouse:

The method that I liked the most and finally settled on was to use a public dataset. I wanted data which was skewed in real ways and did not require a lot of work to massage. Microsoft has a great listing of public datasets here.

I decided to go with the NYC Taxi and Limousine Commission (TLC) Trip Record Data. Data is available for most taxi and limousine fares with pickup/drop-off and distance information between January 2009 and June 2018. This includes data for Yellow cab, Green cab, and for hire vehicles. Just the Yellow cab data from 01/2016 – 06/2018 is over 112,000,000 records (24 GBs) and they download into easy to import comma separated values (CSV) files.

Read on to see how you can set it up yourself.  As Derik points out at the end, though, this is still one big table, but there are a few columns which can lead to dimensions, things like rate code, location, and payment type.

Related Posts

Containers and Data

Randolph West argues that you should keep data and containers separated: Where it gets interesting is that the SQL Server container is also where the database files are stored by default. I raised a point (which Grant and others have already noted in the past) that persisted storage volumes allow us to throw away a SQL Server […]

Read More

DBAs in the Cloud

Brent Ozar argues that production DBAs will still be important even at cloud-only companies: One of my favorite recent examples was a company who came to me saying, “We’re spending about $2M per year in the cloud just on our databases alone. Can you help us reduce those costs?” Absolutely: with just a couple of […]

Read More

Categories

October 2018
MTWTFSS
« Sep Nov »
1234567
891011121314
15161718192021
22232425262728
293031