
Curated SQL Posts

Taxi Cab Data On SQLite And Parquet

Mark Litwintschik loads the 1.1 billion rows of New York City taxi data into a SQLite database, using data stored in Parquet files living on HDFS:

The dataset used in this benchmark has 1.1 billion records, 51 columns and is 500 GB in size when in uncompressed CSV format. Instructions on producing the dataset can be found in my Billion Taxi Rides in Redshift blog post. The CSV files were converted into Parquet format using Hive and Snappy compression on an AWS EMR cluster. The conversion resulted in 56 Parquet files which take up 105 GB of space.

Where decompression is I/O or network bound it makes sense to keep the compressed data as compact as possible. That being said, there are cases where decompression is compute bound and compression schemes like Snappy play a useful role in lowering the overhead.

I’ve downloaded the Parquet files to my local file system and imported them onto HDFS. Since this is all running on a single SSD drive I’ve set the HDFS replication factor to 1.

It’s not the fastest result I’ve seen from Mark’s work, but I was impressed that SQLite could take that abuse.
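
The Parquet-to-SQLite step is easy to sketch at small scale. Here is a minimal illustration using the arrow and RSQLite packages; it is my own sketch rather than Mark’s HDFS-backed pipeline, and the local file name is a placeholder:

library(arrow)    # Parquet reader
library(DBI)
library(RSQLite)  # SQLite driver for DBI

# read one Parquet file into a data frame (hypothetical local file name)
trips <- read_parquet("trips.parquet")

# append it to a SQLite database; repeat per file to build up the full table
con <- dbConnect(SQLite(), "taxi.db")
dbWriteTable(con, "trips", trips, append = TRUE)
dbDisconnect(con)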


Studying Performance Problems With Web Applications

Adrian Colyer reviews a paper on fixing performance problems with ORMs:

This is a fascinating study of the problems people get into when using ORMs to handle persistence concerns in their web applications. The authors study real-world applications and distil a catalogue of common performance anti-patterns. There are a bunch of familiar things in the list, and a few that surprised me with the amount of difference they can make. By fixing many of the issues that they find, Yang et al. are able to quantify how many lines of code it takes to address the issue, and what performance improvement the fix delivers.

Much of this is straightforward, but worth the read.
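
The canonical example of this kind of anti-pattern is issuing one query per row in a loop instead of a single set-based query (the well-known N+1 problem). As a hypothetical illustration in R with DBI, my own sketch rather than code from the paper:

library(DBI)
library(RSQLite)

# toy in-memory database: one orders row per purchase
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "customers", data.frame(id = 1:3, name = c("a", "b", "c")))
dbWriteTable(con, "orders", data.frame(customer_id = c(1, 1, 2, 3, 3, 3),
                                       total = c(10, 20, 5, 7, 8, 9)))

# anti-pattern: one query per customer (N+1 round trips)
customers <- dbGetQuery(con, "SELECT id, name FROM customers")
totals_slow <- sapply(customers$id, function(cid) {
  dbGetQuery(con, "SELECT SUM(total) AS t FROM orders WHERE customer_id = ?",
             params = list(cid))$t
})

# fix: a single join and aggregate, one round trip
totals_fast <- dbGetQuery(con, "
  SELECT c.name, SUM(o.total) AS total
  FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
  GROUP BY c.name")

dbDisconnect(con)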


Graphics In R

David Smith is following the kerfuffle that Edward Tufte unleashed on Twitter recently:

While graphics guru Edward Tufte recently claimed that “R coders and users just can’t do words on graphics and typography” and need additional tools to make graphics that aren’t “clunky”, data journalists at major publications beg to differ. The BBC has been creating graphics “purely in R” for some time, with a typography style matching that of the BBC website. Senior BBC Data Journalist Christine Jeavans offers several examples, including this chart of life expectancy differences between men and women:

I think Tufte’s off base here.
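
For what it’s worth, the typography side is mostly a matter of ggplot2 theme settings (the BBC team packages its house style in the bbplot package). A minimal sketch; the font family here is an assumption, so substitute whatever is installed on your system:

library(ggplot2)

# built-in mtcars data, with titles and fonts set explicitly
ggplot(mtcars, aes(wt, mpg)) +
  geom_point(colour = "#1380A1", size = 3) +
  labs(
    title    = "Heavier cars go fewer miles per gallon",
    subtitle = "Fuel economy vs. weight, 1974 Motor Trend data",
    caption  = "Source: datasets::mtcars"
  ) +
  theme_minimal(base_family = "Helvetica") +   # assumed font; pick any installed family
  theme(
    plot.title    = element_text(face = "bold", size = 18),
    plot.subtitle = element_text(size = 12, colour = "grey30"),
    plot.caption  = element_text(size = 8, colour = "grey50")
  )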


Just Update Those Servers Already

Randolph West wants none of your excuses:

Folks, we all like to make sure we’re doing our level best to make things work smoothly.

So why am I staring at someone’s server that has never been updated since it was first set up almost three years ago?

Do better, so that I don’t have to yell at you. Seriously.

When we ignore updates, we are ignoring preventable catastrophic problems; we are ignoring fixes to security bugs, performance bugs, and data corruption bugs. Each one of these things could give you a really bad day. In two out of three cases it might even be a career-limiting move.

There are risks to patching servers, but the downside risk of not patching tends to be much larger, and administrators can mitigate update risks through redundancy, automation, and a rollback plan. It’s more work than not patching, but the outcome is much better.


Scaling Azure Analysis Services

Chris Seferlis helps us decide between scaling up and scaling out with Azure Analysis Services:

Some of you may not know when or how to scale up your queries or scale out your processing. Today I’d like to help with understanding when and how using Azure Analysis Services. First, you need to decide which tier you should be using. You can do that by looking at the QPUs (Query Processing Units) of each tier on Azure. Here’s a quick breakdown:

  • Developer Tier – gives you up to 20 QPUs

  • Basic Tier – is a mid-scale tier, not meant for heavy loads

  • Standard Tier (currently the highest available) – allows you more capability and flexibility

Read on for some pointers.


OFFSET-FETCH Versus ROWNUM In Oracle

Lukas Eder compares OFFSET-FETCH against ROWNUM for grabbing an ordered sub-selection of rows in Oracle:

Now, while the SQL transformation from FETCH FIRST to ROW_NUMBER() filtering is certainly correct, the execution plan doesn’t really make me happy. Consider the ROWNUM based query:

---------------------------------------------------------
| Id  | Operation                     | Name    | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT              |         |       |
|*  1 |  COUNT STOPKEY                |         |       |
|   2 |   VIEW                        |         |     1 |
|   3 |    TABLE ACCESS BY INDEX ROWID| FILM    |  1000 |
|   4 |     INDEX FULL SCAN           | PK_FILM |     1 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(ROWNUM=1)

And compare that to the FETCH FIRST query:

-------------------------------------------------
| Id  | Operation                | Name | Rows  |
-------------------------------------------------
|   0 | SELECT STATEMENT         |      |       |
|*  1 |  VIEW                    |      |     1 |
|*  2 |   WINDOW SORT PUSHED RANK|      |  1000 |
|   3 |    TABLE ACCESS FULL     | FILM |  1000 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=1)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "FILM"."FILM_ID")<=1)

Lukas digs into this and is not the biggest fan of OFFSET-FETCH.  On the SQL Server side, my anecdotal experience has been that it doesn’t perform nearly as well as you’d like either.


Counting Arguments In R

Neil Saunders shares methods for interrogating argument lists in R:

“Some R functions have an awful lot of arguments”, you think to yourself. “I wonder which has the most?”

It’s not an original thought: the same question as applied to the R base package is an exercise in the Functions chapter of the excellent Advanced R. Much of the information in this post came from there.

There are lots of R packages. We’ll limit ourselves to those packages which ship with R, and which load on startup. Which ones are they?

It’s a fun exercise and helpful for learning a bit more about how to work with arguments when metaprogramming in R.
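
As a quick sketch of the underlying idea (mine, not Neil’s code): formals() returns a function’s argument list, so counting arguments across the base package only takes a few lines:

# collect every function in the base package and count its formal arguments;
# formals() returns NULL for primitives, which length() counts as zero
base_funs  <- Filter(is.function, mget(ls(baseenv()), envir = baseenv()))
arg_counts <- vapply(base_funs, function(f) length(formals(f)), integer(1))

# the base functions with the most arguments, in descending order
head(sort(arg_counts, decreasing = TRUE), 10)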


Analyzing Federal Reserve Data With Ordinary Least Squares

Sam Shum has a tutorial walking us through extracting and analyzing data from the St. Louis Federal Reserve’s FRED economic database:

Download specific macroeconomic data from the FRED St. Louis economic databases and ETL the data. Many other data series can be found on the FRED website.

# get_fred_series() is assumed here to come from the alfred package; the
# original tutorial may source this helper differently
library(alfred)

# example date range; the original tutorial defines its own startdate and enddate
startdate <- "2000-01-01"
enddate   <- "2018-12-31"

# get unemployment data time series from FRED St. Louis
dfunrate <- get_fred_series("UNRATE", "unrate", observation_start = startdate, observation_end = enddate)

# get University of Michigan consumer sentiment index data time series from FRED St. Louis
dfumcsent <- get_fred_series("UMCSENT", "umcsent", observation_start = startdate, observation_end = enddate)

# combine the two time series data into one data frame
dfall <- cbind(dfunrate,dfumcsent)

# strip or remove redundant month field from data downloaded from FRED St. Louis
dfall <- dfall[,c(1,2,4)]

# build an index (1..n) over the rows of the data frame
mdx <- (1:nrow(dfall))

# convert FRED date field from string to R's date type
dfall$date <- as.Date(dfall$date)
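
The excerpt stops at data preparation. As a rough sketch of the ordinary least squares step itself (my own illustration rather than Sam’s code, and assuming the column names produced by the calls above), the fit is a one-liner with lm():

# regress consumer sentiment on the unemployment rate via OLS
fit <- lm(umcsent ~ unrate, data = dfall)
summary(fit)   # coefficients, standard errors, and R-squared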

There’s a nice chart builder on the FRED website too, but it’s good to be able to grab the data on your own.
