R Model Compression

Kevin Feasel

2017-07-20

R

I have a post showing off some of the value of compressing R models:

So right now, we’re burning roughly 200K per model.  My stated goal is to be able to store several years’ worth of data for 10 million products.  Let’s say that I need 10 million products in ProductModel and 1 billion rows in ProductModelHistory.  That means that we’d end up with 1.86 TB of data in the ProductModel table and 186 TB in ProductModelHistory.  This seems…excessive.
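If you want to sanity-check those numbers, a quick back-of-the-envelope query reproduces them; the 200 KB-per-model figure and the row counts come straight from the quote above, and the math assumes 1024-based terabytes:

-- Rough sizing check: 200 KB per model, 10 million current models, 1 billion history rows.
SELECT
    CAST(200.0 * 1024 * 10000000   / POWER(1024.0, 4) AS DECIMAL(10, 2)) AS ProductModelTB,         -- ~1.86 TB
    CAST(200.0 * 1024 * 1000000000 / POWER(1024.0, 4) AS DECIMAL(10, 2)) AS ProductModelHistoryTB;  -- ~186.26 TB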

As a result, I decided to try using the COMPRESS() function in SQL Server 2016.  The COMPRESS() function simply applies GZip compression.  Yeah, there are compression algorithms which tend to produce more compact output (e.g., bz2 or 7z), but GZip is relatively CPU-efficient, and I can wrap my SQL statements with COMPRESS() and DECOMPRESS() without changing any calling code; I just need to update the two stored procedures I use to insert and retrieve product models.
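To see the shape of that wrapping, here is a minimal sketch of what the two procedures might look like.  The table, column, and procedure names (dbo.ProductModel, dbo.SaveProductModel, dbo.GetProductModel) are placeholders for illustration, not the actual schema from the post:

-- Hypothetical schema: the serialized R model is stored GZip-compressed.
CREATE TABLE dbo.ProductModel
(
    ProductID INT NOT NULL PRIMARY KEY,
    Model VARBINARY(MAX) NOT NULL   -- compressed, serialized model
);
GO

CREATE PROCEDURE dbo.SaveProductModel
    @ProductID INT,
    @Model VARBINARY(MAX)           -- raw serialized model from the calling code
AS
BEGIN
    SET NOCOUNT ON;
    -- COMPRESS() GZips the payload on the way in; callers don't change.
    INSERT INTO dbo.ProductModel (ProductID, Model)
    VALUES (@ProductID, COMPRESS(@Model));
END;
GO

CREATE PROCEDURE dbo.GetProductModel
    @ProductID INT
AS
BEGIN
    SET NOCOUNT ON;
    -- DECOMPRESS() un-GZips on the way out, so callers still see the original bytes.
    SELECT ProductID, DECOMPRESS(Model) AS Model
    FROM dbo.ProductModel
    WHERE ProductID = @ProductID;
END;
GO

Since the compression happens entirely inside the two procedures, the calling application keeps sending and receiving the same varbinary payloads it always did.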

Most of the time, it’s not a big deal.  But once you start talking hundreds of gigabytes or, in my case, a couple hundred terabytes, it’s definitely worth compressing this data.

