R Model Compression

Kevin Feasel



I have a post showing off some of the value of compressing R models:

So right now, we’re burning roughly 200K per model.  My stated goal is to be able to store several years worth of data for 10 million products.  Let’s say that I need 10 million products in ProductModel and 1 billion rows in ProductModelHistory.  That means that we’d end up with 1.86 TB of data in the ProductModel table and 186 TB in ProductModelHistory.  This seems…excessive.

As a result, I decided to try using the COMPRESS() function in SQL Server 2016.  The COMPRESS function simply uses GZip compression.  Yeah, there are compression algorithms which tend to be more compact (e.g., bz2 or 7z), but GZip is relatively CPU efficient and I can wrap my SQL statements with COMPRESS() and DECOMPRESS() and not have to change any calling code; I just need to update the two stored procedures I use to insert and then retrieve product models.

Most of the time, it’s not a big deal.  But once you start talking hundreds of gigabytes or in my case, a couple hundred terabytes, it’s definitely worth compressing this data.

Related Posts

Linear Discriminant Analysis

Jake Hoare explains Linear Discriminant Analysis: Linear Discriminant Analysis takes a data set of cases (also known as observations) as input. For each case, you need to have a categorical variable to define the class and several predictor variables (which are numeric). We often visualize this input data as a matrix, such as shown below, with each case being a row and each variable a column. In this […]

Read More

Azure Data Lake Store File Management With httr

Leila Etaati shows how to generate RESTful statements in R using httr: In this post, I am going to share my experiment in how to do file management in ADLS using R studio, to do this you need to have below items 1. An Azure subscription 2. Create an Azure Data Lake Store Account 3. […]

Read More


July 2017
« Jun Aug »