While this data is interesting, I’m more curious about how compression affects my everyday SELECT queries. I have a set of three stored procedures that each have one SELECT query, so that each index is used. I created these procedures for each table, and then wrote a script to pull values for first and last names to use for testing. Here is the script to create the procedures.
Once we have the stored procedures created, we can run the script below to call them. Kick this off and then wait a couple minutes…
To me, the COMPRESS function is most useful for compressing information you tend not to search through but need to keep the in the database, like HTML markup or long descriptions.
Page Compression is what I like to refer to as “compression for real this time” as it goes well beyond the smart storage method of row and uses patterns/repeating values to condense the stored data.
First, to gain a better understanding of this method, check out a simple representation of a page of data. This is illustrated below in Figure 1. You’ll notice that there are some repeating values (e.g. SQLR) and some repeated strings of characters (e.g. SSSLL).
I really appreciate getting an idea of what kind of data does not compress well. You’d think auto-incrementing numbers would be another scenario, but Melissa explains how that’s not necessarily the case.
COMPRESS()function takes an input of string or binary data, and applies the gzip algorithm to it. It returns a value of type
varbinary(max). In essence, instead of storing string or binary data, you can gzip it up and store it in a varbinary(max) column. There’s also a
DECOMPRESS()function for when you are reading the data and need to unzip it.
This costs some CPU, but gzip can save quite a bit of space. How much space, and whether it’s worth the CPU cost will vary depending on your data and workload. In this blog post, we’ll take a look at one table. We’ll look at the space savings we get out of using
COMPRESS(), and we’ll look at the effort necessary to implement it.
Read on for Andy’s test and thoughts.
That first one is a cracker – it hit me once when compressing a SQL Server table (600M+ rows) on a 64 core Enterprise SQL Server. After benchmarking several other data compression activities I thought I had a basic “rule of thumb” (based on GB data size and number of rows)… of which just happened to be coincidence!
This also begs the question of why would you use low selectivity indexes? Well I can think of a few cases – but the one which stands out the most is the identification of a small number of rows within a greater collection – such as an Index on TYPE columns (ie; [ProcessingStatusFlag] CHAR(1) = [P]rocessed, [U]nprocessed, [W]orking, [F]ailed, etc)
… AND SO – lets do some testing to validate this puppy!
There’s a significant difference here, so check out Rolf’s post for the details.
You can think of page compression as doing data deduplication within a page. If there is some value repeated in multiple spots on a page, then page compression can store the repetitive value only once, and save some space.
Page compression is actually a process that combines three different compression algorithms into a bigger algorithm. Page compression applies these three algorithms in order:
1) Row compression
2) Prefix compression
3) Dictionary compression
Page compression is my go-to compression option, typically. There are some cases in which it doesn’t work well, so check beforehand (start with sp_estimate_data_compression_savings), but I’ve had good luck with page compression.
You can think of row compression as working by treating certain fixed-length data types as variable-length data types. By removing certain metadata, NULL and 0 values, and the padding of fixed-length values, SQL Server can reduce the total size of a row.
The easiest way to think of it is that char(n) no longer takes n bytes for every row, but instead gets treated more like varchar(n) where the storage used varies for each value. The behavior for each data type varies, with some data types getting more or less (or no) savings compared to others.
Row-level compression is the “safer” of the two primary compression options, but I almost never use it. That might just be a function of the my particular workloads, of course.
Reader Dick H. posted a comment on my last version of this script stating that he got an error when this was run against tables containing sparse columns. Data compression does not support tables with sparse columns, so they should be excluded from this process. I’ve modified this script to correct this. I don’t have any tables with sparse columns in my environment, so thanks to Dick for pointing this out!
For instructions on using this script, look here.
This is a very useful script to have in your back pocket.
The obvious benefit is that compressed data takes up less space on disk. Since you probably keep multiple copies of your database (multiple environments, DR, backups, etc), this space savings can really add up. High-performance enterprise-class storage is expensive. Compressing your data to reduce footprint can have a very real benefit to your budget. I once worked on an SAP ERP database that was 12TB uncompressed, and was reduced to just under 4TB after we implemented compression.
My experience with compression is that the benefit vastly outweighs the cost. Do your own testing, of course.
Before we jump right into enabling either row or page compression, we can actually estimate the savings of each to determine which will provide us with the most savings on storage. Since page compression includes row compression, we will start with row compression and the estimated savings.
In practice, data compression is extremely valuable and in most circumstances, the benefits outweigh the costs. In certain workloads, you might even see CPU usage go down.
There might be a theme to today’s posts…
Power Pivot would end up storing a table that looks more like the black table above (rather than the blue one), keeping just the minimum amount of information it needs to rebuild the real table of data on the fly when and if required. If the black RLE table ended up taking more space than the original column of data, then there would be no benefit of RLE and the original column of data would be stored. Power Pivot may use one or more of the other compression techniques used as well as, or instead of RLE – it all depends on the specifics of the actual data.
This is a very interesting look at ways the Power Pivot team optimize data storage.