What happens if I use data compression and backup compression, do I get double compression?
This is a great question, and without diving too deeply into how backup compression works I’m going to do a simple experiment on the WideWorldImporters database. I’ve restored this database to my local SQL Server 2016 instance and I’m simply going to back it up several times under different conditions.
After restoring the database it’s about 3GB in size, so our testing will be on a reasonably small database. It would be interesting to see how the results change as the database size increases, perhaps a future blog post.
Click through for the answer.
It’s certainly more difficult to come up with a demo that works without string columns, but consider how the page compression algorithm works. Data can be compressed on page basis, which includes both multiple rows and multiple columns. That means that page compression can achieve a higher compression ratio when a row has identical values in different columns. Columnstore is only able to compress on an individual column basis and you won’t directly see better compression with repeated values in different columns for a single row (as far as I know).
Interestingly, Joe also comes up with a scenario where row-level compression can beat columnstore even without string values. All this said, the normal case when dealing with non-string data is that columnstore tends to compress a lot better.
Do not forget about the certificate! Warning: The certificate used for encrypting the database encryption key has not been backed up. Imagine if you need to recover the backup and you can’t? You will get the dreaded thumbprint error.
Msg 33111, Level 16, State 3, Line 25 Cannot find server certificate with thumbprint ‘0x78FAB5A2A5D593FD3C4E163C90B745F70AB51233’. Msg 3013, Level 16, State 1, Line 25
RESTORE DATABASE is terminating abnormally.
So make sure you respect this certificate (and the key) and back it up and re-create them on the target server for a successful restore.
In SQL Server 2016 and 2017, there’s no reason not to encrypt backups; the marginal cost is practically nil even if you’re low enough on disk space that you need to do backup compression.
We can write procedure check periodically rather a table will benefit from compression or not. There are a few tricks though:
- If you have sparse columns you cannot compress the table, we check for that in lines 70-74.
- Next if is a index type of 1 (a clustered index) it cannot contain LOB data types image, ntext, or text. Which these days shouldn’t be a problem, they have been after all deprecated, so you aren’t using them. We check for these in lines 76-84.
- Finally, we perform the same test for LOB data on index type of 2 (nonclustered indexes) in lines 85-95.
If everything checks out OK we call our second procedure to calculate the percentage of space we would space and if we would save more than 25% then we create script to compression the index. Now we have a @Debug parameter in here so it can just show us what code it would execute to determine what it will do.
Click through for the code for the two procedures Tracy has created.
While this data is interesting, I’m more curious about how compression affects my everyday SELECT queries. I have a set of three stored procedures that each have one SELECT query, so that each index is used. I created these procedures for each table, and then wrote a script to pull values for first and last names to use for testing. Here is the script to create the procedures.
Once we have the stored procedures created, we can run the script below to call them. Kick this off and then wait a couple minutes…
To me, the COMPRESS function is most useful for compressing information you tend not to search through but need to keep the in the database, like HTML markup or long descriptions.
Page Compression is what I like to refer to as “compression for real this time” as it goes well beyond the smart storage method of row and uses patterns/repeating values to condense the stored data.
First, to gain a better understanding of this method, check out a simple representation of a page of data. This is illustrated below in Figure 1. You’ll notice that there are some repeating values (e.g. SQLR) and some repeated strings of characters (e.g. SSSLL).
I really appreciate getting an idea of what kind of data does not compress well. You’d think auto-incrementing numbers would be another scenario, but Melissa explains how that’s not necessarily the case.
COMPRESS()function takes an input of string or binary data, and applies the gzip algorithm to it. It returns a value of type
varbinary(max). In essence, instead of storing string or binary data, you can gzip it up and store it in a varbinary(max) column. There’s also a
DECOMPRESS()function for when you are reading the data and need to unzip it.
This costs some CPU, but gzip can save quite a bit of space. How much space, and whether it’s worth the CPU cost will vary depending on your data and workload. In this blog post, we’ll take a look at one table. We’ll look at the space savings we get out of using
COMPRESS(), and we’ll look at the effort necessary to implement it.
Read on for Andy’s test and thoughts.
That first one is a cracker – it hit me once when compressing a SQL Server table (600M+ rows) on a 64 core Enterprise SQL Server. After benchmarking several other data compression activities I thought I had a basic “rule of thumb” (based on GB data size and number of rows)… of which just happened to be coincidence!
This also begs the question of why would you use low selectivity indexes? Well I can think of a few cases – but the one which stands out the most is the identification of a small number of rows within a greater collection – such as an Index on TYPE columns (ie; [ProcessingStatusFlag] CHAR(1) = [P]rocessed, [U]nprocessed, [W]orking, [F]ailed, etc)
… AND SO – lets do some testing to validate this puppy!
There’s a significant difference here, so check out Rolf’s post for the details.
You can think of page compression as doing data deduplication within a page. If there is some value repeated in multiple spots on a page, then page compression can store the repetitive value only once, and save some space.
Page compression is actually a process that combines three different compression algorithms into a bigger algorithm. Page compression applies these three algorithms in order:
1) Row compression
2) Prefix compression
3) Dictionary compression
Page compression is my go-to compression option, typically. There are some cases in which it doesn’t work well, so check beforehand (start with sp_estimate_data_compression_savings), but I’ve had good luck with page compression.
You can think of row compression as working by treating certain fixed-length data types as variable-length data types. By removing certain metadata, NULL and 0 values, and the padding of fixed-length values, SQL Server can reduce the total size of a row.
The easiest way to think of it is that char(n) no longer takes n bytes for every row, but instead gets treated more like varchar(n) where the storage used varies for each value. The behavior for each data type varies, with some data types getting more or less (or no) savings compared to others.
Row-level compression is the “safer” of the two primary compression options, but I almost never use it. That might just be a function of the my particular workloads, of course.