Category: Columnstore

Columnstore and a Scheduler per Partition

Published 2020-02-04 by Kevin Feasel

Aaron Bertrand continues looking at columnstore indexing:

The idea is I would have eight instances of an application (SQLQueryStress) that would each run on a dedicated scheduler, handling only the data destined for a specific partition / filegroup / data file, ~100 million rows at a time.

It was interesting seeing how to pull that off. Aaron also has some extended findings, so check it out.
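
To make the one-worker-per-partition idea concrete, here is a minimal Python sketch of routing rows to eight partitions. It is not Aaron's setup: the boundary values, the `partition_number` and `route` helpers, and the sample keys are all hypothetical. The only SQL Server behavior it assumes is that a RANGE RIGHT partition function places a boundary value in the partition to its right.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical boundaries: 7 boundary values define 8 partitions.
BOUNDARIES = [100, 200, 300, 400, 500, 600, 700]


def partition_number(key, boundaries=BOUNDARIES):
    """Return the 1-based partition for key, RANGE RIGHT style.

    A key equal to a boundary value lands in the partition to the right.
    """
    return bisect_right(boundaries, key) + 1


def route(keys, boundaries=BOUNDARIES):
    """Group keys by partition so each worker sees only its own partition."""
    work = defaultdict(list)
    for key in keys:
        work[partition_number(key, boundaries)].append(key)
    return dict(work)


batches = route([5, 100, 199, 450, 700, 999])
for part, keys in sorted(batches.items()):
    print(f"worker {part}: {keys}")
```

Each key list would then go to the single worker bound to that partition, so no two workers ever load into the same partition.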

Comments closed

Partitioning on Columnstore Table Loading

Published 2020-01-09 by Kevin Feasel

Aaron Bertrand continues a series around learning about columnstore indexes:

In part 1, I showed how both page and columnstore compression could reduce the size of a 1TB table by 80% or more. While I was impressed I could shrink a table from 1TB to 50GB, I wasn’t very happy with the amount of time it took (anywhere from 2 to 14 hours). With some tips graciously borrowed from folks like Joe Obbish, Lonny Niederstadt, Niko Neugebauer, and others, in this post I will try to make some changes to my original attempt to get better load performance. Since the regular columnstore index didn’t compress better than page compression on this data set, and took 13 hours longer to get there, I’ll focus solely on the more advanced solution using COLUMNSTORE_ARCHIVE compression.

Click through for part 2.


Columnstore Indexes in Azure SQL Database

Published 2019-12-31 by Kevin Feasel

Niko Neugebauer takes us through the columnstore offerings available in Azure SQL Database:

Almost 2 years ago (22nd of March 2018) in Columnstore Indexes – part 121 (“Columnstore Indexes on Standard Tier of Azure SQL DB”) I have already mentioned that Columnstore Indexes were available in Azure SQL Database in Standard 3 (S3) edition and higher, while people I meet keep on mentioning and believing that in order to get Columnstore Indexes one needs to use Premium editions.
Since that blog post a lot of time has passed and in the mean time we have got new tiers with new generations of provisioned General Purpose tiers (Generation 4, Generation 5, FSv2 Series & M Series) appearing, plus the Serverless Tier and not to forget the very promising Hyperscale tier … besides the Azure SQL Database Managed Instance of course, which has already been generally available for some time and the good old Elastic Pools which were never mentioned in original article.

It sounds like, on the whole, columnstore is a normal part of Azure SQL Database across the board—it’s not a special add-on feature.


Columnstore Versus Page Compression

Published 2019-12-27 by Kevin Feasel

Aaron Bertrand compares columnstore and page compression on a specific table:

Recently someone at work asked for more space to accommodate a rapidly growing table. At the time it had 3.75 billion rows, presented on 143 million pages, and occupying ~1.14TB. Of course we can always throw more disk at a table, but I wanted to see if we could scale this more efficiently than the current linear trend. Sounds like a great job for compression, right? But I also wanted to try out some other solutions, including columnstore – which people are surprisingly reluctant to try. I am no Niko, but I wanted to make an effort to see what it could do for us here.
Note that I am not focusing on reporting workload or other read query performance at this time – I merely want to see what impact I can have on storage (and memory) footprint of this data.
Here is the original table. I’ve changed table and column names to protect the innocent, but everything else is relatively accurate.

Page compression won, and I’ve got a pretty good idea why (though some of the diagnostic info is gone): Aaron has several VARCHAR and NVARCHAR columns, and those blow up the columnstore dictionary pretty fast. Aaron has more to go in this series, so stay tuned.
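
As a rough sanity check on the quoted figures, the following Python sketch works out rows per page and bytes per row. The row and page counts come from the quote. The 8 KB page size is SQL Server's standard data page size, and the `storage_summary` helper is a hypothetical name used only for this illustration.

```python
# Figures quoted in the post.
ROWS = 3_750_000_000
PAGES = 143_000_000
PAGE_BYTES = 8 * 1024  # SQL Server data pages are 8 KB


def storage_summary(rows, pages, page_bytes=PAGE_BYTES):
    """Derive coarse storage figures from row and page counts."""
    total_bytes = pages * page_bytes
    return {
        "total_tib": total_bytes / 2**40,   # binary terabytes
        "total_tb": total_bytes / 10**12,   # decimal terabytes
        "rows_per_page": rows / pages,
        "bytes_per_row": total_bytes / rows,
    }


summary = storage_summary(ROWS, PAGES)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The total lands in the same ballpark as the quoted ~1.14TB (the exact figure depends on rounding and on which definition of a terabyte is used), and it works out to roughly 26 rows per page, or a little over 300 bytes per row including page overhead.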


Why So Few Columnstore Indexes Around?

Published 2019-12-18 by Kevin Feasel

Grant Fritchey has a bit of a rant about people not using columnstore indexes as much as they should:

It was already common knowledge that columnstore indexes didn’t work for most of us.
Fact is, that’s not true. Now that we have clustered columnstore and non-clustered columnstore, you can go nuts. Most of your data access is through analytical channels? Awesome, use a clustered columnstore. Sometimes though, you need point lookups. Not a problem, add a nonclustered b-tree index to the clustered columnstore. Go here to learn more about Columnstore Indexes.
In short, today, we can completely orient our data storage with our principal data access. Yet, most people are not using these things at all.

One of my interview questions is about columnstore indexes. I’ve learned that I need to preface it with “What’s the latest version of SQL Server you’ve worked with?” A lot of people answer 2012. Even among the people who use 2016, the normal answer is that they haven’t learned about columnstore yet. And that goes back to Grant’s learning gap: it’s not that hard to grab a book on SQL Server 2019, spin up a Docker container, and dive in. Or watch a course, spin up a Docker container, and follow along. Or read a blog post, spin up a Docker container, and…well, you get the idea.


Get the Stack Overflow Columnstore Edition Database

Published 2019-11-15 by Kevin Feasel

Erik Darling has more for you on Stack Overflow + Columnstore:

If you want to download the database, here’s the magnet link for the torrent. I don’t have another means of distributing this; it’s too big of a file.
If you want the GitHub scripts to create and load data, head over here.

In addition, Erik has some quick queries showing table relationships in a world without foreign key constraints:

To get you started exploring the Stack Overflow column store database, here are some queries that show how tables are related.
The two main relationships are User Id, and Post Id.

Quick side note: joining together large columnstore indexed tables? Generally not the best idea.


Stack Overflow DB, Columnstore Edition

Published 2019-11-13 by Kevin Feasel

Erik Darling has started a new series. Part one is the intro:

I really wanted a version of the Stack Overflow data dump that was all clustered column store. In SQL Server 2016, that didn’t really work because of restrictions around MAX data types. In 2017 it did, but… If your data warehouse has a bunch of max data type columns, that’s bad and you should feel bad.
The problem here is that once you drop out the “big” columns (AboutMe from Users, Text from Comments, Body from Posts), the entire ~300GB database compressed down to about 6GB. That means if we want a realistically sized data warehouse, we’d need a “Make Big” script, like people used to use for Adventure Works before it went out of business.

Part 2, like a noble spirit, embiggens the smallest man:

One thing I’d love feedback on is advancing dates. Right now, the script doesn’t do that at all. I thought of different ways to handle it, but didn’t like any of them, mostly because of how it might mess with the partitioning function I’m using. I felt like I was overthinking it quite a bit, and decided to leave dates as-is, and only increment User and Post Ids.
A quick note: This script assumes that a database called StackOverflow will be the source of the loads. If you need to use a different version, that’s a manual change. I didn’t want to go down the dynamic SQL route here until I gauged popularity.

If you want to play along at home, you can grab the data dump from archive.org.


Data Retrieval Bug Fixed for Columnstore Indexes

Published 2019-10-11 by Kevin Feasel

Dmitri Korotkevitch takes us through an important bugfix in SQL Server:

The typical columnstore table is usually large and contains hundreds of millions or even billions of rows. Think about large fact tables in the data warehouses or huge transactional tables in OLTP systems. Those tables are usually partitioned. Besides usual reasons (Availability, Maintainability, etc), partitioning helps with the data load – it is easier to perform ETL in the staging table and import data through partition switch.
And here comes the problem. If you run OLTP query against partitioned clustered columnstore table and end up with the execution plan that uses index intersection of nonclustered B-Tree indexes, you may get incorrect results.

Getting the correct results in a query is pretty important (he says, with understatement), so this is an important bugfix; keep those SQL Server instances patched accordingly.


Ordered Clustered Columnstore Indexes in Azure SQL DW

Published 2019-10-07 by Kevin Feasel

Niko Neugebauer takes us through a new feature in preview for Azure SQL Data Warehouse:

After creating (or dropping and recreating) a Clustered Columnstore Index we can specify the reserved word ORDER and then one or !!!MULTIPLE!!! columns. This looks like an extremely promising feature!
On Azure SQL Data Warehouse one can of course define table as a Columnstore and with that specification it is also possible to define an ORDER option with one or multiple columns.
For the syntax and basic functionality testing purposes on Azure SQL Data Warehouse, let us then create a table with a Clustered Columnstore Index, load some data and see if by recreating an Ordered Clustered Columnstore Index we can achieve some improvements.

Niko has a few hard-earned lessons from this post.
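
For a sense of why ordering can matter, here is a toy Python model rather than anything taken from Niko's post or from Azure SQL Data Warehouse itself. It assumes the engine keeps minimum and maximum values per rowgroup and can skip a rowgroup whose range cannot contain the value being searched for. The rowgroup size of 100 and the helper names are hypothetical (real rowgroups hold up to 1,048,576 rows).

```python
ROWGROUP_SIZE = 100  # hypothetical; chosen small to keep the example readable


def rowgroup_ranges(values, size=ROWGROUP_SIZE):
    """Split values into rowgroups and keep only (min, max) metadata."""
    return [
        (min(values[i:i + size]), max(values[i:i + size]))
        for i in range(0, len(values), size)
    ]


def rowgroups_to_scan(ranges, target):
    """Count rowgroups whose min/max range could contain target."""
    return sum(1 for lo, hi in ranges if lo <= target <= hi)


# A deterministic scrambled permutation of 0..999 stands in for an
# unordered load; sorting it stands in for an ordered one.
unordered = [(i * 37) % 1000 for i in range(1000)]
ordered = sorted(unordered)

unordered_scans = rowgroups_to_scan(rowgroup_ranges(unordered), 500)
ordered_scans = rowgroups_to_scan(rowgroup_ranges(ordered), 500)
print(unordered_scans, ordered_scans)
```

In this toy data set, a lookup for the value 500 has to consider all 10 rowgroups when the data is unordered, but only 1 when it is ordered, because the ordered rowgroups cover non-overlapping ranges.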


Batch Mode Normalization

Published 2019-09-10 by Kevin Feasel

Paul White digs into batch mode normalization and its consequences for performance:

I mentioned in the introduction that not all eight-byte data types can fit in 64 bits. This fact is important because many columnstore and batch mode performance optimizations only work with data 64 bits in size. Aggregate pushdown is one of those things. There are many more performance features (not all documented) that work best (or at all) only when the data fits in 64 bits.
In our specific example, aggregate pushdown is disabled for a columnstore segment when it contains even one data value that does not fit in 64 bits. SQL Server can determine this from the minimum and maximum value metadata associated with each segment without checking all the data. Each segment is evaluated separately.

Paul goes deep into the concept, making this well worth your while.
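
The per-segment check described in the quote can be sketched as a small Python model. This is an illustration only, not SQL Server's implementation: the `SegmentMeta` type, the function names, and especially the `LO` and `HI` bounds are hypothetical placeholders, since the real limits of the normalized 64-bit format depend on the data type and are covered in Paul's post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentMeta:
    """Minimum and maximum value metadata kept for one segment."""
    segment_id: int
    min_value: int
    max_value: int


def pushdown_allowed(seg, lo, hi):
    """True if every value in the segment must fit within [lo, hi].

    The min and max bracket all values in the segment, so checking the
    two of them is enough; no row data needs to be read.
    """
    return lo <= seg.min_value and seg.max_value <= hi


def eligible_segments(segments, lo, hi):
    """Evaluate each segment separately and return the ids that qualify."""
    return [s.segment_id for s in segments if pushdown_allowed(s, lo, hi)]


# Hypothetical bounds, NOT SQL Server's actual limits.
LO, HI = -(2**62), 2**62 - 1

segments = [
    SegmentMeta(0, 1, 1_000_000),
    SegmentMeta(1, 5, 2**62 + 10),  # one oversized value disqualifies it
    SegmentMeta(2, -500, 42),
]
result = eligible_segments(segments, LO, HI)
print(result)
```

Because each segment is judged on its own metadata, a single out-of-range value costs the optimization for that segment only, not for the whole table.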
