Press "Enter" to skip to content

Category: Columnstore

Bringing Order to a Columnstore Index

Tibor Karaszi puts columnstore ducks in a row:

Data for a columnstore index is divded in groups of approximate 1 million rows, rowgroups. Each rowgroup has a set of pages for each column. The set of pages for a column in a rowgroup is called a segment. SQL Server has meta-data for the lowest and highest value for a segment. There are no SEEKs in a columnstore index. But, SQL Server can use this meta-data to skip reading segments, with the knowledge that “this segment cannot contain any data that I need based on my predicates in my WHERE clause”.

Also, you might want to do these operations using MAXDOP 1, so we don’t have several threads muddling our neat segment alignment.

I’m not sure I actually set the ORDER BY clause on columnstore indexes all that often—a quick mental survey says maybe once, though that could be my own failing rather than a statement on the utility of ordered columnstore indexes.

Comments closed

Ordered Columnstore Indexes

Joe Obbish and Erik Darling tag team on this one. First, Joe looks at the details of what the CCI ordering process does:

The sort for inserting into an ordered columnstore is a DML request sort. It appears to use the same internal mechanism as the sort that’s added for inserting into partitioned columnstore tables. The difference is that the data is sorted by the specified columns instead of a calculated partition id. In my testing, the sort appears to be a best effort sort that does not spill to tempdb. This means that if SQL Server thinks there won’t be enough memory then the data will not be fully sorted. Parallel inserts have an additional complication. 

And Erik has a messy work-around:

Anyway, I decided to dig in and see what was going on behind the scenes. Which of course, means query plans, and bothering people who are really good at debuggers.

Most of the problems that you’ll run into in SQL Server will come from sorting data.

Whenever I have to think about Sorts, I head to this post about all the different Sorts you might see in a query plan.

Definitely read both posts.

Comments closed

Ordered Columnstore Indexes in SQL Server 2022

Brent Ozar appreciates order:

So essentially, every column has a whole bunch of indexes on it.

But there’s no order whatsoever as to which rows end up in which index.

This isn’t a problem for relatively small tables, but as you get to billion-row data warehouse fact tables where columnstore should really shine, performance gradually degrades. In data warehouses, fact tables often have a commonly filtered column, like SaleDate. However, until SQL Server 2022, even if you wanted a small SaleDate range, your query would likely check hundreds or thousands of row groups, each of which had a huge range of data.

But do read the whole thing, as it seems it’s not working correctly in CTP 2.0 of SQL Server 2022. It is quite useful in Azure Synapse Analytics dedicated SQL pools, at least—that I can confirm.

Comments closed

Diving into Vertipaq Compression

Ed Pollack explains how Vertipaq compression works to make columnstore indexes so efficient:

Columnstore compression is an impressive array of algorithms that can take large analytic tables and significantly reduce their storage footprint. In doing so, IO is also reduced, and query performance dramatically improved.

This article dives into one aspect of columnstore compression that tends to get buried in all of the hoopla surrounding how awesome columnstore indexes are: Vertipaq optimization. This is a critical component of the columnstore compression process, and understanding how it works can significantly improve the performance of analytic workloads while reducing the computing resources required for the underlying data.

Click through for the steps of the process.

Comments closed

Serializing Clustered Columnstore Index Deletes

Aaron Bertrand hit a wall:

At Stack Overflow, we have some tables using clustered columnstore indexes, and these work great for the majority of our workload. But we recently came across a situation where “perfect storms” — multiple processes all trying to delete from the same CCI — would overwhelm the CPU as they all went widely parallel and fought to complete their operation. Here’s what it looked like in SolarWinds® SQL Sentry®:

It looks bad. Click through to understand why and what Aaron & co did to prevent this issue. I typically have used queue tables on the other end of columnstore indexes: as a method for ensuring that we insert 1024*1024 rows at a time. This was particularly important in the 2016 days, as we had a problem in which trickle-loading a columnstore index would cause massive numbers of rowgroups with dozens of rows, though that issue was subsequently fixed.

Comments closed

Locking Issue with Columnstore Indexes

Joe Obbish troubleshoots an issue on tables with columnstore indexes:

I recently ran into a production issue where a SELECT query that referenced a NOLOCK-hinted table was hitting a 30 second query timeout. Query store wait stats suggested that the issue was blocking on a table with a nonclustered columnstore index (NCCI). This was quite unexpected to me and I was eventually able to produce a reproduction of the issue. I believe this to be a bug in SQL Server that’s present in both RTM and the current CU as of this blog post (CU14). The issue also impacts CCIs as well but I did significantly less testing with that index type.

Read on for the issue, how you can replicate it, and a couple ways to work around it.

Comments closed

Handling Tombstones in Cassandra

Payal Kumari takes us through tombstone management in Apache Cassandra:

Got too many tombstones? This blog post will talk about how to deal with tombstones once you already have them. For more information about tombstones, check out this post: Examining the Lifecycle of Tombstones in Apache Cassandra.

Click through for several techniques for handling tombstoned records in Cassandra. In SQL Server, with columnstore indexes, the prevention advice is similar (avoid deletion or updating of data) but the treatment options are quite different.

Comments closed

Persistent Computed Columns and Columnstore Indexes

Erik Darling found a way to do something interesting:

If you read the documentation for column store indexes, it says that column store indexes can’t be created on persisted computed columns.

And that’s true. If we step through this script, creating the column store index will fail.

But it turns out that if there’s a will, there’s a way. Even if this is something you shouldn’t wish to do because who knows what it will mess up.

Comments closed

A Primer on Columnstore Indexes

Gail Shaw gives us an introduction to columnstore indexes:

Columnstores are… different.

The first, and I would say most important thing to realise about columnstore indexes is that they don’t have keys. These are not seekable indexes. Any access to a columnstore index is going to be a scan.

Instead of storing the rows together on a page, a columnstore index instead stores column values together. The rows in the table are divided into chunks of max a million rows, called a row group, and the columns are then stored separately, in what are called segments. A segment will only ever contain one column’s values.

Read the whole thing.

Comments closed