Press "Enter" to skip to content

Category: Columnstore

Columnstore Indexes in Azure SQL Database

Niko Neugebauer takes us through the columnstore offerings available in Azure SQL Database:

Almost 2 years ago (22nd of March 2018) in Columnstore Indexes – part 121 (“Columnstore Indexes on Standard Tier of Azure SQL DB”) I have already mentioned that Columnstore Indexes were available in Azure SQL Database in Standard 3 (S3) edition and higher, while people I meet keep on mentioning and believing that in order to get Columnstore Indexes one needs to use Premium editions.

Since that blog post a lot of time has passed and in the mean time we have got new tiers with new generations of provisioned General Purpose tiers (Generation 4, Generation 5, FSv2 Series & M Series) appearing, plus the Serverless Tier and not to forget the very promising Hyperscale tier … besides the Azure SQL Database Managed Instance of course, which has already been generally available for some time and the good old Elastic Pools which were never mentioned in original article.

It sounds like, on the whole, columnstore is a normal part of Azure SQL Database across the board—it’s not a special add-on feature.

Comments closed

Columnstore Versus Page Compression

Aaron Bertrand compares columnstore and page compression on a specific table:

Recently someone at work asked for more space to accommodate a rapidly growing table. At the time it had 3.75 billion rows, presented on 143 million pages, and occupying ~1.14TB. Of course we can always throw more disk at a table, but I wanted to see if we could scale this more efficiently than the current linear trend. Sounds like a great job for compression, right? But I also wanted to try out some other solutions, including columnstore – which people are surprisingly reluctant to try. I am no Niko, but I wanted to make an effort to see what it could do for us here.

Note that I am not focusing on reporting workload or other read query performance at this time – I merely want to see what impact I can have on storage (and memory) footprint of this data.

Here is the original table. I’ve changed table and column names to protect the innocent, but everything else is relatively accurate.

Page compression won, and I’ve got a pretty good idea why (though some of the diagnostic info is gone): Aaron has several VARCHAR and NVARCHAR columns, and those blow up the columnstore dictionary pretty fast. Aaron has more to go in this series, so stay tuned.

Comments closed

Why So Few Columnstore Indexes Around?

Grant Fritchey has a bit of a rant about people not using Columnstore indexes as much as they should:

It was already common knowledge that columnstore indexes didn’t work for most of us.

Fact is, that’s not true. Now that we have clustered columnstore and non-clustered columnstore, you can go nuts. Most of your data access is through analytical channels? Awesome, use a clustered columnstore. Sometimes though, you need point lookups. Not a problem, add a nonclustered b-tree index to the clustered columnstore. Go here to learn more about Columnstore Indexes.

In short, today, we can completely orient our data storage with our principal data access. Yet, most people are not using these things at all.

One of my interview questions is about columnstore indexes. I’ve learned that I needed to preface it with “What’s the latest version of SQL Server you’ve worked with?” A lot of people answer 2012. Even among the people who use 2016, the normal answer is that they haven’t learned about columnstore yet. And that goes back to Grant’s learning gap: it’s not that hard to grab a book on SQL Server 2019, spin up a Docker container, and dive in. Or watch a course, spin up a Docker container, and follow along. Or read a blog post, spin up a Docker container, and…well, you get the idea.

Comments closed

Get the Stack Overflow Columnstore Edition Database

Erik Darling has more for you on Stack Overflow + Columnstore:

If you want to download the database, here’s the magnet link for the torrent. I don’t have another means of distributing this; it’s too big of a file.

If you want the GitHub scripts to create and load data, head over here.

In addition, Erik has some quick queries showing table relationships in a world without foreign key constraints:

To get you started exploring the Stack Overflow column store database, here are some queries that show how tables are related.

The two main relationships are User Id, and Post Id.

Quick side note: joining together large columnstore indexed tables? Generally not the best idea.

Comments closed

Stack Overflow DB, Columnstore Edition

Erik Darling has started a new series. Part one is the intro:

I really wanted a version of the Stack Overflow data dump that was all clustered column store. In SQL Server 2016, that didn’t really work because of restrictions around MAX data types. In 2017 it did, but… If your data warehouse has a bunch of max data type columns, that’s bad and you should feel bad.

The problem here is that once you drop out the “big” columns (AboutMe from Users, Text from Comments, Body from Posts), the entire ~300GB database compressed down to about 6GB. That means if we want a realistically sized data warehouse, we’d need a “Make Big” script, like people used to use for Adventure Works before it went out of business.

Part 2, like a noble spirit, embiggens the smallest man:

One thing I’d love feedback on is advancing dates. Right now, the script doesn’t do that at all. I thought of different ways to handle it, but didn’t like any of them, mostly because of how it might mess with the partitioning function I’m using. I felt like I was overthinking it quite a bit, and decided to leave dates as-is, and only increment User and Post Ids.

A quick note: This script assumes that a database called StackOverflow will be the source of the loads. If you need to use a different version, that’s a manual change. I didn’t want to go down the dynamic SQL route here until I gauged popularity.

If you want to play along at home, you can grab the data dump from archive.org.

Comments closed

Data Retrieval Bug Fixed for Columnstore Indexes

Dmitri Korotkevich takes us through an important bugfix in SQL Server:

The typical columnstore table is usually large and contains hundreds of millions or even billions of rows. Think about large fact tables in the data warehouses or huge transactional tables in OLTP systems. Those tables are usually partitioned. Besides usual reasons (Availability, Maintainability, etc), partitioning helps with the data load – it is easier to perform ETL in the staging table and import data through partition switch.

And here comes the problem. If you run OLTP query against partitioned clustered columnstore table and end up with the execution plan that uses index intersection of nonclustered B-Tree indexes, you may get incorrect results.

Getting the correct results in a query is pretty important (he says, with understatement), so this is an important bugfix; keep those SQL Server instances patched accordingly.

Comments closed

Ordered Clustered Columnstore Indexes in Azure SQL DW

Niko Neugebauer takes us through a new feature in preview for Azure SQL Data Warehouse:

After creating (or dropping and recreating a Clustered Columnstore Index we can specify the reserved word ORDER and then one or !!!MULTIPLE!!! columns. This looks like an extremely promising feature!

On Azure SQL Data Warehouse one can of course define table as a Columnstore and with that specification it is also possible to define an ORDER option with one or multiple columns.

For the syntax and basic functionality testing purposes on Azure SQL Data Warehouse, let us then create a table with a Clustered Columnstore Index, load some data and see if by recreating an Ordered Clustered Columnstore Index we can achieve some improvements.

Niko has a few hard-earned lessons from this post.

Comments closed

Batch Mode Normalization

Paul White digs into batch mode normalization and its consequences for performance:

I mentioned in the introduction that not all eight-byte data types can fit in 64 bits. This fact is important because many columnstore and batch mode performance optimizations only work with data 64 bits in size. Aggregate pushdown is one of those things. There are many more performance features (not all documented) that work best (or at all) only when the data fits in 64 bits.

In our specific example, aggregate pushdown is disabled for a columnstore segment when it contains even one data value that does not fit in 64 bits. SQL Server can determine this from the minimum and maximum value metadata associated with each segment without checking all the data. Each segment is evaluated separately.

Paul goes deep into the concept, making this well worth your while.

Comments closed

Clustered Columnstore and Azure SQL DB

Arun Sirpal takes us through online clustered columnstore index creation in Azure SQL Database:

What tier do you need to create one of these things? Let’s see.

CREATE CLUSTERED  COLUMNSTORE INDEX cciSales ON [SalesLT].[ProductModelProductDescription] WITH ( ONLINE = ON )

But I get this message, Msg 40536, Level 16, State 32, Line 1
‘COLUMNSTORE’ is not supported in this service tier of the database. See Books Online for more details on feature support in different service tiers of Windows Azure SQL Database.

Read on to see the minimum tier which allows online creation of clustered columnstore indexes.

Comments closed

Creating a Columnstore Index

Monica Rathbun shows a scenario where creating a clustered columnstore index can make data retrieval much faster:

Using AdventureworksDW2016CTP3 we will work with the FactResellerSalesXL table which has 11.6 million rows in it. The simple query we will use as a demo just selects the ProductKey and returns some aggregations grouping them by the different product keys.

First, we will run the query with no existing columnstore index and only using the current clustered rowstore (normal) index. Note that I turned on SET STATISTICS IO and TIME on. These two SET statements will help us better illustrate the improvements provided by the columnstore index. SET STATISTICS IO displays statistics on the amount of page activity generated by the query. It gives you important details such as page logical reads, physical reads, scans, and lob reads both physical and logical. SET STATISTICS TIME displays the amount of time needed to parse, compile, and execute each statement in the query. The output shows the time in milliseconds for each operation to complete. This allows you to really see, in numbers, the differences.

Click through for the example.

Comments closed