On my 4-core VM, this query takes 33 seconds to execute on SQL Server 2016 with Service Pack 1, while burning almost 48 seconds of CPU time.
The relevant part of the execution plan can be found below. It shows a number of performance problems that this query suffers from, such as the INNER LOOP JOIN and the INDEX SPOOL. An even worse part is hidden and becomes visible only once you open the properties of any iterator in the lower tree (left side of the LOOP JOIN): it all runs in the Row Execution Mode.
To show you the problem, on the left side you will find the properties of the sort iterator located in the lower (left) part of the LOOP JOIN. It was executed around 770,000 times in the Row Execution Mode, effectively taking away any chance of this query executing quickly. One might argue that it might be more effective to do the loop part in Row Mode, but given that we are sorting around 3.1 million rows there, I have no doubt that it would be faster in Batch Execution Mode. Consulting the last sort iterator in the execution plan (TOP N SORT), you will find that it runs in Batch Execution Mode, even though it is processing around 770,000 rows.
There’s some valuable information here.
Needless to say, looking at the actual execution plans you will notice a 10 times difference between them, even though both tables contain the very same data!
The query cost for the partitioned table is staggering: it is around 10 times bigger (~8.8 vs ~0.81 for the first query).
The execution times partly reflect this situation: 12 ms vs 91 ms. The non-partitioned table performs almost 9 times faster overall, and the CPU time spent reflects it: 15 ms vs 94 ms. Remember that both tables are based on Columnstore Indexes! Partitioning your table in the wrong way carries a huge penalty that might not be directly detectable through the execution plans of complex queries. Well, you might want to use the CISL, just saying.
If you can’t fill a single rowgroup, your partition is too granular. Even then, I’d like to see double-digit rowgroups per partition, though that’s just me.
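That rule of thumb is easy to turn into arithmetic. Here is a minimal sketch, assuming the documented maximum of 1,048,576 rows per compressed rowgroup; the three labels and the function names are my own shorthand for the thresholds above, not anything SQL Server reports.

```python
MAX_ROWGROUP_ROWS = 1_048_576  # maximum rows in one compressed columnstore rowgroup


def full_rowgroups(partition_rows: int) -> float:
    """How many maximum-size rowgroups the rows in one partition would fill."""
    return partition_rows / MAX_ROWGROUP_ROWS


def assess_partition(partition_rows: int) -> str:
    """Apply the rule of thumb; the labels are hypothetical, chosen for this sketch."""
    filled = full_rowgroups(partition_rows)
    if filled < 1:
        return "too granular"   # cannot fill even a single rowgroup
    if filled < 10:
        return "workable"       # fills rowgroups, but fewer than double digits
    return "comfortable"        # double-digit rowgroups per partition


if __name__ == "__main__":
    for rows in (500_000, 5_000_000, 20_000_000):
        print(rows, round(full_rowgroups(rows), 2), assess_partition(rows))
```

In practice rowgroups are often smaller than the maximum (memory pressure, dictionary size, small loads), so treat this as an upper bound on how well a partition can pack.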
You may be wondering what this magic number of 900 rows within a batch is. Well, when executing a query in BatchMode, SQL Server allocates a 64 KB structure to group the rows. The number of rows in this structure can vary from 64 to 900 depending upon the number of columns selected. For the example above, there are two columns that are referenced and X marks the rows that qualified in the BatchMode structure shown in the picture below. If SCAN is part of a bigger query execution tree, the pointer to this structure is passed to the next operator for further processing. Not all operators can be executed in BatchMode. Please refer to Industry leading analytics query performance for details on BatchMode Operators.
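To make the batch structure a little more concrete, here is a toy model of it: one vector per referenced column plus a "qualified" marker per row, with the same object handed from operator to operator. The `Batch` class and the helper functions are hypothetical names for this sketch; it only mimics the shape described above and says nothing about how SQL Server actually lays out its 64 KB structure.

```python
from dataclasses import dataclass, field

MAX_BATCH_ROWS = 900  # upper bound on rows per batch quoted above


@dataclass
class Batch:
    columns: dict                                   # column name -> vector of values
    qualified: list = field(default_factory=list)   # the "X" marks, one flag per row

    def __post_init__(self):
        if not self.qualified:
            row_count = len(next(iter(self.columns.values())))
            self.qualified = [True] * row_count


def make_batches(columns, batch_rows=MAX_BATCH_ROWS):
    """Cut column vectors into batches of at most batch_rows rows."""
    row_count = len(next(iter(columns.values())))
    for start in range(0, row_count, batch_rows):
        yield Batch({name: vec[start:start + batch_rows] for name, vec in columns.items()})


def filter_batch(batch, column, predicate):
    """Mark rows that fail the predicate instead of copying the survivors."""
    values = batch.columns[column]
    batch.qualified = [q and predicate(v) for q, v in zip(batch.qualified, values)]
    return batch  # the same structure is passed on to the next operator


def sum_batch(batch, column):
    """A downstream operator that only looks at qualified rows."""
    return sum(v for v, q in zip(batch.columns[column], batch.qualified) if q)


if __name__ == "__main__":
    table = {"qty": list(range(2000)), "price": [2] * 2000}
    total = 0
    for b in make_batches(table):
        total += sum_batch(filter_batch(b, "qty", lambda q: q % 2 == 0), "price")
    print(total)
```

The point of the sketch is the per-batch loop: each operator is invoked once per batch of up to 900 rows rather than once per row, which is where the savings over row mode come from.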
Under the right circumstances, BatchMode execution can be a major performance benefit.
Sunil Agarwal has a two-part series on columnstore data elimination. First up is column elimination:
Now, let us run the same query on the table with clustered columnstore index as shown in the picture below. Note that the logical IOs for the LOB data are reduced by three quarters for the second query, as only one column needs to be fetched. You may wonder why LOB? Well, the data in each column is compressed and then stored as a BLOB. Another point to note is that the query with the columnstore index runs much faster: 25x for the first query and 4x for the second query.
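The three-quarters drop is what you would expect if the table had four similarly sized columns and the query touched one of them. A back-of-the-envelope sketch, where the column names and page counts are invented for illustration and the rowstore side ignores compression entirely:

```python
# Hypothetical compressed size of each column, in pages (invented numbers).
column_pages = {"OrderID": 250, "CustomerID": 250, "SalesDate": 250, "Amount": 250}


def rowstore_pages_read(pages_by_column, referenced):
    """Rows are stored whole, so a scan reads every column no matter what is referenced."""
    return sum(pages_by_column.values())


def columnstore_pages_read(pages_by_column, referenced):
    """Each column is stored separately, so only referenced columns are read."""
    return sum(pages_by_column[name] for name in referenced)


if __name__ == "__main__":
    query_columns = ["Amount"]
    print(rowstore_pages_read(column_pages, query_columns))
    print(columnstore_pages_read(column_pages, query_columns))
```

Real columns compress very differently from one another, so the actual saving depends on which columns a query references, not just how many.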
In the context of rowgroup elimination, let us revisit the previous example with sales data:
- You may not even need partitioning to filter the rows for the current quarter as rows are inserted in the SalesDate order allowing SQL Server to pick the rowgroups that contain the rows for the requested date range.
- If you need to filter the data for a specific region within a quarter, you can partition the columnstore index at quarterly boundaries and then load the data into each partition after sorting on the region. If the incoming data is not sorted on region, you can follow these steps: (a) switch out the partition into a staging table T1, (b) drop the clustered columnstore index (CCI) on T1 and create a clustered B-tree index on T1 on column ‘region’ to order the data, (c) now create the CCI while dropping the existing clustered index. A general recommendation is to create the CCI with DOP=1 to keep the perfect ordering.
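Why load order matters so much is easier to see with a small simulation. The sketch below assumes only that each rowgroup keeps the minimum and maximum value of a column and that a range predicate can skip any rowgroup whose range does not overlap it. The day numbers, the 1,000-row rowgroups, and the function names are made up for the example.

```python
import random


def build_rowgroups(values, rows_per_group):
    """Split values in load order into rowgroups and record min/max metadata."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def rowgroups_to_scan(groups, low, high):
    """Keep only rowgroups whose [min, max] range overlaps the predicate range."""
    return [g for g in groups if g["max"] >= low and g["min"] <= high]


# 36,000 sales rows spread evenly over days 1..360 (hypothetical data).
days = list(range(1, 361)) * 100

ordered = sorted(days)                   # rows arrive in SalesDate order
shuffled = days[:]
random.Random(42).shuffle(shuffled)      # rows arrive in no particular order

ordered_groups = build_rowgroups(ordered, 1000)
shuffled_groups = build_rowgroups(shuffled, 1000)

if __name__ == "__main__":
    # "Current quarter" here means days 271..360.
    print(len(rowgroups_to_scan(ordered_groups, 271, 360)), "of", len(ordered_groups))
    print(len(rowgroups_to_scan(shuffled_groups, 271, 360)), "of", len(shuffled_groups))
```

With ordered loads only the rowgroups covering the requested days survive; with shuffled loads nearly every rowgroup spans the whole year, so there is nothing to eliminate. That is the same reason the quoted advice builds the CCI from a sorted clustered index at DOP=1.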
The takeaway from these two articles: queries which hit a small percentage of columns and stick to a relatively small number of rowgroups will likely perform better. For people who understand normal B-tree indexes, the second point seems clear enough, but the first point is at least as important.
Looking at the execution plan, SQL Server decided to scan the non-clustered columnstore index, even though it doesn’t contain the computed column BirthYear! This surprised me, because I have a plain old non-clustered index on BirthYear which covers the query as well. I guess the optimizer is really excited about that nonclustered columnstore.
Kendra links to a Connect item from Niko Neugebauer to add persisted computed columns to columnstore indexes.
This adjustment might take place based on the inadequacy of the estimated number of rows (which is based on the statistics available at the time of the execution plan generation) relative to the real number of rows that the iterator/query is processing.
The 2 possible adjustment scenarios are:
– when estimated number of rows is too high and the memory is granted to the query, even though the query itself will not use it.
– when estimated number of rows is too low and memory operations such as hashing or sorting will not have enough space to fit the complete data sets, thus making them spill to TempDB (temporarily storing the data there while doing the work, because the query lacks available memory).
Read on for details, but one interesting caveat is that this doesn’t change anything for the first run; it only updates requests on subsequent runs, so it benefits most from consistent workloads with significant plan re-use. That said, it looks extremely useful.
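The feedback loop can be sketched in a few lines. This is a toy model, not SQL Server's algorithm: the 2x waste threshold, the 1.5x growth factor, and the function names are all assumptions picked for illustration. It does show both scenarios and the first-run caveat, since the grant only changes for the following execution.

```python
def next_grant(granted_mb, used_mb, spilled, waste_factor=2.0, growth=1.5):
    """Pick the grant for the next execution of a cached plan.

    waste_factor and growth are hypothetical tuning knobs for this sketch.
    """
    if spilled:
        return granted_mb * growth          # estimate too low: ask for more next time
    if granted_mb > used_mb * waste_factor:
        return used_mb                      # estimate too high: shrink to what was used
    return granted_mb                       # close enough: leave the grant alone


def simulate(initial_grant_mb, needed_mb, runs):
    """Replay the same query several times and record (grant, spilled) per run."""
    grant = initial_grant_mb
    history = []
    for _ in range(runs):
        spilled = needed_mb > grant
        used = min(needed_mb, grant)
        history.append((grant, spilled))    # the current run keeps its original grant
        grant = next_grant(grant, used, spilled)
    return history


if __name__ == "__main__":
    print(simulate(1000, 100, 3))   # over-estimated rows
    print(simulate(100, 400, 6))    # under-estimated rows
```

Note how the under-estimated query spills on several consecutive runs before the grant catches up, which is why a consistent, frequently re-used plan gets the most out of this.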
Turns out the majority of the rows belonged to the second scenario. Whoops. The initial run took a little over 20 hours. Not exactly rocket speed. The problem was that for each period, a large number of rows in the clustered columnstore index (CCI) had to be updated, just to set the range of the interval. Updates in a CCI are expensive, as they are split into inserts and deletes. Doing so many updates resulted in a heavily fragmented CCI and with possibly too many rows in the delta storage (which is row storage).
Read the whole thing. Koen links to a Niko Neugebauer post, which you should also read. After that, read my warning on trickle loading. The querying benefits you get from clustered columnstore indexes are great, but they come at a cost when you aren’t simply inserting new rows.
In the upcoming version of SQL Server (for the moment known as SQL Server vNext), Microsoft has finally announced the upcoming support for the LOBs within Columnstore Indexes – thus enabling the usage of the NVARCHAR(MAX), VARCHAR(MAX), and VARBINARY(MAX) data types on the tables with Columnstore Indexes that include those columns.
For the tests, I have decided to spin up a Virtual Machine in Azure with an installation of the currently available CTP1 of the SQL Server vNext, which has a version 188.8.131.52.
Read the whole thing. It’s hard to tell at this point if these are bugs, incomplete functionality, or what, so it’ll be interesting to track changes over the CTPs.
Given the improvements and the availability of the programmability surface for every edition (with some insignificant & logical limitations) that I have blogged about in SQL Server 2016 SP1 – Programmability Surface for everyone!, I believe everyone using Microsoft Data Platform has rejoiced greatly. Of course, now everyone can have Columnstore Indexes on every SQL Server edition!
There are some noticeable limitations that were announced right from the start, such as the maximum size of the Columnstore Object Pool (you can find more information about it here – Columnstore Indexes – part 38 (“Memory Structures”)), but there are more limitations to the Standard Edition and the lower ones, and it is extremely important to know and understand them in order to make the right decision about when your business is ready, or needs, to upgrade to the Enterprise Edition of SQL Server.
If you’re on Standard Edition and excited about using Columnstore, do read Niko’s post. Columnstore won’t work as fast as it does on Enterprise Edition (gotta have a reason to upgrade) but based on what he’s shown thus far, Columnstore is still a good reason to upgrade to 2016 SP1 if you’re on Standard Edition.
With an incredible Service Pack 1, Microsoft has triumphantly announced that all editions (Standard, Web, Express and even Local) will get the most advanced programming capabilities of Columnstore, In-Memory, Database Snapshot, Compression, Partitioning & many others, plus some incredible features for T-SQL (CREATE OR ALTER) and Execution Plan details (Actual Information on processed Rows, used Trace Flags, etc.), but at the same time there are some quite important improvements under the hood that will make you want to use Service Pack 1 for SQL Server 2016 immediately. One of these features is the fast addition of the Columnstore Index to Memory-Optimised tables. Let's put it to the test by restoring a copy of the free ContosoRetailDW database:
These results look nice.