At this point you can notice, that from our 3 tables there are only 2 are available for the transactional replication, with the disk-based table with a Clustered Columnstore Index is not being available for the replication.
This means that there are no improvements since SQL Server 2014 for the DWH/BI scenarios in this direction and this is definitely sad.
Well, we can always go a different direction, like in the case of Availability Groups in SQL Server 2014, where readable secondaries are supported only for the Nonclustered Columnstore Indexes. In SQL Server 2016 we can use Nonclustered Columnstore Index even on all columns if needed and get the principle improvements for the Batch Execution Mode.
Notice here that even though we can select the InMemory tables with Clustered Columnstore, there are a couple of additional important settings that needs to be configured to make things function. So clicking through the GUI Wizard will not set things correctly up by default.
The short answer, Niko states, is that you can only replicate non-clustered columnstore indexes at this time. I can see some use for replicating clustered columnstore tables (warehouse scale-out scenarios, perhaps), but it wasn’t at the top of my columnstore improvement list.
I stumbled upon this MSDN blog post: SQL Server 2016 SSIS Data Flow Buffer Auto Sizing capability benefits data loading on Clustered Columnstore tables (catchy title). It explains how you can set the buffer properties of the data flow to try to insert data directly into compressed row groups instead of in the delta store. They fail to achieve this using SSIS 2014 and then they explain how using the new AutoAdjustBufferSize property of SSIS 2016 works miracles and everything is loaded directly into compressed row groups. Hint: you want to avoid loading data into the delta store, as it is row storage and you need to wait for the tuple mover to load the data to the CCI in the background.
However, it’s still possible to achieve the same using SSIS 2014 (or earlier). Niko Neugebauer (blog |twitter) shows this in his post Clustered Columnstore Indexes – part 51 (“SSIS, DataFlow & Max Buffer Memory”). It still depends on the estimated row size, but using these settings you should get better results:
This advice is a bit different from loading standard rowstore-based tables, but serves to pack as many rows into each columnstore row group as possible.
In SQL Server 2014 we have had 18 Extended Events and with Service Pack 1 we have received 1 more to be a total of 19 Extended Events for studying the Columnstore Indexes and the Batch Mode processing. In SQL Server 2016 that number has been greatly increased – there are whooping 61 Extended Events, that will give us an important insight into the Columnstore Indexes.
Even more important, Sunil & his team have given an own category inside the Extended Events – a category that is named Columnstore, which will ease the search for the basic columnstore events. Be aware though not all Extended Events related to Columnstore Indexes are included in that category – even including all channels will give you 41 Extended Events, while hiding the other 20 Extended Events, which are sometimes not categorised at all and at other times are stored under different categories, such as Execution or Error, for example. I believe the reason behind not changing the old Extended Events category is quite simple – Microsoft always looks for avoiding breaking existing applications.
There’s a lot here to digest, so read the whole thing.
There is a new line in the properties of the iterator, showing the number of locally aggregated rows and that number equals 619255, that should be exactly the number of rows that is missing from the arrow connecting 2 iterators:
1 select 12008353 + 619255
Gives us our perfect 12627608 rows.
Is there any more information on this operation?
Indeed, just right-click on the Columnstore Index Scan and select it’s properties:
This is tied to some columnstore performance improvements in SQL Server 2016.
Note that this script requires SQL Server 2016 (or later) because the database engine team made some great changes to columnstore indexes, allowing us to use REORGANIZE to clear out deleted rows and compact row groups together, as well as its previous job of marking open delta stores as available for compression.
The code is available as a Gist for now, at least until I decide what to do with it. Comments are welcome, especially if I’m missing a major reorganize condition.
As mentioned, comments are welcome.
I decided to make a serious step back and write about something that is concerning the current (SQL Server 2014) and the elder version of SQL Server that supports Nonclustered Columnstore Indexes – (SQL Server 2012).
The Nonclustered Columnstore Indexes in SQL Server 2012 & 2014 are non-updatable, meaning that after they are built on the table, you cannot modify the table anymore – you can only read the data from it.
The common solutions for this problem are:
– Using Partitioning
– Disabling Columnstore, modifying the data and Rebuilding the Columnstore Index then (thus activating it)
Sounds easy, doesn’t it ?
Well, like with everything in the real life, there are a couple of quite important gotchas here.
The “non-updatable” part is why I ignored non-clustered columnstore indexes. With SQL Server 2016, I’m going to take another look at them. But if you’re living on 2012 or 2014 for a while, this is a good post to give you an idea of how to load those tables.
Sunil Agarwal has a couple of posts on columnstore index defragmentation in SQL Server 2016.
Let us now look at how you can use REORGANIZE command to defragment your columnstore index. Note, this command is only supported for clustered columnstore index (CCI) and nonclustered columnstore index for disk-based tables. In the example below, I create an empty table and then create a clustered columnstore index and finally I load 300k rows. SQL Server 2016 loads data from staging table into CCI in parallel when you specify TABLOCK hint. The machine I ran this test on has 4 logical processors so the 300k rows got divided into 75k each between 4 threads. Since each thread was loading < 102400 rows, the columnstore index ends up with 4 delta rowgroups as shown below.
A compressed rowgroup is considered as fragmented when any of the following two conditions is met
Less than 1 million rows but the trim_reason ( please refer to https://msdn.microsoft.com/en-us/library/dn832030.aspx ) is other than DICTIONARY_SIZE. If the size of a compressed rowgroup is reduced because it has reached the maximum dictionary size, then it can’t be further reduced
It has nonzero deleted rows that exceeds a minimum threshold.
I just got finished with a first draft of a script to determine whether reorganizing a clustered columnstore index partition would be worthwhile, so this is great timing. I hope to make my script available soon, after I incorporate Sunil’s heuristics.
There is a very usable support for Columnstore Indexes within the temporary objects, but they are not appearing in any of the DMV’s to be analysed or optimised. This is especially sad in the relation to the global temporary tables which are some of the more useful temporary objects.
For the most part, I’d consider these reasonable results. Hopefully we can get columnstore stats on temp tables, but even that’s not a huge loss.
Do you need to be concerned about that a delta rowgroup is scanned single threaded? The answer is NO for two reasons (a) most columnstore indexes have very few delta rowgroups (b) if you have multiple delta rowgroups, they can be scanned in parallel with one thread per delta rowgroup
I have a beef with (a), at least for SQL Server 2014, but that’s a story for another day.
Recall that on rowstore tables (i.e. the tables organized as rows not as columnstore), SQL Server requires you to specify TABLOCK for parallel bulk import to get minimal logging and locking optimizations. One key difference for tables with clustered columnstore index is that you don’t need TABLOCK for getting locking/logging optimizations for bulk import. The reasons for this difference in behavior is that each bulk import thread can load data exclusively into a columnstore rowgroup. If the batch size < 102400, then the data is imported into a delta rowgroup otherwise a new compressed rowgroup is created and the data is loaded into it. Let us take two following interesting cases to show this bulk import behavior. Assume you are importing 4 data files, each with one bulk import thread, concurrently into a table with clustered columnstore index
The “don’t use TABLOCK” is interesting in comparison to rowstore tables.
To solve the performance problem I went straight to the DefaultBufferMaxRows setting and set it to be equal of the maximum number of rows in a Row Group – 1048576. Together with the AutoAdjustBufferSize setting it helps the actual current size of the DataFlow Buffer that will be used for transferring the data from the source to the destination table.
What should I say – it worked like magic:
I guess that with 2:09 Minutes the clear winner of this test is the configuration with AutoAdjustBufferSize set to True and the DefaultBufferMaxRows to 1048576. It took less then a half of the time with just AutoAdjustBufferSize activated and the insertion process was executed with the help of the Bulk Load API – meaning that we did not have to wait for the Tuple Mover or to execute it manually.
Doubling insertion performance is nothing to scoff, especially for something like columnstore tables, where we expect millions (or more) of rows to be inserted.