The position of the bitmap has changed so that it’s evaluated after the key lookup. That makes sense because the key lookup returns the column to be filtered against. However, the bitmap filter still reduces the estimated number of key lookups from 3000000 to 3000. This is impossible. The filter can only be applied after the key lookup, so it does not make sense for the bitmap to reduce the number of estimated executions of the key lookup.
Performance is significantly worse, with the query now requiring 12199107 logical reads from the rowstore table and 13406 ms of CPU time overall. We can see that the query did three million key lookups:
This is a fairly deep post, so you’ll probably want to check out the Paul White post on bitmaps first.
The common, by-the-book approach recommends dropping the columnstore index, splitting or merging partitions, and recreating the index afterwards. As you can imagine, this leads to an extremely inefficient process with a huge amount of unnecessary overhead on large tables. After all, you have to drop and recreate the columnstore index, converting the table to a heap, even though only a subset of the partitions needs to be rebuilt. Fortunately, you can minimize the overhead with a simple workaround:
Switch the partition(s) you want to split or merge out to a separate staging table.
Split or merge the partition(s) in the main table. You can do that because the partitions will be empty after the previous step.
Drop the columnstore index in the staging table, split or merge the partition(s) there, and recreate the index afterwards.
Switch the partition(s) back from the staging table to the main table.
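The four steps above can be sketched with a toy in-memory model. This is only an illustration of the order of operations, not SQL Server code: the `PartitionedTable` class, the `switch` helper, the boundary values, and the row keys are all hypothetical, and the model assumes RANGE LEFT-style boundaries and a staging table partitioned the same way as the main table.

```python
from bisect import bisect_left


class PartitionedTable:
    """Toy partitioned table; a boundary value belongs to the partition on its left."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        self.partitions = [[] for _ in range(len(self.boundaries) + 1)]

    def partition_for(self, key):
        # Index of the partition that holds (or would hold) this key.
        return bisect_left(self.boundaries, key)

    def insert(self, key):
        self.partitions[self.partition_for(key)].append(key)

    def split(self, boundary, has_columnstore=True):
        idx = self.partition_for(boundary)
        rows = self.partitions[idx]
        # Mirrors the limitation described in the text: a non-empty
        # columnstore partition cannot be split.
        if rows and has_columnstore:
            raise RuntimeError("cannot split a non-empty columnstore partition")
        self.boundaries.insert(idx, boundary)
        left = [r for r in rows if r <= boundary]
        right = [r for r in rows if r > boundary]
        self.partitions[idx:idx + 1] = [left, right]


def switch(source, target, idx):
    """Move one partition between tables that share the same boundaries."""
    if source.boundaries != target.boundaries:
        raise RuntimeError("partition boundaries must match")
    if target.partitions[idx]:
        raise RuntimeError("target partition must be empty")
    target.partitions[idx], source.partitions[idx] = source.partitions[idx], []


main = PartitionedTable([10, 20])
for key in (5, 12, 14, 17, 19, 25):
    main.insert(key)
staging = PartitionedTable([10, 20])

switch(main, staging, 1)                  # step 1: switch the partition out
main.split(15)                            # step 2: the main partition is now empty
staging.split(15, has_columnstore=False)  # step 3: index dropped, so the split is allowed
switch(staging, main, 1)                  # step 4: switch both halves back
switch(staging, main, 2)

print(main.partitions)
```

The point of the model is that the main table never has to split a partition that holds rows. Only the staging table, which holds just the affected data, pays for the drop and rebuild.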
Read on for a detailed walkthrough of these steps.
The biggest difference lies in partition function split and merge behavior. With B-Tree indexes, you can split and merge non-empty partitions. SQL Server splits or merges the data automatically, albeit while holding a schema-modification (Sch-M) lock on the table. Other sessions are unable to access the table, but at least the split and merge work.
This is not the case with columnstore indexes, where you get an error when you try to split or merge non-empty partitions. There are several reasons for this limitation. Without diving very deep into columnstore index internals, I can say that the only way to do a split or merge internally would be to rebuild the columnstore index on the affected partitions. Fortunately, you can split and merge empty columnstore partitions, which allows you to work around the limitation, implement the Sliding Window pattern, and use partitioning to purge data.
With SQL Server 2017, the logic gets a little simpler, as you can directly truncate partitions instead of shuffling them off to a separate table.
The insert query now takes 3594 ms of CPU time and 2112 ms of elapsed time on my machine. The size of each rowgroup did not change. It’s still 2098320 bytes. Even though this is a parallel query, there’s no element of randomness in this case. In the query plan we can see that the source table was scanned in a serial zone and that round robin distribution is used to distribute exactly half of the rows to each parallel thread.
This seems like a reasonable plan given that TOP forces a serial zone and we need to preserve order. It’s possible to rewrite the query to encourage a parallel scan of the source table, but that would introduce an order-preserving gather streams operator.
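To see why round robin distribution leaves no room for randomness, here is a minimal sketch of the idea. It is a simplification, not SQL Server's exchange operator: the `round_robin` function and the row count of 1,000,000 are hypothetical, and rows are dealt out one at a time for illustration.

```python
def round_robin(rows, threads=2):
    """Deal rows out to threads in strict rotation, like dealing cards."""
    buckets = [[] for _ in range(threads)]
    for i, row in enumerate(rows):
        buckets[i % threads].append(row)
    return buckets


# Hypothetical row count, chosen only for illustration.
buckets = round_robin(range(1_000_000), threads=2)
print([len(b) for b in buckets])
```

With an even number of rows and two threads, every run produces the same split, which is why each thread builds rowgroups of identical size.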
Click through for the full story.
We’ve only observed this deadlock with multiple concurrent sessions inserting into the delta store for the same target CCI, due to either server memory pressure or very low cardinality estimates (less than 251 rows). The correct mitigation depends on why you’re seeing the issue in the first place. If you’re seeing it due to low cardinality estimates, then fix your estimates, or at the very least get them above 250 rows. If you’re seeing it because the memory grant for the CCI build times out after 25 seconds, then lower concurrency or increase server memory.
The problem can also be avoided by not doing concurrent inserts in the first place. In some cases a parallel insert may be a reasonable alternative. There’s also some evidence that the deadlock is only seen when the number of rows for insert is very close to 1048576, but we weren’t able to draw any definitive conclusions around that.
Read the whole thing. Also check out his Connect item.
The following sample demonstrates the high scale and performance of SQL Database, with the ability to insert 1.4 million rows per second by using a non-durable memory-optimized table to speed up data ingestion, while managing the In-Memory OLTP storage footprint by offloading historical data to a disk-based Columnstore table for real time analytics. One of the customers already leveraging Azure SQL Database for their entire IoT solution is Quorum International Inc., who was able to double their key database’s workload while lowering their DTU consumption by 70%.
If you hit on the right scenario, memory-optimized tables can be great.
After waiting for 25 seconds (notice that the difference between request_time and grant_time is exactly 25 seconds), the engine decides to grant some minimum amount of memory anyway, allowing the process to carry on without being cancelled. The penalty is very heavy, though: the inserts go not into compressed row groups but into delta stores, making the operation not minimally logged or, in other words, painfully slow and inefficient.
To confirm the final results, let’s check the row groups of our tables. Given that we canceled the inserts into the first 2 tables, we expect 1 row group for the [dbo].[FactOnlineSales_Stage3] table and 1 row group for the [dbo].[FactOnlineSales_Stage4] table, corresponding to the 3rd and 4th data loading threads:
As Niko points out, this could be the difference between a well-behaved, single compressed rowgroup load versus dumping a million rows into the deltastore.
I briefly reviewed the documentation written by Microsoft concerning the appearance of delta stores. Here’s a quote:
Rows go to the deltastore when they are:
Inserted with the INSERT INTO VALUES statement.
At the end of a bulk load and they number less than 102,400.
Updated. Each update is implemented as a delete and an insert.
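As a rough sketch, the bulk load rule in the quoted list can be written out as follows. This models only that one documented rule. It assumes the maximum rowgroup size of 1,048,576 rows, and it ignores every other way rows can end up in a delta store, such as memory grant timeouts or partitioning. The function name `route_bulk_load` is hypothetical.

```python
MIN_DIRECT_COMPRESS_ROWS = 102_400  # threshold from the quoted documentation
MAX_ROWGROUP_ROWS = 1_048_576       # assumed maximum rows per rowgroup


def route_bulk_load(row_count):
    """Return (destination, rows) pairs for a single bulk load of row_count rows."""
    full_groups, remainder = divmod(row_count, MAX_ROWGROUP_ROWS)
    groups = [("compressed", MAX_ROWGROUP_ROWS)] * full_groups
    if remainder >= MIN_DIRECT_COMPRESS_ROWS:
        groups.append(("compressed", remainder))
    elif remainder > 0:
        # Leftover rows below the threshold land in the deltastore.
        groups.append(("deltastore", remainder))
    return groups


print(route_bulk_load(145_000))
print(route_bulk_load(1_048_577))
```

Under these assumptions, the second call leaves a single trailing row in the deltastore, while the first goes straight to a compressed rowgroup.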
There are also a few mentions of how partitioning can lead to the creation of multiple delta stores from a single insert. It seems as if the document is incomplete or a little misleading, but I admit that I didn’t exhaustively review everything. After all, Microsoft hides columnstore documentation all over the place.
This is a great compendium of ways in which you can shoot yourself in the foot with clustered columnstore indexes.
I expect not just a couple of rows to be sent over to Machine Learning Services, but huge tables with millions of rows and hundreds of columns, because these kinds of tables are the basis for data science and machine learning processes.
Of course, we are focusing here on a rather small part of the total process (just the IO between the SQL Server relational engine and Machine Learning Services), and the analytical process itself can take hours, but the IO can still make a good difference in some cases.
I love this improvement. It is very much under the hood, but it will help a couple of people decide to migrate to the freshly released SQL Server 2017 instead of SQL Server 2016.
I haven’t quite taken advantage of this yet (just moved to 2017 but still in 130 compatibility mode) but fingers crossed that I’ll see those improvements.
The estimated costs for the two queries are very close to each other: 74.6842 and 74.6839 optimizer units. However, we saw earlier that the tipping point for an adaptive join on this query can vary between 22680 and 80388.3 rows. This inconsistency means that we can find a query that performs worse with adaptive joins enabled.
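A small sketch shows why a moving tipping point matters. It assumes the usual adaptive join behavior, where the operator uses a hash join once the build input reaches the threshold and nested loops below it. The function name and the sample row counts are hypothetical; the two thresholds are the figures quoted above.

```python
def adaptive_join_choice(build_rows, threshold_rows):
    """Pick the join type an adaptive join would run for a given tipping point."""
    return "hash join" if build_rows >= threshold_rows else "nested loops"


LOW_THRESHOLD = 22680     # lowest tipping point quoted above
HIGH_THRESHOLD = 80388.3  # highest tipping point quoted above

for rows in (10_000, 50_000, 100_000):
    low = adaptive_join_choice(rows, LOW_THRESHOLD)
    high = adaptive_join_choice(rows, HIGH_THRESHOLD)
    print(rows, low, high)
```

Any row count between the two thresholds gets a different join type depending on which tipping point the plan ends up with, so at least one of those plans is using the slower strategy.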
Click through to see the queries Joe is using. Based on this, I’d guess that this is probably a knife-edge problem: most of the time, adaptive join processing is better, but if you hit the wrong query, it’s worse.