Last month, I covered a puzzle involving matching each row from one table with the closest match from another table. I got this puzzle from Karen Ly, a Jr. Fixed Income Analyst at RBC. I covered two main relational solutions that combined the APPLY operator with TOP-based subqueries. Solution 1 always had quadratic scaling. Solution 2 did quite well when provided with good supporting indexes, but without those indexes also had quadric scaling. In this article I cover iterative solutions, which despite being generally frowned upon by SQL pros, do provide much better scaling in our case even without optimal indexing.
Itzik has three separate solutions here, including one using the CLR.
The gist of this kit is that it is a database repository as part of the sp_BlitzFirst to collect monitoring alerting and performance metric data. Once you’ve set this up, then you can use a Power BI desktop dashboard as an interface for all that data.
Now this is an awesome way to introduce more DBAs to Power BI and it’s a great way to get more out of your metrics data. The challenge is, it’s a lot of data to be performing complete refreshes on and the natural life of a database like this is growth. The refresh on the static database I was sent by Tracy, once I connected my PBIX to the local db sources, took upwards of an hour to refresh. Keep in mind, I have 16G of memory, 32G of swap and have upped my data load options in Power BI quite high.
Kellyn walks through the things she did to improve performance as a starting point, so check it out and be aware that there’s even more that can be done.
This sort of difference, scalar oriented
C++being so much faster than scalar oriented
R, is often distorted into “
This is just not the case. If we adapt the algorithm to be vectorized we get an
Ralgorithm with performance comparable to the
Not all algorithms can be vectorized, but this one can, and in an incredibly simple way. The original algorithm itself (
xlin_fits_R()) is a bit complicated, but the vectorized version (
xlin_fits_V()) is literally derived from the earlier one by crossing out the indices. That is: in this case we can move from working over very many scalars (slow in
R) to working over a small number of vectors (fast in
This is akin to writing set-based SQL instead of cursor-based SQL: you’re thinking in terms which make it easier for the interpreter (or optimizer, in the case of a database engine) to operate quickly over your inputs. It’s also one of a few reasons why I think learning R makes a lot of sense when you have a SQL background.
We expected the Semi Join to turn into an Anti Semi Join, but the plan now also contains a Nested Loop branch with a Row Count Spool – what’s that about? Turns out the Row Count Spool, along with its index seek, has to do with the NOT IN() and the fact that we’re looking at a nullable column. Remember that…
x NOT IN (y, z, NULL)
… always returns false, because the NULL value could represent essentially anything, including the x. And so it is with the inner table, if there happens to be a NULL value among those rows.
Definitely worth the read. There are ways around this performance hit but design decisions have consequences.
I was really curious about the RETURNS NULL ON NULL INPUT function option so I decided to do some testing. I was very surprised to find out that it’s actually a form of scalar UDF optimization that has been in the product since at least SQL Server 2008 R2.
It turns out that if you know that a scalar UDF will always return a NULL result when a NULL input is provided then the UDF should ALWAYS be created with the RETURNS NULL ON NULL INPUT option, because then SQL Server doesn’t even run the function definition at all for any rows where the input is NULL – short-circuiting it in effect and avoiding the wasted execution of the function body.
The more often you pass in NULL to that function, the better your performance will be relative to the default case.
The performance benchmarks on CDH 6.0 show that enabling Parquet vectorization significantly improves performance for a typical ETL workload. In the test workload (TPC-DS), enabling parquet vectorization gave 26.5% performance improvement on average (geomean value of runtime for all the queries). Vectorization achieves these performance improvements by reducing the number of virtual function calls and leveraging the SIMD instructions on modern processors. A query is vectorized in Hive when certain conditions like supported column data-types and expressions are satisfied. However, if the query cannot be vectorized its execution falls back to a non-vectorized execution. Overall, for workloads which use the Parquet file format on most modern processors, enabling Parquet vectorization can lead to better query performance in CDH 6.0 and beyond.
This is worth looking into, especially if you are on the Cloudera stack.
There are workloads where frequent page splits are a problem. I thought I had a system like this many years ago, so I tested various fill factor settings for the culprit table’s clustered index. While insert performance improved by lowering the fill factor, read performance drastically got worse. Read performance was deemed much more critical than write performance on this system. I abandoned that change and instead recommended a table design change since it made sense for that particular table.
Click through for a demo.
The question is now how to perform a coarse estimation of the equijoin cardinality of the highlighted steps, using the information available.
The original cardinality estimator would have performed a fine-grained step-by-step histogram alignment using linear interpolation, assessed the join contribution of each step (much as we did for the minimum step value before), and summed each step contribution to acquire a full join estimate. While this procedure makes a lot of intuitive sense, practical experience was that this fine-grained approach added computational overhead and could produce results of variable quality.
The original estimator had another way to estimate join cardinality when histogram information was either not available, or heuristically assessed to be inferior. This is known as a frequency-based estimation[…]
It’s an interesting dive into one of the changes in 2014’s CE. The post is a little math-heavy but Paul does a great job keeping it interesting.
Data Serialization – Serialization plays an important role in increasing the performance of any application. Spark provides two serialization libraries –
Java Serialization: By default, spark uses Java’s ObjectOutputStream framework which can work with any class that implements java.io.serializable. This serialization is flexible but slow and creates large serialized formats for many classes.
Kryo Serialization: Spark can use Kryo library to serialize objects. It is much faster and compact but does not support all serializable types. So we must register those classes which we want to be serialized. Therefore, Kryo uses indices instead of full class names to identify data types which reduce the size of the serialized data thereby increasing performance. We can initialize our spark conf by setting the value of the property spark.serializer to org.apache.spark.serializer.KryoSerializer. This serializer has a major impact on performance when we are shuffling or caching a large amount of data. To know more about this serializer, refer Kryo documentation
There are some good tips in here.
Index the predicates in JOIN, WHERE, ORDER BY and GROUP BY clauses
What about the HAVING clause? Does the column order matter? Should we put a single column or multi-column index? INCLUDE statements? What kind of index, clustered, non-clustered, columnstore, XML, spatial? This piece of the advice is benign but so non-specific it’s almost useless. Let me summarize: Indexes can be good.
Do not use sp_* naming convention
So, this one is true because it will add a VERY small amount of overhead as SQL Server searches the master database first for your object. However, for most of us, most of the time, this is so far down the list of worries about our database as to effectively vanish from sight.
There’s a pretty long list of things here, most of which Grant considers either incomplete, irrelevant, or sometimes flat-out wrong.