This table has a lot of columns, obviously. Now envision an interface where employees at World Wide Importers could perform searches for specific customers. They may want to perform different searches based on unique combinations of criteria, including (and this is not an exhaustive list by any means, but enough for my purposes today):
- CustomerID (e.g. pick the customer from a list);
- actual “bill to” CustomerID;
- buying group;
- when the account was opened;
- city; or,
- whether the account is on credit hold.
And any combination of the above. I know you’ve seen and probably built interfaces like this before, but just as a quick mockup, the employees would have a screen something like this:
There are plenty of ways to solve this problem, and Aaron shows a couple methods (including one which has major problems).
Your DataFrame is skewed if most of its rows are located on a small number of partitions, while the majority of the partitions remain empty. You really should avoid such a situation. Why? This makes your application virtually not parallel – most of the time you will be waiting for a single task to finish. Even worse, in some cases you can run out of memory on some executors or cause an excessive spill of data to a disk. All of this can happen if your data is not evenly distributed.
To deal with the skew, you can repartition your data using distribute by. For the expression to partition by, choose something that you know will evenly distribute the data. You can even use the primary key of the DataFrame!
It’s interesting to see how cluster by, distribute by, and sort by can have such different performance consequences.
I’ve made a few changes to the XQueryPlanPath project. The project parses query plans into xml and then using xpath to find the value of one or more nodes. This could then be used in testing to verify that any changes made to a query would retain a query plan that is considered optimal, and then if any changes break the test you can verify if the change causes sub-optimal effect on your query.
There was however one issue – query plans are like opinions; every SQL Server Instance has one, and none of them think that theirs stinks. So running a test on a dev box will potentially produce a different query plan from that on the build server, to that of production etc. This broadly because of 3 reasons:
Check it out, especially if your XML parsing skills aren’t top-notch.
This is one of a pair of query hints that can manage the memory grants of individual queries. The maximum memory grant hint allows you to limit the amount of memory an individual query can use. If the hint specifies a value lower than the amount of memory required to execute the query, the query will request the required amount. This is like a granular resource governor. For the rest of this post I will concentrate on the minimum memory grant query hint.
When the query optimiser builds an execution plan for a query, it calculates how much memory is required for the query to execute. This calculation is based on the estimated number of rows (used in sort or hash joins) and the average size of these rows. The optimiser calculates the least amount of memory required to perform the operation (required memory) and the amount of memory required to perform the whole operation in memory (desired memory). Each time the query is executed it requests a memory grant from the server. The query will perform much better if it’s granted its desired memory or more. If at run time the query has to perform sorts or hash operations on more than the estimated number of rows, it will spill to tempdb. Spilling to tempdb can be orders of magnitude worse for performance as your data has to be written to disk causing more IO.
This feels like the type of thing you might want to do once for a particularly recalcitrant query, but not something you want to do regularly.
I’ve seen multiple people state that a heap can be better than a clustered index for certain scenarios. I cannot disagree with that. One of the interesting reasons I’ve seen stated, though, is that a RID Lookup is faster than a Key Lookup. I’m a big fan of clustered indexes and not a huge fan of heaps, so I felt this needed some testing.
So, let’s test it!
I thought it would be good to create a database with two tables, identical except that one had a clustered primary key, and the other had a non-clustered primary key. I would time loading some rows into the table, updating a bunch of rows in a loop, and selecting from an index (forcing either a Key or RID Lookup).
It looks like RID lookups are slightly faster than key lookups. But check out the comments: this is a best-case scenario.
This explanation involves a similar problem but not involving SQL Server. Imagine you need to acquire twelve 4′ x 8′ plywood sheets from your local home improvement store.
You could drive to and from the store twelve times, and each time you need to go into the store, purchase the sheet, and wait for a staff member to become available to load the sheet into your pickup truck, then drive home and unload the sheet.
Or you could drive to the store once and purchase all twelve sheets in one go, with maybe four staff members making three trips each out to your pickup, carrying one sheet each time. Or even just one staff member making twelve trips out to your pickup.
Which method is more efficient? Multiple trips to the store or one trip to the store, no matter how many staff members are available to carry the sheets out?
On the SQL Server side, the APPLY operator is a method for bridging set-based operations with procedural thought patterns. It’s not the perfect answer for everything, but there are some fantastic use cases (like simplifying calculations via chained APPLY operators).
So now that 1). we have our table and b). we have an index we can use, we can run the developer’s query and be SARGable right?
([Test_Date]) = @
Nope! Table scan, ignores our Index and reads all 15M (too lazy for all the zeros) for a measely 127,782 rows! It’s not the slowest, taking around 3.960 seconds but still, we want SARGable!!!
Watch for the surprise twist at the end.
His approach was to use a cursor to cycle through all the columns in the provided table, analyze each column, determine the new data type, and store the information in a table variable. After the cursor was completed, the data in the table variable was written to a permanent table for the next process to use.
This approach isn’t necessarily bad. If you are only running it infrequently and you needed to write this stored procedure quickly, then it’s fine. But if this type of stored procedure needs to be run frequently, then it should be rewritten.
Set-based code tends to be easier to read and more compact than cursors, so even without the performance improvements they bring, there are benefits.
This code is coming from an ORM, which is parameterizing the filters, but not the (unnecessary & arbitrary) TOP value. The DMVs all viewed these are separate queries, so it was not aggregating the stats. It also wasn’t reusing the plan (thus chewing up even more CPU from frequent compiles). If the TOP had not been there, or it had been passed as a parameter, my initial query of sys.dm_exec_query_stats should have found it.
There are a couple of issues Andy works through, and his advice is good: just because something runs quickly doesn’t mean it can’t (in aggregate) have a negative effect on your server.
That’s a query against the XML stored in the ProductModel table. The view was created to mask the complexity of the necessary XPath code, while providing a mechanism for retrieving the data from the XML. This is a common use of views. However, when we then treat the view as a table, and join it to other tables and views, we present a problem for the optimizer. Because a view is not a table, but is instead a query, the optimizer has to resolve this query in combination with any other views or tables to arrive at an execution plan for the whole combined mess. While the optimizer is very good at what it does, because of the complexity caused by the additional unnecessary processing to figure out which parts of the view is not needed to satisfy the query, it can make poor choices. That can result in poor performance.
“Poor performance” can be an understatement.