So 23 milliseconds for the batch version and 850 milliseconds for RBAR. What a difference.
Now in this case the code for the RBAR is also a lot more complicated. But that isn’t always the case. It also isn’t always the case that RBAR is slower. But it’s almost always a lot slower than batch.
So, while the code for RBAR is often easier to write, even though it might be physically longer, it’s probably going to be slower too.
Well-written, set-based solutions aren’t always guaranteed to be faster, but that’s one of the safest bets to make with T-SQL.
Start writing queries that demonstrate TSQL anti patterns – and make them slow
You know how people say that the best way to learn something is to teach it?
The best way to learn to speed up queries is to write slow ones.
The best way to get a job speeding up queries is to write a blog about the queries you’ve sped up.
This is a long-term learning process, but is absolutely a worthwhile skill for any database professional.
When I work with SQL Server batch-controlled workflows, I use the theory “feed the CPUs”. That’s the simplest positive adaptation I could come up with of Kevin Closson’s paradigm “Everything is a CPU problem” 🙂
What I mean by “Feed the CPUs” is that memory and disk response times are primary factors determining the maximum rate for the CPUs to process the data. Nuts & bolts of such a model for SQL Server are slightly different than a similar model for Oracle. SQL Server access to persistent data is always through database cache, while Oracle uses shared access to database cache in SGA and private access to persistent data through direct access in PGA.
Click through for more details.
LOB values are especially problematic when a trace captures the RPC completed event of a TVP query. Tracing uses memory from the OBJECTSTORE_LBSS memory pool to build trace records that contain TVP LOB values. From my observations of the sys.dm_os_memory_clerks DMV, each LOB cell of a TVP requires about 8K during tracing regardless of the actual value length. This memory adds up very quickly when many rows and lob columns are passed via a TVP with a trace running. For example, the 10,000 row TVP with 10 LOB columns used in the earlier test required over 800MB memory for a single trace record. Consider that a large number of TVP LOB cells and/or concurrent TVP queries can cause queries to fail with insufficient memory errors. In extreme cases, the entire instance can become unstable and even crash under due to tracing of TVP queries.
This is a must-read if you use TVPs in your environment.
For row store targets, it is important to note that the presence of a clustered index or any additional non-clustered indexes on the target table will disable the parallel INSERT behavior. For example, here is the query plan on the same table with an additional non-clustered index present. The same query takes 287 seconds without a TABLOCK hint and the execution plan is as follows
This post goes into detail on when you can expect parallelism in rowstore and columnstore insertions. I highly recommend reading it.
As can be seen in the logs, the localityWaitFactor value is 1, but the delay that this code causes grows linearly with the number of required containers. Since our DFSIO-large benchmark creates 1,024 files, each 1 GB in size, it requests 1,024 YARN containers. Therefore, the code has to miss at least 1,024 scheduling opportunities until it schedules containers on this (wrongly assumed) OFF_SWITCH node.
But why is this delay enforced? This idea falls into a big area of scheduling research. The Delay Scheduling algorithm was introduced by Matei Zaharia’s EuroSys ’10 paper titled “Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling”.
That post is a bit deeper than my Hadoop administration comfort level, but if you’re given the task of performance tuning a cluster, this might be one place to look.
It’s not enough to know that you have a slow query or queries. You need to know exactly how slow they are. You must measure. You need to know how long they take to run and you need to know how many resources are used while they run. You need to know these numbers in order to be able to determine if, after you do something to try to help the query, you’ll know whether or not you’ve improved performance. To measure the performance of queries, you have a number of choices. Each choice has positives and negatives associated with them. I’m going to run through my preferred mechanisms for measuring query performance and outline why. I’ll also list some of the other mechanisms you have available and tell you why I don’t like them. Let’s get started.
This is an intro-level blog post, so Grant doesn’t go into much detail, but he does provide some good links for getting started.
The cluster’s servers and SQL Server configurations were built to be as close to identical as possible to the previous instance (memory, cores, disk, maxdop, CTP, etc).
After the migration, I noticed that CPU utilization jumped from the normal 25% to a consistent 75%.
I did several other migrations with similar server loads with no issues, so I’m a bit puzzled as to what might be going on here. Could the upgrade from SQL Server 2008 R2 to SQL Server 2012 simply be exposing bad queries that 2008 was handling differently?
Kendra goes through a number of reasons, building a troubleshooting guide in the process. This is a great read.
The query runs faster, make no mistake – but check out the estimates:
- Estimated number of rows = 1
- Actual number of rows = 165,367
Those estimates are built by SQL Server’s cardinality estimator (CE), and there have been major changes to it over the last couple of versions. You can control which CE you’re using by changing the database’s compatibility level. This particular StackOverflow database is running in 2016 compat mode – so what happens if we switch it back to 2012 compat mode?
Based on this result, there might be further optimizations available. Read on for more of Brent’s thoughts.
Well obviously something somewhere is different. Start by comparing everything on both servers and both databases down to… hang on, here, we’ll write a PowerShell script and then….
Wait, wait, wait!
You have the execution plans? Before we start digging through all the properties everywhere and comparing everything to everything, updating statistics 14 times, and all the rest, what if we look at the execution plans. They’re different, so we should start looking at scans & indexes & statistics &….
Sometimes, it’s the little things that matter.