Press "Enter" to skip to content

Category: Performance Tuning

Caching versus Persisting in Spark

The Hadoop in Real World team explains a subtle difference:

cache() and persist() functions are used to cache intermediate results of a RDD or DataFrame or Dataset. You can mark an RDD, DataFrame or Dataset to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, the objects behind the RDD, DataFrame or Dataset on which cache() or persist() is called will be kept in memory or on the configured storage level on the nodes. 

That’s the similarity, but click through for the difference.

Comments closed

Spark Performance in Azure Synapse Analytics

Euan Garden shares some numbers around Apache Spark performance in Azure Synapse Analytics:

To compare the performance, we derived queries from TPC-DS with 1TB scale and ran them on 8 nodes Azure E8V3 cluster (15 executors – 28g memory, 4 cores). Even though our version running inside Azure Synapse today is a derivative of Apache Spark™ 2.4.4, we compared it with the latest open-source release of Apache Spark™ 3.0.1 and saw Azure Synapse was 2x faster in total runtime for the Test-DS comparison.

Click through for several techniques the Azure Synapse Analytics team has implemented to make some significant performance improvements. It’s still slower than Databricks, but considerably faster than the open-source Apache Spark baseline.

Comments closed

Optimizing Power BI Data Load from a Folder of Parquet Files

Chris Webb has a tip for us:

In all the testing I’ve done recently with importing data from Parquet files into Power BI I noticed something strange: loading data from a folder containing multiple Parquet files seemed a lot slower than I would expect, based on the time taken to load data from a single file. So I wondered – is there something that can be optimised? It turns out there is and in this blog post I’ll show you what I did.

Click through to see how Chris cut load time down to approximately half what it was.

Comments closed

Cost versus Performance Optimization for SQL Server on VMs in Azure

Pam Lahoud takes a look at multi-constraint optimization:

So how do you get the best price-performance possible when configuring your SQL Server on Azure VM? In this blog, we’re going to cover three key aspects to right-sizing (and right-configuring) your Azure VM for SQL Server that are based on some common pitfalls customers face when migrating their on-premises workloads to Azure VM:

– Choosing the best VM series and size for your workload
– Configuring storage for maximum throughput and lower cost
– Leveraging unique to Azure features such as host caching to boost performance at no additional cost

One key point of the article is that there are several factors which can make a big difference in price and performance, but which you might not think about on-premises. It’s definitely worth taking the time to research this. It’s also a great example of how administrators are still important in a cloud-based world—having an admin who understands these settings and can get the most out of a given server can save a lot of money very quickly.

Comments closed

Parquet File Performance in Power Query

Chris Webb troubleshoots a performance issue:

There has been a lot of excitement around the newly-added support for reading from Parquet files in Power BI. However I have to admit that I was disappointed not to see any big improvements in performance when reading data from Parquet compared to reading data from CSV (for example, see here) when I first started testing it. So, is Power Query able to take advantage of Parquet’s columnar storage when reading data?

The answer is yes, but you may need to make some changes to your Power Query queries to ensure you get the best possible performance. Using the same data that I have been using in my recent series of posts on importing data from ADLSgen2, I took a single 10.1MB Parquet file and downloaded it to my PC.

It seem like an area of future growth for Power Query, but Chris does show how to eke out some gains right now.

Comments closed

Fun with Disabling Joins

David Alcock gives us a “wouldn’t it be neat to see this?” scenario:

If you’ve ever seen my session on Guillotines and the Query Optimiser I demonstrated how we can use query hints to switch off certain transformation rules in a query. Transformation rules use a pattern substitution which essentially swaps one expression for another. In SQL Server terms the best example of this is a join as although we use the logical expressions such as left join or inner join the physical join type used in the execution plan operator is substituted to be something like a Hash Match or Nested Loop Join.

This means we can force the behaviour of an execution plan by disabling certain rules. Now I can’t think of any practical reason for this and remember we can also use join hints in our queries but this is different as it affects the entire optimisation process for a query and not just one join so please follow in a sandbox environment.

In case you haven’t seen David’s session, it’s available on YouTube.

Comments closed

Hash Distributions, Shuffling, and Data Types

Reiss McSporran explains an issue in Azure Synapse Analytics dedicated SQL pools:

Oh dear. Shuffle. On the surface, shuffle is not necessarily a bad thing and may even be expected, but not in this scenario. So what is shuffle and why can it be bad?

Shuffle occurs when a part of a distributed table is moved to a different node during query execution. To do this a hash value is computed using the join columns, the node is then found that has that hash value and the row is then sent to that node for processing. If the tables you’re joining on have different hash distributions, you’d expect this to happen, (hence why it’s not always a bad thing!), but in this case I’ve used the same value to hash both of my tables, so what’s gone wrong?

Click through to see what’s gone wrong.

Comments closed

Memory Grant Feedback in SQL Server

Deepthi Goguri hits on one part of Intelligent Query Processing in SQL Server:

In this part, let us focus on the Memory Grant Feedback feature released in SQL Server 2017 for Batch mode execution and in SQL Server 2019 for Row mode execution.

Memory Grants are used by the SQL Server for the Hashes and Sort operations. SQL Server optimizer uses the statistics information and allocate the memory needed by the query before the query executes. When the query is executed, SQL Server uses the allocated memory to process the query for the hashes and sorts. If this memory grant is not enough to process the query, data will use tempdb spilling to disk. When too much memory is allocated based up on the estimates, we can effect the concurrency as all other queries requires memory grants to process the queries as well. Bad estimates can effect the memory grants allocated to the queries. Too much or too little memory grants is bad.

Read on to see how Memory Grant Feedback helps the optimizer out with queries over time.

Comments closed

Documenting dm_db_missing_index_group_stats_query

Erik Darling does a good deed:

When I was checking out early builds of SQL Server 2019, I noticed a new DMV called dm_db_missing_index_group_stats_query, that I thought was pretty cool.

It helped you tie missing index requests to the queries that requested them. Previously, that took a whole lot of heroic effort, or luck.

This has long been a pain point for index-based query tuning. You can guess what the types of queries can look like based on the suggested columns, but one risk of that guessing is that index order matters but the columns get returned in the order in which they exist on the table, not necessarily the order in which they would be most useful.

Comments closed

The Joy of Parameter Sniffing

Erik Darling points out that parameter sniffing is generally a good thing:

To some degree, I get it. You’re afraid of incurring some new performance problem.

You’ve had the same mediocre performance for years, and you don’t wanna make something worse.

The thing is, you could be making things a lot better most of the time.

We should specifically talk about parameter sniffing problems rather than parameter sniffing as a problem. These sorts of problems are closer to the exception than the rule.

Comments closed