Press "Enter" to skip to content

Category: Performance Tuning

Improving Dataset Refresh with Query Folding and the Dataflows Connector

Chris Webb has a performance tip for us:

You may have noticed that a new dataflows connector was announced in the August 2021 release of Power BI Desktop, and that it now supports query folding between a dataset and a dataflow – which you may be surprised to learn was not possible before. In this post I thought I’d take a look at how much of an improvement in performance this can make to dataset refresh performance.

Click through for the demonstration.

Comments closed

The Cost of Measures in Power BI Live Connection Reports

Chris Webb explains the cost side of the ledger when it comes to measure creation:

You probably know that it’s a best practice to build your Power BI datasets in a separate .pbix file from your reports – among other things it means that different people can develop the dataset and reports. You may also know that if you are building a report in Power BI Desktop with a Live connection to a published dataset or Azure Analysis Services you can define your own measures inside the report. While this is very convenient, if you create too many measures there’s a price to pay in terms of query performance.

Click through for a demonstration of this.

Comments closed

Performance Tips when Working with Large Datasets in R

Mira Celine Klein continues a series on performance tuning R code:

Whether your dataset is “large” not only depends on the number of rows, but also on the method you are going to use. It’s easy to compute the mean or sum of as many as 10,000 numbers, but a nonlinear regression with many variables can already take some time with a sample size of 1,000.

Sometimes it may help to parallelize (see part 3 of the series). But with large datasets, you can use parallelization only up to the point where working memory becomes the limiting factor. In addition, there may be tasks that cannot be parallelized at all. In these cases, the strategies from part 2 of this series may be helpful, and there are some more ways:

Click through for four options.

Comments closed

Caching Function Results in an R Package

Maelle Salmon and Cristophe Dervieux show us ways to cache results of function calls using R:

Caching means that if you call a function several times with the exact same input, the function is only actually run the first time. The result is stored in a cache of some sort (more practical details later!). Every other time the function is called with the same input, the result is retrieved from the cache unless invalidated. You will often think of caching as something valid in only one R session, but we’ll see it can be persistent across sessions via storage on disk.

As a quick note, this makes sense when writing functions, which are expressions without side effects. If you have side effects, caching might not give you what you expect.

Comments closed

Compilations per Second in SQL Server

Fabiano Amorim clarifies a metric’s definition:

As you can see, the number of SQL Compilations/Sec is very high. It’s important to step back and remember the general description and guideline for this counter and understand what I mean by “high”:

Official Description: “Number of SQL compilations per second. Indicates the number of times the compile code path is entered.”

Read on for a dive into ad hoc SQL statements parameterization, how an instance can have a high compilations/sec value relative to batch requests/sec, and how that can affect performance in the long run.

Comments closed

Tips for Decreasing the Impact of Rebalancing in Kafka Streams

Vasyl Sarzhynskyi has some techniques to make rebalancing in Kafka less of a big deal:

Kafka Rebalance happens when a new consumer is either added (joined) into the consumer group or removed (left). It becomes dramatic during application service deployment rollout, as multiple instances restarted at the same time, and rebalance latency significantly increasing. During rebalance, consumers stop processing messages for some period of time, and, as a result, processing of events from a topic happens with some delay. Some business cases could tolerate rebalancing, meanwhile, others require real-time event processing and it’s painful to have delays in more than a few seconds. Here we will try to figure out how to decrease rebalance for Kafka-Streams clients (even though some tips will be useful for other Kafka consumer clients as well).

Read on for an example of the problem, as well as several practical tips for mitigating the issue.

Comments closed

Minimizing PAGELATCH_EX Waits in SQL Server 2019

Aaron Bertrand has some advice for us:

There is a common problem in SQL Server where contention when writing to a single page can lead to excessive waiting. This waiting comes in the form of the wait type PAGELATCH_EX, and often happens when the clustering key is a monotonically increasing value – like an IDENTITY or date/time column. Pedro Lopes has highlighted some non-trivial workarounds in his post, “PAGELATCH_EX waits and heavy inserts,” and there is some official guidance in the Microsoft Docs article, “Resolve last-page insert PAGELATCH_EX contention in SQL Server.” But are there any ways to address this issue without intrusive changes?

Read on to see what Aaron has in mind.

Comments closed

Performance Issues with AT TIME ZONE

Jonathan Kehayias lays out a warning:

This is not the type of blog post that I enjoy writing. First let me start off by saying this is not a bash against Microsoft, the SQL Server team, or anything other than an informative post to let people know that there is a potential performance limitation in a very useful feature of SQL Server 2016 and higher. The AT TIME ZONE syntax was added in SQL Server 2016 to handle changing datetime values from one time zone to the offset of a different time zone using string names for the time zone. This is a great feature that simplifies converting datetime values but there is an unfortunate draw back to the implementation; it relies on the time zones that are stored in the Windows Registry and therefore has to make calls out to Windows which unfortunately occurs row-by-row for large result sets when used inside of a query.

Read on for more detail, as well as the way that Jonathan fixed the performance issue.

Comments closed

The Benefits of a Cache Layer

Monica Rathbun espouses upon the utility of Azure Cache:

One of the biggest impacts on resource consumption for Azure SQL DB are repeated data pulls by the application layer. No matter how fast those queries execute calling the same procedure or issuing the same SQL statements hundreds, thousands, or million times a day can wreak havoc on database performance. Death by a thousand cuts can easily bring a system to its knees. Sometimes it’s hard for DBAs to troubleshoot these actively as the execution of the statements happens so quickly they don’t even show in tools like sp_whoisactive. It’s not until you begin to dive into things like Query Performance Insights or Query Store that you start to see the real issue.

Check it out. And if you want to get into implementation, I’ve found the Cache-Aside design pattern to be useful. Bowen Li has a rundown of several caching patterns as well.

Comments closed

High-Performance ETL via Buffer Table

Daniel Hutmacher needs things to zoom:

It’s almost like a myth – one that I’ve heard people talk about, but never actually seen myself. The “shock absorber” is a pretty clever data flow design pattern to ingest data where a regular ETL process would choke on the throughput or spikes. The idea is to use a buffer table to capture incoming data, and then run an asynchronous process that loads that data in batches from the buffer into its intended target table.

While I’ve seen whitepapers and blog posts mention the concept loosely along with claims of “7x or 10x performance”, none of them go into technical detail on how it’s done, so I decided to try my hand at it.

I’ve compiled my findings, along with some pre-baked framework code if you want to try building something yourself. Professional driver on closed roads. It’s gonna get pretty technical.

Combine that with Eitan Blumin’s post yesterday and you’d think it were buffer week.

This shock absorber pattern works well for warehouse loading, especially when you’re trickle-loading data into columnstore indexes and don’t want to have open rowgroups slowing everything down.

Comments closed