Performance Tuning – Page 23

Whether your dataset is “large” not only depends on the number of rows, but also on the method you are going to use. It’s easy to compute the mean or sum of as many as 10,000 numbers, but a nonlinear regression with many variables can already take some time with a sample size of 1,000.
Sometimes it may help to parallelize (see part 3 of the series). But with large datasets, you can use parallelization only up to the point where working memory becomes the limiting factor. In addition, there may be tasks that cannot be parallelized at all. In these cases, the strategies from part 2 of this series may be helpful, and there are some more ways:

Click through for four options.

Comments closed

Caching Function Results in an R Package

Published 2021-08-03 by Kevin Feasel

Maelle Salmon and Cristophe Dervieux show us ways to cache results of function calls using R:

Caching means that if you call a function several times with the exact same input, the function is only actually run the first time. The result is stored in a cache of some sort (more practical details later!). Every other time the function is called with the same input, the result is retrieved from the cache unless invalidated. You will often think of caching as something valid in only one R session, but we’ll see it can be persistent across sessions via storage on disk.

As a quick note, this makes sense when writing functions, which are expressions without side effects. If you have side effects, caching might not give you what you expect.

Comments closed

Compilations per Second in SQL Server

Published 2021-07-30 by Kevin Feasel

Fabiano Amorim clarifies a metric’s definition:

As you can see, the number of SQL Compilations/Sec is very high. It’s important to step back and remember the general description and guideline for this counter and understand what I mean by “high”:
Official Description: “Number of SQL compilations per second. Indicates the number of times the compile code path is entered.”

Read on for a dive into ad hoc SQL statements parameterization, how an instance can have a high compilations/sec value relative to batch requests/sec, and how that can affect performance in the long run.

Comments closed

Tips for Decreasing the Impact of Rebalancing in Kafka Streams

Published 2021-07-28 by Kevin Feasel

Vasyl Sarzhynskyi has some techniques to make rebalancing in Kafka less of a big deal:

Kafka Rebalance happens when a new consumer is either added (joined) into the consumer group or removed (left). It becomes dramatic during application service deployment rollout, as multiple instances restarted at the same time, and rebalance latency significantly increasing. During rebalance, consumers stop processing messages for some period of time, and, as a result, processing of events from a topic happens with some delay. Some business cases could tolerate rebalancing, meanwhile, others require real-time event processing and it’s painful to have delays in more than a few seconds. Here we will try to figure out how to decrease rebalance for Kafka-Streams clients (even though some tips will be useful for other Kafka consumer clients as well).

Read on for an example of the problem, as well as several practical tips for mitigating the issue.

Comments closed

Minimizing PAGELATCH_EX Waits in SQL Server 2019

Published 2021-07-27 by Kevin Feasel

Aaron Bertrand has some advice for us:

There is a common problem in SQL Server where contention when writing to a single page can lead to excessive waiting. This waiting comes in the form of the wait type PAGELATCH_EX, and often happens when the clustering key is a monotonically increasing value – like an IDENTITY or date/time column. Pedro Lopes has highlighted some non-trivial workarounds in his post, “PAGELATCH_EX waits and heavy inserts,” and there is some official guidance in the Microsoft Docs article, “Resolve last-page insert PAGELATCH_EX contention in SQL Server.” But are there any ways to address this issue without intrusive changes?

Read on to see what Aaron has in mind.

Comments closed

Performance Issues with AT TIME ZONE

Published 2021-07-23 by Kevin Feasel

Jonathan Kehayias lays out a warning:

This is not the type of blog post that I enjoy writing. First let me start off by saying this is not a bash against Microsoft, the SQL Server team, or anything other than an informative post to let people know that there is a potential performance limitation in a very useful feature of SQL Server 2016 and higher. The AT TIME ZONE syntax was added in SQL Server 2016 to handle changing datetime values from one time zone to the offset of a different time zone using string names for the time zone. This is a great feature that simplifies converting datetime values but there is an unfortunate draw back to the implementation; it relies on the time zones that are stored in the Windows Registry and therefore has to make calls out to Windows which unfortunately occurs row-by-row for large result sets when used inside of a query.

Read on for more detail, as well as the way that Jonathan fixed the performance issue.

Comments closed

The Benefits of a Cache Layer

Published 2021-07-15 by Kevin Feasel

Monica Rathbun espouses upon the utility of Azure Cache:

One of the biggest impacts on resource consumption for Azure SQL DB are repeated data pulls by the application layer. No matter how fast those queries execute calling the same procedure or issuing the same SQL statements hundreds, thousands, or million times a day can wreak havoc on database performance. Death by a thousand cuts can easily bring a system to its knees. Sometimes it’s hard for DBAs to troubleshoot these actively as the execution of the statements happens so quickly they don’t even show in tools like sp_whoisactive. It’s not until you begin to dive into things like Query Performance Insights or Query Store that you start to see the real issue.

Check it out. And if you want to get into implementation, I’ve found the Cache-Aside design pattern to be useful. Bowen Li has a rundown of several caching patterns as well.

Comments closed

High-Performance ETL via Buffer Table

Published 2021-07-15 by Kevin Feasel

Daniel Hutmacher needs things to zoom:

It’s almost like a myth – one that I’ve heard people talk about, but never actually seen myself. The “shock absorber” is a pretty clever data flow design pattern to ingest data where a regular ETL process would choke on the throughput or spikes. The idea is to use a buffer table to capture incoming data, and then run an asynchronous process that loads that data in batches from the buffer into its intended target table.
While I’ve seen whitepapers and blog posts mention the concept loosely along with claims of “7x or 10x performance”, none of them go into technical detail on how it’s done, so I decided to try my hand at it.
I’ve compiled my findings, along with some pre-baked framework code if you want to try building something yourself. Professional driver on closed roads. It’s gonna get pretty technical.

Combine that with Eitan Blumin’s post yesterday and you’d think it were buffer week.

This shock absorber pattern works well for warehouse loading, especially when you’re trickle-loading data into columnstore indexes and don’t want to have open rowgroups slowing everything down.

Comments closed

Buffering Events in SQL Server

Published 2021-07-14 by Kevin Feasel

Eitan Blumin has a technique to reduce expensive upserts:

Do you find yourself facing performance problems and long lock chains caused by very frequent INSERT, UPDATE, or DELETE statements being executed on a table? Check out this neat trick that could help you out and make all the difference in the world.
Okay, I admit that title ended up being quite long. But I wanted something that could be easily found in search engines by people facing similar problems.

I’ve done something similar, though without the partition switch and instead deleting batches into a temp table. This is a good example of something I like to say about scalable processes in T-SQL: many times, the most scalable technique involves a mental pivot (and sometimes a literal pivot, such as using tally tables to work with string data) of the straightforward answer.

1 Comment

Identifying Backpressure in Apache Flink

Published 2021-07-08 by Kevin Feasel

Piotr Nowojski explains an important concept in streaming (and ELT/ETL) products:

The backpressure topic was tackled from different angles over the last couple of years. However, when it comes to identifying and analyzing sources of backpressure, things have changed quite a bit in the recent Flink releases (especially with new additions to metrics and the web UI in Flink 1.13). This post will try to clarify some of these changes and go into more detail about how to track down the source of backpressure, but first…

Read on for the full story, including a review of the concept and its importance.

Comments closed

Category: Performance Tuning

Performance Tips when Working with Large Datasets in R

Caching Function Results in an R Package

Compilations per Second in SQL Server

Tips for Decreasing the Impact of Rebalancing in Kafka Streams

Minimizing PAGELATCH_EX Waits in SQL Server 2019

Performance Issues with AT TIME ZONE

The Benefits of a Cache Layer

High-Performance ETL via Buffer Table

Buffering Events in SQL Server

Identifying Backpressure in Apache Flink