Category: Performance Tuning

Table Compaction in Apache Spark

Published 2025-02-27 by Kevin Feasel

If there's anything that data engineers agree about, it’s that table compaction is important. Often one of the first big lessons that folks will learn early on is that not compacting tables can present serious performance issues: you’ve gotten your lakehouse pilot approved and it’s been running for a couple months in production and you find that both reads and writes are getting slower and slower while your data volumes have not increased drastically. Guess what, you almost surely have a “small file problem”.

What engineers won’t always sing the same tune on is how and when to perform table compaction.

Read on for a dive into the power of compaction (converting a large number of small files into a small number of large files) and plenty of tips along the way.
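
To make the idea concrete, here is a minimal sketch of the planning half of compaction: grouping many small files into a handful of outputs near a target size. This is not how Spark or any particular table format does it; the function name `plan_compaction`, the 128 MB target, and the file sizes are all assumptions for illustration.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[float], target_mb: float = 128.0) -> List[List[float]]:
    """Greedily group files (largest first) into bins of at most target_mb.

    A file larger than target_mb ends up alone in its own bin.
    """
    bins: List[List[float]] = []
    current: List[float] = []
    current_size = 0.0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current bin if this file would push it past the target.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


# Hypothetical table: 100 files of 4 MB and 10 files of 16 MB (560 MB total).
small_files = [4.0] * 100 + [16.0] * 10
plan = plan_compaction(small_files, target_mb=128.0)
print(f"{len(small_files)} small files -> {len(plan)} compacted files")
```

A real compaction job also has to rewrite the data and commit the change safely, which is where the "how and when" debate comes in.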

Comments closed

SQL Performance Office Hours

Published 2025-02-21 by Kevin Feasel

Erik Darling answers critical questions:

Does Erik prefer nested loops or fruit loops?

If Erik was an operator in an execution plan, what operator would he be?

The answers may shock you. Especially because Erik actually takes the critical question of nested loops vs Froot Loops and turns it into a proper answer.

Comments closed

A List of PostgreSQL Parameters

Published 2025-02-21 by Kevin Feasel

Semab Tariq has a list:

Have you ever experienced your database slowing down as the amount of data increases? If so, one important factor to consider is tuning PostgreSQL parameters to match your specific workload and requirements.

PostgreSQL has many parameters because it is designed to be highly flexible and customizable to meet a wide range of use cases and workloads. Each parameter allows you to fine-tune different aspects of the database, such as memory management, query optimization, connection handling, and more. This flexibility helps database administrators to optimize performance based on hardware resources, workload requirements, and specific business needs.

In this blog, I will cover some of the important PostgreSQL parameters, explain their role, and provide recommended values to help you fine-tune your database for better performance and scalability.

Click through for those parameters, including descriptions, default values, and recommendations.

Comments closed

An Overview of PostgreSQL Performance Monitoring via pgNow

Published 2025-02-19 by Kevin Feasel

Grant Fritchey announces a product:

I’ve been putting together a new PostgreSQL session called “Performance Monitoring for the Absolute Beginner.” There are several ways to get an understanding of how well your queries are running in PostgreSQL, but, frankly, all of them are a bit of a pain to someone coming from the land of Extended Events (ah, my one true love). Because of this, I saw it as an opportunity to help those just getting going in PostgreSQL. I’ll be presenting it for the first time at Postgres Conference in Orlando on March 19, 2025. Come on by.

Anyhoo, wouldn’t it be nice to maybe have a shortcut, an easier way to look at this information?

Well, there is. Redgate has been working on a completely free tool for leveraging just this sort of data called pgNow. Go here to check it out yourself, but I’ll do a quick run through here.

Click through to see how it works.

Comments closed

Tips for Scaling Apache Kafka

Published 2025-02-11 by Kevin Feasel

Narendra Lakshmana Gowda tunes a Kafka cluster:

Apache Kafka is known for its ability to process a huge quantity of events in real time. However, to handle millions of events, we need to follow certain best practices while implementing both Kafka producer services and consumer services.

Before you start using Kafka in your projects, let’s understand when to use Kafka:

Much of the advice is pretty standard for performance tuning in Kafka, like setting batch size and linger time on the producer or managing consumers in a consumer group.
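
For a feel of how batch size and linger time interact, here is a toy model in Python. It is a deliberate simplification and not the Kafka client's actual logic: a batch closes when the next record would overflow it or when the linger window has passed since the batch opened. The function name and the numbers are assumptions; the real producer settings are `batch.size` and `linger.ms`, and the real producer batches per partition.

```python
from typing import List, Tuple


def simulate_batches(records: List[Tuple[int, int]], batch_size: int, linger_ms: int) -> List[List[int]]:
    """Group (arrival_ms, size_bytes) records, sorted by arrival, into batches.

    Returns each batch as a list of record sizes.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    opened_at = 0
    for arrival_ms, size in records:
        linger_expired = bool(current) and arrival_ms - opened_at >= linger_ms
        would_overflow = bool(current) and current_bytes + size > batch_size
        if linger_expired or would_overflow:
            batches.append(current)
            current, current_bytes = [], 0
        if not current:
            opened_at = arrival_ms  # a new batch opens with this record
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Hypothetical load: one 100-byte record every millisecond for 20 ms.
records = [(t, 100) for t in range(20)]
for linger in (0, 5, 50):
    count = len(simulate_batches(records, batch_size=1000, linger_ms=linger))
    print(f"linger_ms={linger}: {count} batches")
```

In this model, a longer linger means fewer, fuller batches at the cost of a little latency, until batch size becomes the limit.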

Comments closed

Improving Power Query CSV File Performance with Date Columns

Published 2025-02-11 by Kevin Feasel

Chris Webb makes things go faster:

A few weeks ago I replied to a question on reddit where someone was experiencing extremely slow performance when importing data from a CSV file using Power Query. The original poster worked out the cause of the problem and the solution themselves: they saw that removing all date columns from their query made their Power Query query much faster and that using the Date.FromText function and specifying the date format solved the problem. While I couldn’t reproduce the extreme slowness that was reported I was able to reproduce a performance difference between the two approaches and Curt Hagenlocher of the Power Query team confirmed that this was expected behaviour.

Read on for the example and explanation.

Comments closed

Preventing Skew in Teradata

Published 2025-02-05 by Kevin Feasel

Sudheer Kumar Lagisetty shares some performance tuning advice:

Teradata performance optimization and database tuning are crucial for modern enterprise data warehouses. Effective data distribution strategies and data placement mechanisms are key to maintaining fast query responses and system performance, especially when handling petabyte-scale data and real-time analytics.

Understanding data distribution mechanisms, workload management, and data warehouse management directly affects query optimization, system throughput, and database performance optimization. These database management techniques enable organizations to enhance their data processing capabilities and maintain competitive advantages in enterprise data analytics.

Click through for some tips around data distribution. This idea becomes important in an MPP architecture.

Comments closed

Speed Differences with Separating Data and Log Files

Published 2025-01-16 by Kevin Feasel

Brent Ozar performs a test:

I’ve already explained that no, it doesn’t make your database server more reliable – and in fact, it’s the exact opposite. But what about performance?

The answer is going to depend on your hardware and workload, but let’s work through an example. I’ll take the first lab workload from the Mastering Server Tuning class and set it up on an AWS i3en.2xlarge VM, which has 8 cores, 64GB RAM, and two 2.5TB NVMe SSDs. (This was one of the cheapest SQL-friendly VM types with two SSDs, but of course there are any number of ways you could run a test like this, including EBS volumes.)

I would expect cloud versus on-premises answers to be quite different, because cloud services tend to throttle you hard on how much storage throughput you’re allowed to have. For that reason, the results make perfect sense in AWS (or Azure or GCP for that matter), but unless your on-prem solution has hard throttles on IOPS or throughput because your sysadmins are monsters, the limits of performance would be in how hard you can push the drives or your storage controllers.

Ultimately, the most appropriate answer is to test your systems and not rely on expectations, especially if you’re shifting from on-premises to a cloud (or vice versa).

Comments closed

Temp Table Performance in PostgreSQL

Published 2025-01-15 by Kevin Feasel

Jobin Augustine shares a warning:

PostgreSQL is one of the most powerful database systems in the world. I have always been passionate about its great power, especially its modern SQL language features.

However, that doesn’t mean everything is great. There are areas where it hurts. Novice users unaware of the problem might get into trouble, and I see such things happen very often. This is a note for those who design their solutions around PostgreSQL using temporary tables.

This is very interesting to read, especially in comparison to SQL Server. It’s another example of the adage that, just because something works in a specific way on one platform, it’s not necessarily going to work the same way on another.

Comments closed

Optimizing Incremental Inventory Calculations in DAX

Published 2025-01-14 by Kevin Feasel

Marco Russo and Alberto Ferrari track account balances:

Computing an inventory level or an account balance at a given time is a common requirement for many reports. However, when the source data contains all the transactions since the initial zero state, the calculation requires a running sum from the beginning of the data history until the day considered. While easy to implement, a calculation like this can be extremely expensive depending on several factors: the number of cells to compute in the report, the data volume of the transactions, and the cardinality of the dimensions.

The usual approach to optimizing this type of calculation is to introduce a snapshot table that pre-calculates the value of each date for all the dimensions required. Because of the resulting data volume, this solution can be very expensive both in terms of processing time and in terms of resulting memory consumption. A tradeoff is to limit the cardinality of the time available for the snapshot, for example by creating a monthly or quarterly snapshot instead of a daily snapshot. However, this approach limits the analysis of inventory or balance amount trends, and it removes any detail below the snapshot cardinality.

This article shows how to implement a hybrid approach that minimizes the snapshot cost without losing analytical capabilities. This provides outstanding query performance for the reports.

To an extent, this replicates what accountants do with general ledger operations: after an accounting period ends, you summarize results. Then, you can start from the summarized end point instead of needing to read each journal entry. It also provides a convenient time to perform audits and ensure that the numbers are what they should be.
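
The ledger analogy can be sketched outside of DAX. The Python below is not Marco and Alberto's implementation, just a small illustration of the general idea under assumed data: keep period-end snapshots, then answer "what was the balance on day X?" by starting from the latest snapshot on or before X and adding only the transactions after it. All names and figures here are hypothetical.

```python
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional, Tuple

Transaction = Tuple[date, float]


def build_snapshots(transactions: List[Transaction], snapshot_dates: List[date]) -> Dict[date, float]:
    """Running balance as of the end of each snapshot date (inclusive)."""
    snapshots: Dict[date, float] = {}
    running = 0.0
    i = 0
    txs = sorted(transactions)
    for snap in sorted(snapshot_dates):
        while i < len(txs) and txs[i][0] <= snap:
            running += txs[i][1]
            i += 1
        snapshots[snap] = running
    return snapshots


def balance_on(day: date, transactions: List[Transaction], snapshots: Dict[date, float]) -> float:
    """Latest snapshot on or before `day`, plus the transactions after it."""
    snap_dates = sorted(snapshots)
    idx = bisect_right(snap_dates, day) - 1
    start: Optional[date] = snap_dates[idx] if idx >= 0 else None
    base = snapshots[start] if start is not None else 0.0
    # Toy version: scans every row; a real model would read only the delta.
    delta = sum(amt for d, amt in transactions if (start is None or d > start) and d <= day)
    return base + delta


transactions = [
    (date(2025, 1, 5), 100.0),
    (date(2025, 1, 20), -30.0),
    (date(2025, 2, 3), 50.0),
    (date(2025, 2, 10), -20.0),
    (date(2025, 3, 1), 10.0),
]
snapshots = build_snapshots(transactions, [date(2025, 1, 31), date(2025, 2, 28)])
print(balance_on(date(2025, 2, 5), transactions, snapshots))
```

The result matches a full running sum from the start of history, but the work per lookup is bounded by one snapshot period rather than the whole transaction table.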

Comments closed
