Press "Enter" to skip to content

Category: Performance Tuning

Spark SQL Bucketing and Query Tuning

Tomaz Kastrun continues a series on Apache Spark. Part 13 looks at bucketing and partitioning in Spark SQL:

Partitioning and Bucketing in Hive are used to improve performance by eliminating table scans when dealing with a large set of data on a Hadoop file system (HDFS). The major difference between them is how they split the data.
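To make the split concrete, here is a minimal Spark SQL sketch (table and column names are mine, not from the post). Partitioning splits rows into one directory per column value, while bucketing hashes a column into a fixed number of files:

    -- Partitioned table: one directory per sale_date value on disk,
    -- so filters on sale_date can skip whole directories.
    CREATE TABLE sales_partitioned (id BIGINT, amount DOUBLE, sale_date DATE)
    USING parquet
    PARTITIONED BY (sale_date);

    -- Bucketed table: rows hashed on id into a fixed number of buckets,
    -- which helps joins and aggregations on id avoid a full shuffle.
    CREATE TABLE sales_bucketed (id BIGINT, amount DOUBLE, sale_date DATE)
    USING parquet
    CLUSTERED BY (id) INTO 8 BUCKETS;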

Part 14 covers query hints:

This hint instructs Spark to use the hinted strategy on the specified relation when joining tables together. When the BROADCASTJOIN hint is used on the Data1 table with the Data2 table, it overrides the suggested setting from the spark.sql.autoBroadcastJoinThreshold configuration.

Spark also prioritises join strategies: when different JOIN hints are specified for the same join, Spark SQL resolves them using a fixed order of precedence.
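For concreteness, the hint syntax looks like this (reusing the Data1/Data2 names from the excerpt; the key and value columns are hypothetical):

    -- Ask Spark to broadcast d1 regardless of spark.sql.autoBroadcastJoinThreshold.
    -- When competing strategy hints appear on both sides of a join, Spark SQL
    -- prefers BROADCAST over MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL.
    SELECT /*+ BROADCASTJOIN(d1) */ d1.key, d2.value
    FROM Data1 AS d1
    JOIN Data2 AS d2
      ON d1.key = d2.key;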

Be sure to check those out.


Batch Mode and Window Functions

I wind down a series on window functions:

SQL Server typically operates in row mode, which means that an operator processes one row at a time. This sounds inefficient, but tends to work out pretty well in practice. However, something which may work out even better is to process more than one row at a time, especially when the number of rows gets to be fairly large. Enter batch mode.

Batch mode was introduced in SQL Server 2012 alongside non-clustered columnstore indexes. It became interesting in SQL Server 2016 and very interesting in SQL Server 2019. That’s because 2016 introduced writable non-clustered columnstore indexes and 2019 gives us batch mode outside of columnstore indexes.

There are some nice potential performance gains for queries involving window functions.
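One widely used trick is worth sketching here (my illustration, not taken from the series; table and column names are hypothetical). On SQL Server 2016 and 2017 you can coax the optimizer into batch mode against a rowstore table by creating an empty filtered nonclustered columnstore index; on 2019 under compatibility level 150, batch mode on rowstore can kick in without it:

    -- The contradictory filter keeps this index permanently empty; its mere
    -- presence makes batch mode available to queries on dbo.SalesDetail.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_BatchMode
    ON dbo.SalesDetail (SaleID)
    WHERE SaleID = -1 AND SaleID = -2;

    -- A window function query that can now benefit; check the execution plan
    -- for a Window Aggregate operator running in batch mode.
    SELECT
        CustomerID,
        SaleDate,
        SUM(Amount) OVER (
            PARTITION BY CustomerID
            ORDER BY SaleDate
            ROWS UNBOUNDED PRECEDING
        ) AS RunningTotal
    FROM dbo.SalesDetail;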


Storage Testing for Azure SQL Managed Instances

Joe Obbish busts out the slide rule:

Lately I’ve been doing some exploratory performance testing on Azure SQL Managed Instances in preparation for a migration to that platform. This blog post documents some storage testing results and may even have practical advice near the end. All testing was done on a gen5 general purpose instance with 8 vCores.

Read on for Joe’s findings. Spoiler alert: there is practical advice at the end.


Measuring the Performance Impact of TDE

Manvendra Singh does the math:

Transparent Data Encryption (TDE) encrypts database files to secure your data. It also encrypts the tempdb database to secure your data in a temporary space. The process of encryption and decryption adds overhead to the database system. Even non-encrypted databases hosted on the same SQL Server instance would have some performance degradation because of tempdb encryption. Today I will show you a performance impact analysis using a few simple T-SQL statements, comparing their stats gathered before and after enabling TDE.

I will execute T-SQL statements (INSERT, UPDATE, SELECT, BACKUP DATABASE) before and after enabling encryption (Transparent Data Encryption) and gather their performance statistics during each execution. Once we have the performance stats from before and after enabling encryption, we will compare them to understand the performance impact. I have used DBCC DROPCLEANBUFFERS before executing each query to clean the buffer cache.

Click through for the results. My rule of thumb is about a 5% performance overhead, but Manvendra shows us some of the particulars of what that means.
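For reference, the moving parts involved in enabling TDE look like this (a minimal sketch with illustrative names like SalesDB and TDECert; Manvendra's post has the full test harness):

    USE master;
    GO
    -- The master key and certificate protect the database encryption key.
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPasswordHere>';
    CREATE CERTIFICATE TDECert WITH SUBJECT = 'TDE certificate';
    GO
    USE SalesDB;  -- illustrative database name
    GO
    CREATE DATABASE ENCRYPTION KEY
        WITH ALGORITHM = AES_256
        ENCRYPTION BY SERVER CERTIFICATE TDECert;
    GO
    ALTER DATABASE SalesDB SET ENCRYPTION ON;
    GO
    -- encryption_state = 3 means fully encrypted; note that tempdb
    -- shows up in the results as well.
    SELECT DB_NAME(database_id) AS database_name, encryption_state
    FROM sys.dm_database_encryption_keys;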


Star Schemas versus Header-Detail Tables in Power BI

Marco Russo and Alberto Ferrari lay out another proof that the star schema is the right schema for Power BI:

We have already shown in a previous article (Power BI – Star schema or single table – SQLBI) how the star schema proves to be the best option when compared with a single-table model. Single-table models are evil: do not be tempted by them; choose a star schema.

In this article, I want to show you an example in the opposite direction. A single-table model denormalizes everything into one table, and we have already learned that this is bad. But what if we keep a more normalized structure, as often happens in header/detail models (like orders and order lines)? Is a header/detail model better than a star schema? The quick answer is: "No. Nope. No way. Not at all. Are you kidding me? No." Nonetheless, this might be just our personal opinion. The goal of this article is to provide you with some numbers and considerations to prove the previous statement.

Read on and you make the call.


Measuring File Latency in SQL Server

Anthony Nocentino has a script and some tips for us:

This post is a reference post for retrieving IO statistics for data and log files in SQL Server. We’ll look at where we can find IO statistics in SQL Server, query it to produce meaningful metrics, and discuss some key points when interpreting this data.

Click through for the script, and then a bulleted list of things to keep in mind as you’re reviewing the data.
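The underlying DMV is sys.dm_io_virtual_file_stats. A representative query (not necessarily Anthony's exact script) divides cumulative stall time by I/O count to get average latency per file:

    -- Average read/write latency per database file, in milliseconds.
    -- Values are cumulative since instance startup, so sample twice
    -- and diff if you want latency over a specific window.
    SELECT
        DB_NAME(vfs.database_id) AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_writes,
        CASE WHEN vfs.num_of_reads = 0 THEN 0
             ELSE 1.0 * vfs.io_stall_read_ms / vfs.num_of_reads
        END AS avg_read_latency_ms,
        CASE WHEN vfs.num_of_writes = 0 THEN 0
             ELSE 1.0 * vfs.io_stall_write_ms / vfs.num_of_writes
        END AS avg_write_latency_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
        ON vfs.database_id = mf.database_id
       AND vfs.file_id = mf.file_id;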


Against Reporting Tables

Erik Darling doesn’t like reporting tables:

I’ve seen a lot of crazy ways for people to run reports on live data, but one of the worst ideas is creating a reporting table based on the set of data a user wants, and then letting them query that data.

As usual, Erik says something I want to disagree with, and then I read the post and don’t really disagree with him—or if I do, he’s already laid out the “Yes, I understand X” exception. I’ve used reporting tables to good effect, but the important thing is that they’re general-purpose and designed into the application, not specific to a single user.


Power BI Data Models with Multiple Aggregation Tables

Phil Seamark continues a series on Power BI aggregations:

Why might you consider having more than one aggregation table? The short answer is speed and overall resource efficiency.

Consider a heavily used Power BI report where page-load time is considered critical. A typical report may have half a dozen visuals on a page showing values computed over various grains. If the model used by the report has no aggregation tables, all calculations use the raw fact tables to produce values for each metric.

Adding an aggregation table to the model allows the same calculations as before to use smaller tables to produce the same result. Queries that can be answered from the smaller aggregation tables require much less effort from the server hosting the data model.

It was interesting to see just how easy the process is.
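As a rough illustration of the idea, in T-SQL rather than in Power BI itself and with hypothetical names: each aggregation table pre-groups the fact table at a coarser grain, and having more than one lets each visual hit the smallest table that can answer it (Power BI then maps these to the fact table via Manage aggregations):

    -- Mid-grain aggregation: date and product.
    SELECT DateKey, ProductKey,
           SUM(SalesAmount) AS SalesAmount,
           COUNT_BIG(*)     AS SourceRows
    INTO dbo.Sales_Agg_DateProduct
    FROM dbo.FactSales
    GROUP BY DateKey, ProductKey;

    -- Coarse-grain aggregation: date only. Far fewer rows again, so
    -- visuals at the date grain get the cheapest possible scan.
    SELECT DateKey,
           SUM(SalesAmount) AS SalesAmount
    INTO dbo.Sales_Agg_Date
    FROM dbo.FactSales
    GROUP BY DateKey;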
