Performance Tuning – Page 20

The Practical Costs of Index Fragmentation

Published 2022-03-29 by Kevin Feasel

Tibor Karaszi digs into index performance:

See numbers and diagrams at the end, or at the top. I measured a few cases: the difference between no external fragmentation and severe external fragmentation (over 99%). I have both a narrow index and a wide index, and I read one (1), 10,000 and 100,000 rows using index searches (“range scan”). There were obviously no difference reading 1 row so I exclude that from my discussion below. For the other cases the extra time with an extreme level of external fragmentation is (from lowest impact to highest) 7%, 10%, 13% and 32%. The highest number (32%) is when reading many rows from a narrow index, i.e. many rows per page. Again, this is with an extreme level of fragmentation.

What’s interesting is that for the most part, there’s a negligible difference between ~0% internal fragmentation and ~99% internal fragmentation. The follow-on question is, how much are defrag operations costing you in performance and when is the benefit worth the cost?

Comments closed

Query Performance Insights on the Serverless SQL Pool

Published 2022-03-10 by Kevin Feasel

Jovan Popovic shows how you can use the QPI library on an Azure Synapse Analytics serverless SQL pool:

You can find more of the best practices here. These best practices are very important because some issues might cause performance degradation. You might be surprised how applying some of these best practices might improve the performance of your workload.
The last item that is related to schema optimization is sometimes hard to check. You would need to look at your schema, inspect all columns and find what to optimize. If you have a large schema, this might not be an easy task. But you can make your life easier if you use the QPI helper library that can detect schema issues for you.

Read on to see what it can find.

Comments closed

Bucketing Data in Hive

Published 2022-01-20 by Kevin Feasel

Chitra Sapkal explains why bucketing in Hive can be so powerful:

When a column has a high cardinality, we can’t perform partitioning on it. A very high number of partitions will generate too many Hadoop files which would increase the load on the node. That’s because the node will have to keep the metadata of every partition, and that would affect the performance of that node
In simple words, You can use bucketing if you need to run queries on columns that have huge data, which makes it difficult to create partitions.

Click through to see how bucketing works and examples of how you can use it to make queries faster.

Comments closed

Improving Managed Instance Load Performance

Published 2022-01-19 by Kevin Feasel

Niko Neugebauer provides some tips on improving data load performance when using SQL Managed Instances in the General Purpose tier:

In this blog post we shall consider some of the strategies for improving data loading performance in Azure SQL Managed Instance. These strategies apply to the repeatable ETL processes, meaning that if there a problem, data loading process can be repeated without loss of any bit of data or its consistency. Of course, these strategies do require a prepared design for the possibility of the repetition, but this is a basic requirement for any data loading process.
There are many great ways of ensuring high performance data loading and this blog post does not pretend to be exhaustive an ultimate resource, but it will provide a couple of known paths that can drastically improve the performance of the Log Throughput in Azure SQL MI.

Click through for some tips, including some which make sense on-prem and a couple which are specific to platform-as-a-service offerings.

Comments closed

The Performance Impact of Dissimilarly-Sized tempdb Files

Published 2022-01-10 by Kevin Feasel

Chris Taylor puts on the lab coat and safety goggles:

tldr: Over the years I’ve read a lot of blog posts and watched a lot of videos where they mention that you should have your tempdb files all the same size. What I haven’t seen much of (if any) is what performance impact you actually see if they are not configured optimally. This blog post aims to address that

It is not too long, so do read.

Comments closed

Lessons Learned Troubleshooting High CPU in Azure SQL DB

Published 2022-01-07 by Kevin Feasel

Kendra Little has an after-action report:

I’ve just had the pleasure of publishing my first new article in the Microsoft Docs, Diagnose and troubleshoot high CPU on Azure SQL Database.
This article isn’t really “mine” – anyone in the community can create a Pull Request to suggest changes, or others at Microsoft may take it in a different direction. But I got to handle the outlining, drafting, and incorporation of suggested changes for the initial publication.
It was a ton of fun, and I learned a lot about Azure SQL Database in the process.

Click through for what Kendra learned specific to Azure SQL Database, and also read the article itself.

Comments closed

Dedicated SQL Pool Index, Distribution, and Partition Guidance

Published 2021-12-22 by Kevin Feasel

I have a write-up on the specific value of distributions, indexes, and partitions in Azure Synapse Analytics dedicated SQL pools:

Not too long ago, I ended up taking the DP-203 certification exam for sundry reasons. On that exam, they ask a lot about Azure Synapse Analytics, including indexing, distribution, and partitioning strategies. Because these can be a bit different from on-premises SQL Server, I wanted to cover what options are available and when you might choose them. Let’s start with distributions, as that’s the biggest change in thought process.

Read on for the guidance.

Comments closed

Spark SQL Bucketing and Query Tuning

Published 2021-12-15 by Kevin Feasel

Tomaz Kastrun continues a series on Apache Spark. Part 13 looks at bucketing and partitioning in Spark SQL:

Partitioning and Bucketing in Hive are used to improve performance by eliminating table scans when dealing with a large set of data on a Hadoop file system (HDFS). The major difference between them is how they split the data.

Part 14 covers query hints:

This hint instructs Spark to use the hinted strategy on specified relation when joining tables together. When BROADCASTJOIN hint is used on Data1 table with Data2 table and overrides the suggested setting of statistics from configuration spark.sql.autoBroadcastJoinThreshold.
Spark also prioritise the join strategy, and also when different JOIN strategies are used, Spark SQL will always prioritise them.

Be sure to check those out.

Comments closed

Batch Mode and Window Functions

Published 2021-12-07 by Kevin Feasel

I wind down a series on window functions:

SQL Server typically operates in row mode, which means that an operator processes one row at a time. This sounds inefficient, but tends to work out pretty well in practice. However, something which may work out even better is to process more than one row at a time, especially when the number of rows gets to be fairly large. Enter batch mode.
Batch mode was introduced in SQL Server 2012 alongside non-clustered columnstore indexes. It became interesting in SQL Server 2016 and very interesting in SQL Server 2019. That’s because 2016 introduced writable clustered columnstore indexes and 2019 gives us batch mode outside of columnstore indexes.

There are some nice potential performance gains for queries involving window functions.

Comments closed

Abnormal Tables and Skewed Data

Published 2021-10-29 by Kevin Feasel

Erik Darling reminds us to be vigilant in database design:

But the Posts table suffers from a serious design flaw in the public data dump: Questions and Answers are in the same table.
I’ve heard that it’s worse behind the scenes, but I don’t have any additional details on that.

Read on to understand why this is a problem and what the ramifications are.

Comments closed

Category: Performance Tuning