Press "Enter" to skip to content

Category: Performance Tuning

Shrinking Your Power BI Dataset Sizes

Gilbert Quevauvilliers wanted to reduce Power BI memory usage:

I had already applied all the best practices in terms of reducing the cardinality, removing unwanted columns and making sure that only the data required is being brought into the dataset. Even at this point the dataset size was consuming 90GB of memory in Azure Analysis Services. With the steps below I got my dataset size down to a whopping 37GB of memory!

I used the awesome tools from SQLBI.COM and DAX Studio to see which columns were consuming the most space, and because my dataset had currency-converted values, the cardinality was very high. (The reason I decided to store the currency conversion values is that doing the conversion on the fly in a large dataset is very slow.)

Two simple tricks led to a pretty nice reduction in size.
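
If you want to see where the space is going in your own model, the storage DMVs that DAX Studio's metrics are built on can also be queried directly. A rough sketch, run from DAX Studio or SSMS against the Analysis Services / Power BI model (standard DMV column names; your tooling may surface the same numbers differently):

    -- Sketch: find the columns with the largest dictionaries (i.e., highest cardinality).
    SELECT DIMENSION_NAME,    -- table
           ATTRIBUTE_NAME,    -- column
           DICTIONARY_SIZE    -- dictionary size in bytes
    FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS
    WHERE COLUMN_TYPE = 'BASIC_DATA'
    ORDER BY DICTIONARY_SIZE DESC;

Stored currency-conversion columns are exactly the sort of thing that floats to the top of a list like this.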


Aggregations in Power BI

Shabnam Watson takes us through aggregations in Power BI:

In Power BI, Aggregations start as tables just like any other table in a model. They can be based off a view or table in the source database, or created in Power BI with Power Query. They can be in Import or Direct Query storage mode.

Once in the model, these tables can be configured so that the engine can use them instead of a detail table to answer queries when possible. The process of creating and configuring aggregations in Power BI is significantly easier than the process of creating aggregations in SSAS multidimensional.

Once an aggregation table is configured, it becomes hidden from end users. Report developers and end users don’t know that it exists and don’t need to change anything in how they query the dataset.

This was one of the key benefits of a multidimensional model. Shabnam has an excellent, detailed article here, so give it a read if you are a Power BI developer.
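
For the "based off a view in the source database" option, the shape is simple: a view at a coarser grain than the detail table. A hedged sketch, with made-up table and column names:

    -- Hypothetical source view for an aggregation table: the same measures as the
    -- detail fact table, pre-grouped to the Date/Product grain.
    CREATE VIEW dbo.Sales_Agg_DailyProduct
    AS
    SELECT OrderDateKey,
           ProductKey,
           SUM(SalesAmount) AS SalesAmount,
           COUNT_BIG(*)     AS SalesRowCount
    FROM dbo.FactSales
    GROUP BY OrderDateKey, ProductKey;

Import that view as its own table, then map its columns to the detail table in the Manage aggregations dialog so the engine knows when it can substitute it.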


Profiling Hive Jobs on Tez

Dmitry Tolpeko takes us through Hive query diagnostics:

I was asked to diagnose and tune a long and complex ad-hoc Hive query that spent more than 4 hours on the reduce stage. The fetch from the map tasks and the merge phase completed fairly quickly (within 10 minutes) and the reducers spent most of their time iterating the input rows and performing the aggregations defined by the query – MIN, SUM, COUNT and PERCENTILE_APPROX and others on the specific columns.

After the merge phase a Tez reducer does not output many log records to help you diagnose the performance issues and find the bottlenecks. In this article I will describe how you can profile an already running Tez task without restarting the job.

Click through for the process, as well as the root cause of the problem.
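
To give the workload some shape, the query being profiled was doing per-group aggregates along these lines (a made-up HiveQL sketch, not Dmitry's actual query):

    -- Made-up sketch of a reducer-heavy aggregation: the reducers spend their time
    -- iterating input rows and computing these per-group aggregates.
    SELECT customer_id,
           MIN(order_ts)                   AS first_order,
           SUM(amount)                     AS total_amount,
           COUNT(*)                        AS order_count,
           PERCENTILE_APPROX(amount, 0.95) AS p95_amount
    FROM orders
    GROUP BY customer_id;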


Troubleshooting Tez Performance

Dmitry Tolpeko digs through Tez logs to figure out a performance issue:

Why did it take so long to run the job? Is there any way to improve its performance?

Tez Application Master Log
I am going to use the Tez AM log to investigate vertex performance and find possible bottlenecks.
Note that there is the Timeline Server REST API that you can use to get the statistics for Tez jobs, but the application master log is “event-driven”, shows the exact order of all events and contains much more detail in general.

Click through for the process.


Plan Hashes and Ad-Hoc Workloads

Erin Stellato has an experiment for us:

Borrowing and adapting code from a previous post, Examining the Performance Impact of an Adhoc Workload, we will first create two stored procedures. The first, dbo.RandomSelects, generates and executes an ad hoc statement, and the second, dbo.SPRandomSelects, generates and executes a parameterized query.

Erin then shows how to review query stats and group together executions which are the same save for a change in literals. Read the whole thing.
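
The core idea is that statements differing only in their literal values share a query_hash even though each one gets its own cached plan. A minimal sketch of that grouping (not Erin's exact script):

    -- Sketch: group cached statements by query_hash to find "same query,
    -- different literals" patterns bloating the plan cache.
    SELECT qs.query_hash,
           COUNT(DISTINCT qs.plan_handle) AS cached_plans,
           SUM(qs.execution_count)        AS total_executions,
           SUM(qs.total_worker_time)      AS total_cpu_time
    FROM sys.dm_exec_query_stats AS qs
    GROUP BY qs.query_hash
    HAVING COUNT(DISTINCT qs.plan_handle) > 1
    ORDER BY cached_plans DESC;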


Benchmarking JSON Query Times

Silvano Coriani compares different options for loading and querying JSON data in Azure SQL Database:

Storing and retrieving data from JSON fragments is a common need in many application scenarios, like IoT solutions or microservice-based architectures. These fragments can be persisted in a variety of data stores, from blob or file shares, to relational and non-relational databases, and there’s a long standing debate in the industry on what’s the database technology that fits “better” for this task.
 
Azure SQL Database offers several options for parsing, transforming and querying JSON data, and this article doesn’t pretend to provide a definitive answer to that debate, but rather to explore these options for common scenarios like data loading and retrieval, and to benchmark the results to provide a clear indication of how Azure SQL Database will perform when manipulating JSON data.

Read on for the results.
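
For a flavor of the T-SQL options under comparison, here is a small sketch (the table name and JSON shape are assumptions, not taken from the article):

    -- Option 1: shred a JSON payload into rows and columns with OPENJSON.
    SELECT j.DeviceId, j.Reading, j.ReadingTime
    FROM dbo.TelemetryRaw AS t
    CROSS APPLY OPENJSON(t.Payload)
        WITH (DeviceId    int           '$.deviceId',
              Reading     decimal(9, 2) '$.reading',
              ReadingTime datetime2     '$.ts') AS j;

    -- Option 2: pull a single scalar out of the fragment with JSON_VALUE.
    SELECT JSON_VALUE(t.Payload, '$.deviceId') AS DeviceId
    FROM dbo.TelemetryRaw AS t
    WHERE ISJSON(t.Payload) = 1;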


Query Folding with Power BI Dataflows

Matthew Roche shares a few important points about Power BI dataflows and query folding:

In a recent post I mentioned an approach for working around the import-only nature of Power BI dataflows as a data source in Power BI Desktop, and in an older post I shared information about the enhanced compute engine that’s currently available in preview.

Some recent conversations have led me to believe that I should summarize a few points about dataflows and query folding, because these existing posts don’t make them easy to find and understand.

Read on for those points.


Fixing Key Lookup Problems

Erik Darling has a couple techniques for mitigating key lookup-related performance problems:

They’re one of those things — I’d say even the most common thing — that makes parameterized code sensitive to the bad kind of parameter sniffing, so they get a lot of attention.

The thing is, most of the attention that they get is just for columns you’re selecting, and most of the advice you get is to “create covering indexes”.

That’s not always possible, and that’s why I did this session a while back on a different way to rewrite queries to sometimes make them more efficient. Especially since key lookups may cause blocking issues.

Read on to see what you can do when a covering index isn’t a viable option.
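
For context, the standard covering-index fix that isn't always on the table looks like this (hypothetical table and columns):

    -- Hypothetical setup: this query does a key lookup when the index on CustomerId
    -- doesn't carry OrderTotal, since that column has to come from the clustered index.
    SELECT OrderId, OrderTotal
    FROM dbo.Orders
    WHERE CustomerId = 42;

    -- The usual advice: cover the query so the lookup disappears.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
        ON dbo.Orders (CustomerId)
        INCLUDE (OrderTotal);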


Thinking Like the SQL Server Engine

Brent Ozar has started a series based on a video of the same name:

7,405 pages is about 15 reams of paper.
You know those 500-page packs of paper that you put into the copier or the printer? (No? Do you remember copiers and printers? Honestly, me neither.) The Users table is one of the smallest tables in the Stack Overflow database export, but it’s still 15 of those packs.

As we work through demos in the upcoming posts, I want you to visualize a stack of 15 reams of paper over in the corner of your room. When I ask you to query the table, I want you to think about how you’d execute that as a human being facing data spread across 15 reams of paper. It’d be a hell of a lot of work, and you wouldn’t be so eager to go grab the first piece of paper to start work. You’d wanna build a really good plan before you go tackle that stack of paper.

That’s a lot of paper.
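
If you want to run the same back-of-the-envelope math against one of your own tables, a quick sketch (assuming a dbo.Users table and roughly 500 pages per ream):

    -- Sketch: count the 8 KB pages behind a table and convert to "reams of paper".
    SELECT SUM(ps.used_page_count)         AS used_pages,
           SUM(ps.used_page_count) / 500.0 AS approx_reams
    FROM sys.dm_db_partition_stats AS ps
    WHERE ps.[object_id] = OBJECT_ID(N'dbo.Users')
      AND ps.index_id IN (0, 1);  -- heap or clustered index only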


Predicting Application Problems from the Database

Ed Pollack has a pattern for rooting out application problems based on database activity:

We can approach I/O file stats very similarly to how we handled row counts above: Regularly collect data, store it in a reporting table, and then run analytics against it as needed. Since these database metrics are reset when SQL Server services restart, we need to collect a bit more often. We’ll also want to collect often enough to be able to correlate changes to ongoing application activity. Hourly is typically an acceptable collection frequency, but your environment may lend itself to more frequent or less frequent collection.

What’s nice is that you can get a long way with heuristics and domain knowledge, even before applying data science techniques.
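
The collection step the excerpt describes can be as simple as snapshotting the cumulative DMV values on a schedule. A minimal sketch (the history table name and shape are made up, not Ed's):

    -- Made-up history table plus collection step: snapshot cumulative file stats
    -- each run so deltas can be computed between collections.
    CREATE TABLE dbo.IOFileStatsHistory
    (
        collection_time      datetime2 NOT NULL,
        database_id          int       NOT NULL,
        [file_id]            int       NOT NULL,
        num_of_reads         bigint    NOT NULL,
        num_of_bytes_read    bigint    NOT NULL,
        num_of_writes        bigint    NOT NULL,
        num_of_bytes_written bigint    NOT NULL,
        io_stall             bigint    NOT NULL
    );

    INSERT INTO dbo.IOFileStatsHistory
        (collection_time, database_id, [file_id], num_of_reads, num_of_bytes_read,
         num_of_writes, num_of_bytes_written, io_stall)
    SELECT SYSUTCDATETIME(), vfs.database_id, vfs.[file_id],
           vfs.num_of_reads, vfs.num_of_bytes_read,
           vfs.num_of_writes, vfs.num_of_bytes_written, vfs.io_stall
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs;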
