Press "Enter" to skip to content

Category: Performance Tuning

Performance Tuning Load of Partitioned Hive Tables on S3 with Spark

Dmitry Tolpeko walks us through a performance problem in Spark:

I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads them into a daily partition of a Hive table. This is a typical job in a data lake, it is quite simple but in my case it was very slow.

Initially it took about 4 hours to convert ~2,100 input .gz files (~1.9 TB of data) into Parquet, while the actual Spark job took just 38 minutes to run and the remaining time was spent on loading data into a Hive partition.

Let’s see what is the reason of such behavior and how we can improve the performance.

Read on to see how.

Comments closed

Costs and Benefits of Recompilation

Erik Darling takes us through the pros and cons of slapping that RECOMPILE label on a query:

It’s been a while since SQL Server has had a real RECOMPILE problem. And if you put it up against the performance problems that you can hit with parameter sniffing, I’d have a hard time telling someone strapped for time and knowledge that it’s the worst idea for them.

Obviously, you can run into problems if you (“you” includes Entity Framework, AKA the Database Demolisher) author the kind of queries that take a very long time to compile. But as I list them out, I’m kinda shrugging.

My rule of thumb is that WITH RECOMPILE isn’t the first answer, but I won’t mess around too long before going to it.

Comments closed

The Benefits of DAX Variables

Reza Rad explains why you should use DAX variables if you’re repeating calculations:

We have to main parts in the expression above: A and B. Each of those is doing a calculation. Now, with the markings above, reading the expression is much simpler. The whole expression means this:

=IF(A>B, A, B)

All the above expression is saying is that if A is bigger than B, then return A, otherwise B. Now it is much simpler to read it because we split the repetitive parts into sections. That is what exactly the DAX variable is for.

Readability is not the only benefit, however. Reza has more.

Comments closed

What Slows Down Clustered Index Rebuilds

Kevin Chant has a few reasons why you might see slow clustered index rebuilds in your environment:

I better point out that online rebuilds in general tend to take longer. Mostly because behind the scene’s it’s making a rebuilt copy of your index and then it swaps around to the new index once it has completed.

However, there is another key point I should mention here.

Kevin also points out a sub-item for online rebuilds which could fit just as well in offline rebuilds: if there’s a long-running transaction which blocks SQL Server from taking the schema modification lock, you’ll be sitting there until those long-running transactions ahead of you finish.

Comments closed

Power BI Aggregation Precedence

Shabnam Watson asks and answers a question of importance:

Precedence is one of the aggregation properties that you can define on an aggregation table in Power BI. In a model with multiple aggregation tables, you can use Precedence to define the order in which aggregation tables will be considered by Power BI to answer queries. The higher the Precedence number, the sooner the aggregation table will be considered. Very simple and easy to understand. but the question is:

What does Power BI do when there are multiple aggregation tables configured with the same Precedence value and they can all answer the same query? Which one is considered first? Does it choose the smallest one? or is there another rule in place?

Click through to find out.

Comments closed

Testing In-Browser Power BI Report Performance

Chris Webb gives us some tips on testing Power BI reports in a web browser:

It turns out that testing performance of a report in the browser is not as straightforward as it seems. In this post I’m going to describe some of the factors you have to take into account when doing this type of testing; in the next post I’ll go into more detail about how you actually measure report rendering times in the browser and how to see what happens when the report is rendered.

Click through for those factors.

Comments closed

Shrinking Your Power BI Dataset Sizes

Gilbert Quevauvilliers wanted to reduce Power BI memory usage:

I had already applied all the best practices in terms of reducing the cardinality, removing unwanted columns and making sure that only the data required is being brought into the dataset. Even at this point the dataset size was consuming 90GB of memory in Azure Analysis Services. With the steps below I got my dataset size down to a whopping 37GB of memory!

I used the awesome tools from SQLBI.COM and DAX Studio to see which columns were consuming the most space, and because my dataset had currency converted values, this meant that the cardinality was very high. (The reason that I decided to store the currency conversion values, is when trying to do it on the fly in a large dataset it is very slow)

Two simple tricks led to a pretty nice reduction in size.

Comments closed

Aggregations in Power BI

Shabnam Watson takes us through aggregations in Power BI:

In Power BI, Aggregations start as tables just like any other table in a model. They can be based off a view or table in the source database, or created in Power BI with Power Query. They can be in Import or Direct Query storage mode.

Once in the model, these tables can be configured so that the engine can use them instead of a detail table to answer queries when possible. The process of creating and configuring aggregations in Power BI is significantly easier than the process of creating aggregations in SSAS multidimensional.

Once an aggregation table is configured, it becomes hidden from end users. Report developers and end users don’t know that it exists and don’t need to change anything in how they query the dataset.

This was one of the key benefits to a multidimensional model. Shabnam has an excellent, detailed article here, so give it a read if you are a Power BI developer.

Comments closed

Profiling Hive Jobs on Tez

Dmitry Tolpeko takes us through Hive query diagnostics:

I was asked to diagnose and tune a long and complex ad-hoc Hive query that spent more than 4 hours on the reduce stage. The fetch from the map tasks and the merge phase completed fairly quickly (within 10 minutes) and the reducers spent most of their time iterating the input rows and performing the aggregations defined by the query – MIN, SUM, COUNT and PERCENTILE_APPROX and others on the specific columns.

After the merge phase a Tez reducer does not output many log records to help you diagnose the performance issues and find the bottlenecks. In this article I will describe how you can profile an already running Tez task without restarting the job.

Click through for the process, as well as the root cause of the problem.

Comments closed

Troubleshooting Tez Performance

Dmitry Tolpeko digs through Tez logs to figure out a performance issue:

Why did it take so long to run the job? Is there any way to improve its performance?

Tez Application Master Log
I am going to use the Tez AM log to investigate vertex performance and find possible bottlenecks.
Note that there is the Timeline Server REST API that you can use to get the statistics for Tez jobs, but the application master log is “event-driven”, shows the exact order of all events and contains much more details in general.

Click through for the process.

Comments closed