Performance Tuning – Page 25

Measuring DirectQuery Performance

If you have a slow DirectQuery report in Power BI, one of the first questions you need to ask is how long the SQL queries that Power BI generates take to run. This is a more complicated question to answer than you might think, though, and in this post I’ll explain why.
I happen to have access to some of the famous New York taxi data in a Snowflake database, and in there is a table with trip data that has 173 million rows that I have built a Power BI dataset from. The data and the database used are not really important here though – what is important is that it’s DirectQuery and a large-ish amount of data.

Read on for more information on how it all works.

Troubleshooting Code Performance in R

Published 2021-04-29 by Kevin Feasel

Mira Celine Klein shows how to benchmark R code performance:

Let’s assume you have written some code, it’s working, it computes the results you need, but it is really slow. If you don’t want to get slowed down in your work, you have no other choice than improving the code’s performance. But how to start? The best approach is to find out where to start optimizing.
It is not always obvious which part of the code makes it so slow, or which of multiple alternatives is fastest. There is a risk of spending a lot of time optimizing the wrong part of the code. Fortunately, there are ways to systematically test how long a computation takes. An easy way is the function system.time. Just wrap your code in this function, and you will (in addition to the actual results of that code) get the time your code took to run.

But that’s not the only route—read on to learn about other techniques as well and see them in action.
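
For readers who do not work in R, the "wrap your code and get the elapsed time" idea from the excerpt can be sketched in Python. This is only a rough analog of system.time, not the R function itself; the helper name timed and the two sum_of_squares functions are made up for illustration.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds).

    Hypothetical helper: a rough Python analog of wrapping code in
    R's system.time, not a port of it.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def sum_of_squares_loop(n):
    # Candidate 1: explicit loop with an accumulator.
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_of_squares_builtin(n):
    # Candidate 2: the same computation using the built-in sum().
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    # Time each alternative on the same input before deciding what to optimize.
    for candidate in (sum_of_squares_loop, sum_of_squares_builtin):
        result, elapsed = timed(candidate, 200_000)
        print(f"{candidate.__name__}: result={result}, elapsed={elapsed:.4f}s")
```

A single run like this is noisy, so treat the numbers as a first hint about where the time goes rather than a benchmark.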

Importing Data from ADLS Gen2 into Power BI

Published 2021-04-14 by Kevin Feasel

Chris Webb summarizes a significant amount of work:

Over the last few months I’ve written a series of posts looking at different aspects of one question: what is the best way to import data from ADLSgen2 storage into a Power BI dataset? For example, is Parquet really better than CSV? Should you use Azure Synapse Serverless? In this post I’m going to summarise my findings and offer some recommendations – although, as always, I need to stress that these are the conclusions I can draw from my test results and not the absolute, incontrovertible “Microsoft-says” truth so please do your own testing too.

Read on and check it out for yourself.

Star Schemas and Power BI Go Together

Published 2021-04-13 by Kevin Feasel

Marco Russo and Alberto Ferrari explain why star schemas make so much sense for Power BI:

Why should I have products, sales, date and customers as separate tables? Wouldn’t it be better to store everything in a single table named Sales that contains all the information? After all, every query I will ever run will always start from Sales. By storing everything in a single table, I avoid paying the price of relationships at query time, therefore my model will be faster.
There are multiple reasons why a single, large table is not better than a star schema. Here anyway, the focus is strictly on performance. Is it true that a single table is faster than a star schema? After all, we all know that joining two tables is an expensive operation. So it seems reasonable to think that removing the problem of joins ends up in the model being faster. Besides, with the advent of NOSQL and big data, there are so many so-called data lakes holding information within one single table… Isn’t it tempting to use those data sources without any transformation?

Read on to see why this is not the case.

HammerDB CLI for Oracle Running on Azure

Published 2021-04-09 by Kevin Feasel

Kellyn Pot’vin-Gorman goes through a rough experience:

Disclaimer: I’m not a big fan of benchmark data. I find it doesn’t provide us as much value in the real world as we’d like to think it does. As Cary Millsap says, “You can’t hardware your way out of a software problem” and I find that many folks think that if they just get the fastest hardware, their software problems will go away and this just isn’t true. Sooner or later, it’s going to catch up with you, and it rarely tells you what your real database workload needs to run most efficiently or what might be running in your database that could easily be optimized to save the business time and money.
The second issue is that when comparing different workloads or even worse, different platforms or applications, using the same configuration can be detrimental to the benchmarks collected, which is what we’ll discover in this post.

That said, Kellyn dives into the problem and documents several of the issues in building out this test.

Comparing CSV to Parquet File Loading Performance in Power BI

Published 2021-04-05 by Kevin Feasel

Chris Webb has a comparison for us:

Earlier in this series on importing data from ADLSgen2 into Power BI I showed how partitioning a table in your dataset can improve refresh performance. In that post I used CSV files in ADLSgen2 as my source and created one partition per CSV file, but after my recent discovery that importing data from multiple Parquet files can be tuned to be a lot faster than importing data from CSV files, I decided to try creating partitions linked to Parquet files instead.

Click through for the experiment and its results.

Why is Power BI Slow?

Published 2021-04-01 by Kevin Feasel

Patrick LeBlanc has some bad news for us all:

We’ve seen people comment that Power BI is SLOW. But, what they really mean is your report is slow. Patrick breaks things down to get you pointed in the right direction.

Click through to see what might cause your Power BI report baby to be ugly.

Window Functions in Row and Batch Modes

Published 2021-04-01 by Kevin Feasel

Erik Darling digs into a new series:

To start things off, we’re going to talk about query plan patterns related to windowing functions.
There are several things to consider with windowing function query plans:
– Row vs Batch mode
– With and Without Partition By
– Index Support for Partition and Order By
– Column SELECTion
– Rows vs Range/Global aggregates
We’ll get to them in separate posts, because there are particulars about them that would make covering them all in a single post unwieldy.
Anyway, the first one is pretty simple, and starting simple is about my speed.

Read on for this quick coverage of row mode versus batch mode processing with respect to window functions.
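
Two items on that list, partitioning and rows versus range frames, change what a window function returns, not just how the plan looks. Here is a small sketch of the difference. It uses Python's built-in sqlite3 module rather than SQL Server (an assumption made so the example is self-contained; it needs SQLite 3.25 or later for window functions), and the trips table and its values are invented. It says nothing about row mode, batch mode, or SQL Server query plans.

```python
import sqlite3


def running_totals():
    """Return (day, fare, rows_total, range_total, day_total, grand_total) rows.

    Illustrative only: SQLite stands in for SQL Server, and the table
    and data are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (day INTEGER, fare INTEGER)")
    # Two trips share day 1, so they are peers when ordering by day.
    conn.executemany("INSERT INTO trips VALUES (?, ?)",
                     [(1, 10), (1, 20), (2, 30)])
    rows = conn.execute("""
        SELECT day, fare,
               -- ROWS frame: running total one physical row at a time
               SUM(fare) OVER (ORDER BY day
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
               -- RANGE frame: peers with the same day are included together
               SUM(fare) OVER (ORDER BY day
                               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
               -- With PARTITION BY: one total per day
               SUM(fare) OVER (PARTITION BY day),
               -- Without PARTITION BY or ORDER BY: a global aggregate
               SUM(fare) OVER ()
        FROM trips
        ORDER BY day, fare
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in running_totals():
        print(row)
```

With the RANGE frame, both day 1 rows report the same running total because they are peers in the ORDER BY. With the ROWS frame, the totals for tied rows depend on the order in which the engine happens to process them.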

Caching versus Persisting in Spark

Published 2021-03-31 by Kevin Feasel

The Hadoop in Real World team explains a subtle difference:

cache() and persist() functions are used to cache intermediate results of an RDD or DataFrame or Dataset. You can mark an RDD, DataFrame or Dataset to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, the objects behind the RDD, DataFrame or Dataset on which cache() or persist() is called will be kept in memory or on the configured storage level on the nodes.

That’s the similarity, but click through for the difference.
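
The "marked now, materialized on the first action" behavior in the excerpt can be sketched with a toy model in plain Python. This is not the Spark API: LazyDataset and its methods are hypothetical names that only mimic the idea, and the sketch ignores storage levels and distribution across nodes.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark class)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function producing the data
        self._cache_requested = False  # set by cache(), nothing is computed yet
        self._cached = None            # filled in by the first action
        self.compute_calls = 0         # counts how often the data was recomputed

    def cache(self):
        # Only marks the dataset; no computation happens here.
        self._cache_requested = True
        return self

    def collect(self):
        # An "action": returns the stored result if there is one,
        # otherwise computes it (and keeps it if cache() was requested).
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = list(self._compute())
        if self._cache_requested:
            self._cached = data
        return data


if __name__ == "__main__":
    plain = LazyDataset(lambda: (x * x for x in range(4)))
    plain.collect()
    plain.collect()
    print("without cache:", plain.compute_calls, "computations")

    marked = LazyDataset(lambda: (x * x for x in range(4))).cache()
    marked.collect()
    marked.collect()
    print("with cache:", marked.compute_calls, "computation")
```

The unmarked dataset recomputes on every action, while the marked one computes once and reuses the stored result afterward.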

Spark Performance in Azure Synapse Analytics

Published 2021-03-31 by Kevin Feasel

Euan Garden shares some numbers around Apache Spark performance in Azure Synapse Analytics:

To compare the performance, we derived queries from TPC-DS with 1TB scale and ran them on an 8-node Azure E8V3 cluster (15 executors – 28g memory, 4 cores). Even though our version running inside Azure Synapse today is a derivative of Apache Spark™ 2.4.4, we compared it with the latest open-source release of Apache Spark™ 3.0.1 and saw Azure Synapse was 2x faster in total runtime for the TPC-DS comparison.

Click through for several techniques the Azure Synapse Analytics team has implemented to make some significant performance improvements. It’s still slower than Databricks, but considerably faster than the open-source Apache Spark baseline.
