Press "Enter" to skip to content

Category: Performance Tuning

Importing Data from ADLS Gen2 into Power BI

Chris Webb summarizes a significant amount of work:

Over the last few months I’ve written a series of posts looking at different aspects of one question: what is the best way to import data from ADLSgen2 storage into a Power BI dataset? For example, is Parquet really better than CSV? Should you use Azure Synapse Serverless? In this post I’m going to summarise my findings and offer some recommendations – although, as always, I need to stress that these are the conclusions I can draw from my test results and not the absolute, incontrovertible “Microsoft-says” truth so please do your own testing too.

Read on and check it out for yourself.
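Chris’s tests are specific to Power BI’s import engine, but the format difference at the heart of the question is easy to feel in any tool. Here’s a minimal Python sketch (the dataset shape, sizes, and file names are illustrative assumptions, not from his tests):

```python
import time

import numpy as np
import pandas as pd

# Build a throwaway dataset; its shape and size are arbitrary assumptions
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.arange(2_000_000),
    "amount": rng.random(2_000_000),
    "category": rng.choice(list("ABCD"), 2_000_000),
})
df.to_csv("sales.csv", index=False)
df.to_parquet("sales.parquet")  # requires pyarrow or fastparquet

# Time a full load of each format; Parquet is columnar and typed, so it
# usually avoids the text-parsing cost that dominates CSV reads
for loader, path in [(pd.read_csv, "sales.csv"), (pd.read_parquet, "sales.parquet")]:
    start = time.perf_counter()
    loader(path)
    print(f"{path}: {time.perf_counter() - start:.2f}s")
```

None of this reproduces Power BI’s import path, of course; it just shows why the format question is worth asking in the first place.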

Star Schemas and Power BI Go Together

Marco Russo and Alberto Ferrari explain why star schemas make so much sense for Power BI:

Why should I have products, sales, date and customers as separate tables? Wouldn’t it be better to store everything in a single table named Sales that contains all the information? After all, every query I will ever run will always start from Sales. By storing everything in a single table, I avoid paying the price of relationships at query time, therefore my model will be faster.

There are multiple reasons why a single, large table is not better than a star schema. Here anyway, the focus is strictly on performance. Is it true that a single table is faster than a star schema? After all, we all know that joining two tables is an expensive operation. So it seems reasonable to think that removing the problem of joins ends up in the model being faster. Besides, with the advent of NoSQL and big data, there are so many so-called data lakes holding information within one single table… Isn’t it tempting to use those data sources without any transformation?

Read on to see why this is not the case.
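To make the size side of the argument concrete, here is a loose pandas analogy (this is not VertiPaq, whose dictionary compression changes the numbers considerably; that nuance is exactly what Marco and Alberto dig into): denormalizing into one table repeats every dimension attribute on every fact row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000

# Star schema: a narrow fact table plus a small product dimension
products = pd.DataFrame({
    "product_id": np.arange(1_000),
    "product_name": [f"Product {i}" for i in range(1_000)],
    "category": rng.choice(["Bikes", "Components", "Clothing"], 1_000),
})
sales = pd.DataFrame({
    "product_id": rng.integers(0, 1_000, n),
    "quantity": rng.integers(1, 10, n),
})

# Single-table alternative: every fact row repeats the product attributes
flat = sales.merge(products, on="product_id")

star = sales.memory_usage(deep=True).sum() + products.memory_usage(deep=True).sum()
print(f"star: {star / 1e6:.0f} MB, single table: "
      f"{flat.memory_usage(deep=True).sum() / 1e6:.0f} MB")
```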

HammerDB CLI for Oracle Running on Azure

Kellyn Pot’vin-Gorman goes through a rough experience:

Disclaimer: I’m not a big fan of benchmark data. I find it doesn’t provide us as much value in the real world as we’d like to think it does. As Cary Millsap says, “You can’t hardware your way out of a software problem,” and I find that many folks think that if they just get the fastest hardware, their software problems will go away, and this just isn’t true. Sooner or later, it’s going to catch up with you, and it rarely tells you what your real database workload needs to run most efficiently or what might be running in your database that could easily be optimized to save the business time and money.

The second issue is that when comparing different workloads or, even worse, different platforms or applications, using the same configuration can be detrimental to the benchmarks collected, which is what we’ll discover in this post.

That said, Kellyn dives into the problem and documents several of the issues in building out this test.

Comparing CSV to Parquet File Loading Performance in Power BI

Chris Webb has a comparison for us:

Earlier in this series on importing data from ADLSgen2 into Power BI I showed how partitioning a table in your dataset can improve refresh performance. In that post I used CSV files in ADLSgen2 as my source and created one partition per CSV file, but after my recent discovery that importing data from multiple Parquet files can be tuned to be a lot faster than importing data from CSV files, I decided to try creating partitions linked to Parquet files instead.

Click through for the experiment and its results.
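Power BI partitions are defined in the model rather than in code, but the one-partition-per-file idea itself is easy to sketch outside it. A hypothetical Python version, with the folder path as an assumption:

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

# Assumed layout: one Parquet file per intended partition
folder = Path("/data/sales")

# Map each file to its own "partition", mirroring the one-partition-per-file setup
partitions = {f.name: pq.read_table(f) for f in sorted(folder.glob("*.parquet"))}

# The full table is just the union of the partitions; each one can be
# refreshed (re-read) independently without touching the others
full = pa.concat_tables(partitions.values())
print({name: t.num_rows for name, t in partitions.items()})
```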

Window Functions in Row and Batch Modes

Erik Darling digs into a new series:

To start things off, we’re going to talk about query plan patterns related to windowing functions.

There are several things to consider with windowing function query plans:

– Row vs Batch mode
– With and Without Partition By
– Index Support for Partition and Order By
– Column SELECTion
– Rows vs Range/Global aggregates

We’ll get to them in separate posts, because there are particulars about them that would make covering them all in a single post unwieldy.

Anyway, the first one is pretty simple, and starting simple is about my speed.

Read on for this quick coverage of row mode versus batch mode processing with respect to window functions.
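Most of those plan patterns only make sense against a live SQL Server, but the last item on Erik’s list, Rows vs Range framing, is easy to illustrate anywhere. A small pandas sketch of how the two differ when the ORDER BY column has ties (the data is made up):

```python
import pandas as pd

# Two rows tie on the ordering column
df = pd.DataFrame({"order_col": [1, 2, 2, 3], "val": [10, 20, 30, 40]})

# ROWS framing: a running total that sees only rows physically before this one
df["rows_sum"] = df["val"].cumsum()

# RANGE framing: tied rows are peers, so they all see the same running total
df["range_sum"] = df.groupby("order_col")["rows_sum"].transform("max")

print(df)
#    order_col  val  rows_sum  range_sum
# 0          1   10        10         10
# 1          2   20        30         60
# 2          2   30        60         60
# 3          3   40       100        100
```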

Caching versus Persisting in Spark

The Hadoop in Real World team explains a subtle difference:

The cache() and persist() functions are used to cache the intermediate results of an RDD, DataFrame, or Dataset. You can mark an RDD, DataFrame, or Dataset to be persisted by calling the persist() or cache() method on it. The first time it is computed in an action, the objects behind the RDD, DataFrame, or Dataset on which cache() or persist() was called will be kept in memory, or at the configured storage level, on the nodes.

That’s the similarity, but click through for the difference.
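A short PySpark sketch of the distinction (the storage levels chosen here are just for illustration):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()

df = spark.range(1_000_000)

# cache() is shorthand for persist() at the default storage level
# (MEMORY_AND_DISK for DataFrames in recent Spark releases)
df.cache()

# persist() accepts an explicit storage level
df2 = spark.range(1_000_000)
df2.persist(StorageLevel.DISK_ONLY)

# Neither call materializes anything by itself; the first action does
df.count()
df2.count()

# Free the cached blocks when finished
df.unpersist()
df2.unpersist()
```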

Spark Performance in Azure Synapse Analytics

Euan Garden shares some numbers around Apache Spark performance in Azure Synapse Analytics:

To compare the performance, we derived queries from TPC-DS with 1TB scale and ran them on an 8-node Azure E8V3 cluster (15 executors – 28g memory, 4 cores). Even though our version running inside Azure Synapse today is a derivative of Apache Spark™ 2.4.4, we compared it with the latest open-source release of Apache Spark™ 3.0.1 and saw that Azure Synapse was 2x faster in total runtime for the Test-DS comparison.

Click through for several techniques the Azure Synapse Analytics team has implemented to make some significant performance improvements. It’s still slower than Databricks, but considerably faster than the open-source Apache Spark baseline.

Optimizing Power BI Data Load from a Folder of Parquet Files

Chris Webb has a tip for us:

In all the testing I’ve done recently with importing data from Parquet files into Power BI I noticed something strange: loading data from a folder containing multiple Parquet files seemed a lot slower than I would expect, based on the time taken to load data from a single file. So I wondered – is there something that can be optimised? It turns out there is and in this blog post I’ll show you what I did.

Click through to see how Chris cut load time down to approximately half what it was.
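Chris’s fix lives in Power Query, so don’t read the following as his method; it’s just a generic Python sketch of the same instinct, overlapping per-file reads rather than working through the folder serially (the path is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

files = sorted(Path("/data/sales").glob("*.parquet"))

# Serial baseline: one file at a time
# tables = [pq.read_table(f) for f in files]

# Concurrent reads let I/O and decoding for different files overlap,
# which is often where most of the wall-clock time goes
with ThreadPoolExecutor(max_workers=8) as pool:
    tables = list(pool.map(pq.read_table, files))

full = pa.concat_tables(tables)
print(f"{len(files)} files, {full.num_rows} rows")
```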

Cost versus Performance Optimization for SQL Server on VMs in Azure

Pam Lahoud takes a look at multi-constraint optimization:

So how do you get the best price-performance possible when configuring your SQL Server on Azure VM? In this blog, we’re going to cover three key aspects to right-sizing (and right-configuring) your Azure VM for SQL Server that are based on some common pitfalls customers face when migrating their on-premises workloads to Azure VM:

– Choosing the best VM series and size for your workload
– Configuring storage for maximum throughput and lower cost
– Leveraging unique-to-Azure features such as host caching to boost performance at no additional cost

One key point of the article is that there are several factors which can make a big difference in price and performance, but which you might not think about on-premises. It’s definitely worth taking the time to research this. It’s also a great example of how administrators are still important in a cloud-based world—having an admin who understands these settings and can get the most out of a given server can save a lot of money very quickly.
