Author: Kevin Feasel

Comparing Lakehouse and Warehouse Performance Again

Gilbert Quevauvilliers provides some more comparisons:

I learnt a lot and based on the feedback people asked for me to compare the Lakehouse vs the Warehouse with 1 billion rows.

What I also did this time was to optimize anything I could with regards to loading data into the Lakehouse or the Warehouse based on the feedback I received.

Below is a list of the changes I made

Read on for those changes and how they affected performance. That’s the tricky part about performance comparisons: unless you know how to tweak all options equally, you can end up with skewed results.

I’d also be interested in how the Eventhouse fares. I believe that, when it comes to data retrieval, the Eventhouse is the fastest option available to us.

An Introduction to MicrosoftFabricMgmt

Rob Sewell has a series of posts on MicrosoftFabricMgmt. The first post provides an introduction:

I have been introducing the Microsoft fabric-toolbox — covering the toolbox itself, FUAM, and FCA. All excellent tools. But there is one item in the toolbox that I have been personally involved in building, and it is the one I am most excited to write about.

Today I am kicking off a series of posts about MicrosoftFabricMgmt — an enterprise-grade PowerShell module that gives you comprehensive, scriptable control over the entire Microsoft Fabric REST API. It is hosted as part of the fabric-toolbox on GitHub.

The second post covers installation and authentication:

Yesterday I introduced the MicrosoftFabricMgmt module and explained what it can do. Today we are getting hands on — installing the module, sorting out dependencies, and making your first connection to Microsoft Fabric.

By the end of this post you will have the module installed, be authenticated, and have your first list of Fabric workspaces in your terminal.

The third post involves not having to deal with a bunch of GUIDs:

Which workspace is 948d3445-54a5-4c2a-85e7-2c3d30933992? Which capacity? Who knows — go look it up. Multiply that by fifty items across ten workspaces and you have a frustrating afternoon ahead of you.

The PowerShell module MicrosoftFabricMgmt solves some of this frustration.

Debugging DAX Variables via TOJSON() and TOCSV()

Marco Russo and Alberto Ferrari write out some intermediate results:

In a previous article, Debugging DAX measures in Power BI, we described several techniques to find errors in a DAX formula. The most basic approach, one that requires no external tools, is to temporarily change the RETURN statement of a measure so that it returns the value of an intermediate variable instead of the final result. When the variable contains a scalar value such as a number or a string, this is straightforward: you change the RETURN, observe the result in the report, and compare it with your expectations.

Read on to see how these functions work.

Architecting Your First Microservice

Bijoy Choudhury builds a process:

In any microservices migration, extracting services from all their dependencies and point-to-point integrations carries the most risk. If you feel hesitant about decomposing your application, that hesitation is justified. The first service extraction is uniquely challenging because you have to examine years of accumulated technical debt and unresolved organizational decisions at the same time. 

That’s why the objective for the first service extraction should not focus on achieving immediate scalability or to redefine organizational practices but to validate a narrow capability. Instead, it’s about identifying a discrete unit of functionality that can be isolated, deployed independently, and integrated with the existing system without rewriting the entire system or introducing instability.

There’s some good advice in here, as well as one reason why I’m not totally sold on microservices: the isolation of databases. This sounds great until you’re hitting seven different services to retrieve data 100x slower than a simple SQL query would have been because you have complex filtering criteria across these seven services. And then you build an extra layer of caching, introducing even more complexity to solve a problem that never needed to exist.
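To make that concern concrete, here is a minimal sketch using only two simulated services rather than seven. Every name and row in it is a hypothetical stand-in, and the `calls` counter only approximates network round trips. It shows the shape of the problem, not a measurement of it.

```python
import sqlite3

# Hypothetical data owned by two separate "services" (all names are illustrative).
CUSTOMERS = {1: {"region": "EU"}, 2: {"region": "US"}, 3: {"region": "EU"}}
ORDERS = [
    {"order_id": 10, "customer_id": 1, "total": 250},
    {"order_id": 11, "customer_id": 2, "total": 900},
    {"order_id": 12, "customer_id": 3, "total": 40},
    {"order_id": 13, "customer_id": 1, "total": 75},
]

calls = 0  # counts simulated service round trips


def order_service_list():
    """Simulated order service: returns every order it owns."""
    global calls
    calls += 1
    return list(ORDERS)


def customer_service_get(customer_id):
    """Simulated customer service: returns one customer per call."""
    global calls
    calls += 1
    return CUSTOMERS[customer_id]


def large_eu_orders_via_services(min_total):
    # The filter spans both services, so the caller stitches the results together.
    matches = []
    for order in order_service_list():
        if order["total"] >= min_total:
            customer = customer_service_get(order["customer_id"])
            if customer["region"] == "EU":
                matches.append(order["order_id"])
    return sorted(matches)


def large_eu_orders_via_sql(min_total):
    # The same data in one database: the filter becomes a single joined query.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT)")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(cid, c["region"]) for cid, c in CUSTOMERS.items()],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :customer_id, :total)", ORDERS
    )
    rows = conn.execute(
        "SELECT o.order_id FROM orders o "
        "JOIN customers c ON c.customer_id = o.customer_id "
        "WHERE c.region = 'EU' AND o.total >= ? ORDER BY o.order_id",
        (min_total,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


service_result = large_eu_orders_via_services(70)
sql_result = large_eu_orders_via_sql(70)
```

Both paths return the same order IDs, but the service path makes one call to list orders plus one call per candidate order, and that count grows with the data and with each additional service involved in the filter.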

Performance Tuning Dependent SQL Queries in DirectQuery Mode

Chris Webb tries a change:

As I described here, Power BI can send SQL queries in parallel in DirectQuery mode and you can see from the Timeline column there is some parallelism happening here – the last two SQL queries generated by the DAX query run at the same time – but everything has to wait for that first SQL query to complete. Why? Can this be tuned?

Click through for an example. I was thinking about how challenging it would be to improve this performance at the SQL query level, and whether you could build a single query that operates over all three sets of data (distinct customers, distinct customers on Mondays, and distinct customers in Januaries) while still performing acceptably. I’m not sure that the variants I sketched out in my head would actually perform faster, thanks to the “distinct” requirements.
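One such variant, sketched here against SQLite purely for illustration, folds the three distinct counts into a single pass with conditional `COUNT(DISTINCT ...)` expressions. The `sales` table, its columns, and its rows are assumptions rather than anything from Chris's example. This is not what Power BI generates, and whether it would beat three separate queries on a real warehouse is exactly the open question.

```python
import sqlite3

# Hypothetical sales table; the schema and rows are illustrative assumptions only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1, "2024-01-01"),  # a Monday in January
        (1, "2024-01-02"),  # a Tuesday in January
        (2, "2024-02-05"),  # a Monday in February
        (3, "2024-01-10"),  # a Wednesday in January
        (3, "2024-03-04"),  # a Monday in March
        (4, "2024-06-12"),  # a Wednesday in June
    ],
)

# CASE yields NULL for non-matching rows, and COUNT(DISTINCT ...) ignores NULLs,
# so each expression counts distinct customers within its own subset.
# In SQLite, strftime('%w') returns the day of week as text, with Sunday = '0'.
QUERY = """
SELECT
    COUNT(DISTINCT customer_id) AS all_customers,
    COUNT(DISTINCT CASE WHEN strftime('%w', order_date) = '1'
                        THEN customer_id END) AS monday_customers,
    COUNT(DISTINCT CASE WHEN strftime('%m', order_date) = '01'
                        THEN customer_id END) AS january_customers
FROM sales
"""

result = conn.execute(QUERY).fetchone()
conn.close()
```

The query reads the table once, but the engine still has to maintain three separate sets of distinct values, which is why a single statement does not automatically mean a faster one.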

Working with Recent Data in Dataflows Gen2

Penny Zhou sees recent datasets:

How much time do you spend navigating to the same data sources when building dataflows? Data preparation is an iterative process—you often return to the same sources as you refine your dataflows, add new transformations, or create similar workflows. If you find yourself repeatedly connecting to the same tables, files, or databases, the Recent data module in Dataflow Gen2 is designed for you. This feature reduces friction by providing quick access to your most frequently used data items, letting you focus on the transformation logic rather than navigation.

Click through to see how you can access the Recent data menu and what it includes.

A Primer on Data Storage in PostgreSQL

Grant Fritchey shares some thoughts:

The whole idea behind a database is the ability to persist the data. You want your inventory of widgets to get stored so you can look at it later. That means writing out to disks. However, what is writing to disk and where is it being written? Unlike SQL Server which has one (or more) big file for all data, PostgreSQL has a collection of a large number of files. There is a methodology and structure to these files that you need to understand in order to later understand how the data gets written to and retrieved from these files.

While we’re going to be very focused on file, page, folder, etc., throughout this article, that’s just part of the physical nature of persisting your data. What is being persisted is still the logical information you’re most interested in – rows and columns. I just wanted to emphasize the distinction between the two here.

Click through to see how PostgreSQL stores information.

Spark Schema Inference in Production

Miles Cole shares some advice:

To show the impact I want to highlight a benchmark that included Fabric Spark on a single 19GB CSV input file (100M Contoso dataset, sales table) for the benchmark. While there were a number of issues with this benchmark that inadvertently make Spark appear to be slow, this is only focused on the impact of inferring schema and practical recommendations.

Read on to see a performance problem that schema inference brings up. I’d also mention the risk of data updates blowing up your well-laid plans. Schema inference is a double-edged sword: it can be convenient and open up new approaches to development, but can just as easily cause unexpected failures.
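The data-update risk is easy to reproduce without Spark at all. The sketch below uses a deliberately naive, hypothetical inference routine over two tiny CSV payloads (the file contents and the `infer_schema` helper are assumptions, not Spark's implementation). It also hints at the performance cost, since inference has to read the values before any real work can start.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every value: int, float, or str."""
    for name, caster in (("int", int), ("float", float)):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "str"


def infer_schema(csv_text):
    """Naive inference: read every row, then pick a type per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {col: infer_type([row[col] for row in rows]) for col in rows[0]}


# Hypothetical daily extracts from the same upstream source.
monday_file = "order_id,quantity\n1,5\n2,7\n"
tuesday_file = "order_id,quantity\n3,4\n4,N/A\n"  # one placeholder value sneaks in

# The schema the downstream code was written against.
EXPECTED_SCHEMA = {"order_id": "int", "quantity": "int"}

monday_schema = infer_schema(monday_file)
tuesday_schema = infer_schema(tuesday_file)

# Columns whose inferred type silently drifted away from what was expected.
drifted = {
    col: inferred
    for col, inferred in tuesday_schema.items()
    if inferred != EXPECTED_SCHEMA[col]
}
```

Monday's file infers `quantity` as an integer, while a single `N/A` on Tuesday turns the whole column into a string. Declaring the schema up front moves that surprise to load time, where it fails loudly instead of surfacing in a downstream aggregation.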

A Primer on dbt against DuckDB

Robin Moffatt shares a tutorial on dbt:

In 2022 I made a couple of attempts to learn dbt, but it never really ‘clicked’.

I’m rather delighted to say that as of today, dbt has definitely ‘clicked’. How do I know? Because not only can I explain what I’ve built, but I’ve even had the 💡 lightbulb-above-the-head moment seeing it in action and how elegant the code used to build pipelines with dbt can be.

In this blog post I’m going to show off what I built with dbt, contrasting it to my previous hand-built method.

I had also heard of dbt but hadn’t really spent the time to learn it because I’m not really a data engineer. But this tutorial has me interested in diving in further.
