Curated SQL Posts

Fabric Benchmarking: Moving CSV Files

Eugene Meidinger breaks out the abacus:

First, a disclaimer: I am not a data engineer, and I have never worked with Fabric in a professional capacity. With the announcement of Fabric SQL DBs, there’s been some discussion on whether they are better for Power BI import than Lakehouses. I was hoping to do some tests, but along the way I ended up on an extensive Yak Shaving expedition.

I have likely done some of these tests inefficiently. I have posted as much detail and source code as I can and if there is a better way for any of these, I’m happy to redo the tests and update the results.

Part one focuses on loading CSV files to the files portion of a lakehouse. Future benchmarks will look at CSV to Delta and Power BI imports.

I think Eugene did a fine job documenting everything in the process, and it was interesting to see relative price differences between different techniques for uploading a very large CSV file.

Working with the Azure AI Document Service

Tomaz Kastrun continues a series on Azure AI. First up is a visual review of the Azure AI Document service:

Vision and Document services give your apps the ability to analyze images, process documents, and use optical character recognition (OCR) technologies in combination with machine learning.

That product has gone through a few name iterations, including Form Recognizer. But wait, there’s more!

Tomaz also takes a look at the Python SDK:

The Vision and Document SDK for Python gives you extra extensibility, letting you add the services to your apps.

To use the Vision and Document SDK with Python, you will need to have the resource up and running (for starters, go with the free pricing tier, F0) and get the Document Intelligence API key and endpoint address.

Click through for an example of how that works.

Thoughts on Data Document Formats

Phil Factor shares some musings:

What can be so difficult in creating a sensible standard for Structured Data Documents? To understand why they tend to get improved into unusable complexity, I’ll need to explain a bit of background.

Structured Data Documents come in three different flavors. There are the text files that represent object data, text files that represent tabular data (rows and columns) and text data for the values of the settings, initialization or configuration of applications.

Read on for Phil’s take on the matter.

Fractional Path Performance Issues in Postgres Partitioned Tables

Andrei Lepikhov digs into an interesting finding:

While the user notices the positive aspects of a technology, a developer, who usually encounters its limitations, shortcomings, or bugs, views the product from a completely different perspective. The same thing happened this time: after publishing the comparative testing results, where Join-Order-Benchmark queries were run against a database with and without partitions, I couldn’t shake the feeling that I had missed something. In my mind, Postgres should build a worse plan with partitions than without them. And this should not be just a bug but a technological limitation. On second thought, I found a weak spot: queries with limits.

Read on to see what Andrei came up with.

Azure SQL DB String Concatenation and JSON Functions

Magda Bronowska takes a look at some functionality currently available only in Azure SQL Database and Managed Instance:

Microsoft releases the classic SQL Server every couple of years, with some functionality added through regular updates. On the other hand, the SQL Server offering in Azure (Azure SQL Database and Managed Instance) receives the latest features earlier.

This post highlights some of the T-SQL functions currently available in Azure SQL but not yet in classic SQL Server. However, with the recent announcement of SQL Server 2025, this might change next year. Keep in mind that some of these functions are in preview, so their behavior might evolve as they reach general availability.

Click through for those examples.

Running Oracle on Windows

Kellyn Gorman embraces the better part of valor:

For many DBAs, the thought of running Oracle on a Windows OS induces a collective cringe. Even for someone like me, with a career spanning both Microsoft and Oracle technologies, it’s a combination I typically avoid.

However, there are scenarios—driven by licensing, software requirements, or other factors—where deploying Oracle on Windows becomes the logical choice.

Read on for some pain points and a few tips to minimize them.

Sending Alerts from Fabric Workspace Monitoring

Chris Webb has a new Bat-signal:

I’ve always been a big fan of using Log Analytics to analyse Power BI engine activity (I’ve blogged about it many times) and so, naturally, I was very happy when the public preview of Fabric Workspace Monitoring was announced. It gives you everything you get from Log Analytics and more, all from the comfort of your own Fabric workspace. Apart from my blog, there are lots of example KQL queries out there that you can use with Log Analytics and Workspace Monitoring, for example in this repo or Sandeep Pawar’s recent post. However, what is new with Workspace Monitoring is that if you store these queries in a KQL Queryset, you can create alerts in Activator, so when something important happens you can be notified of it.

Read on to learn more.

VACUUM FULL in PostgreSQL

Umair Shahid goes full vacuum and you never go full vacuum:

If you have worked with PostgreSQL for a while, you have probably come across the command VACUUM FULL. At first glance, it might seem like a silver bullet for reclaiming disk space and optimizing tables. After all, who would not want to tidy things up and make their database more efficient, right?

But here is the thing: while VACUUM FULL can be useful in some situations, it is not the hero it might seem. In fact, it can cause more problems than it solves if you are not careful.

Read on to learn what it does and why it’s not always a good idea.

Churn Analysis using Logistic Regression in Python

Daniel Calbimonte takes us through a churn analysis scenario:

This article explains how to analyze the data using Python and perform customer churn analysis to determine why customers stop using a service.

Read on for the article. If you want to dig deeper into churn analysis, I can recommend a book entitled Fighting Churn with Data. Its focus is more on categorical and numerical analysis than on statistical classification techniques like logistic regression for identifying churn factors. That makes it easier to digest for non-statisticians, especially because most of the code is SQL.
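
For a rough sense of what logistic regression does in a churn setting, here is a minimal sketch in plain Python. It is not Daniel's code: the data is synthetic, and the two features (`tenure_years` and `support_tickets`), the coefficients used to generate labels, and the helper names are all assumptions made up for illustration. The sign of each fitted weight hints at whether a factor pushes customers toward or away from churning.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, x):
    # weights[0] is the intercept; the rest line up with the features in x.
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def fit_logistic(rows, labels, lr=0.1, epochs=1000):
    # Batch gradient descent on the average log loss.
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict_proba(weights, x) - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Synthetic customers (hypothetical): churn is made more likely by short
# tenure and by many support tickets. These generating coefficients are
# invented for the example, not taken from any real data set.
random.seed(42)
rows, labels = [], []
for _ in range(200):
    tenure_years = random.uniform(0, 5)
    support_tickets = random.randint(0, 5)
    true_logit = 0.5 - 1.2 * tenure_years + 0.9 * support_tickets
    churned = 1 if random.random() < sigmoid(true_logit) else 0
    rows.append([tenure_years, support_tickets])
    labels.append(churned)

weights = fit_logistic(rows, labels)
print("intercept, tenure weight, tickets weight:", weights)

# Compare a new customer with many tickets to a long-tenured quiet one.
risky = predict_proba(weights, [0.5, 5])
loyal = predict_proba(weights, [4.5, 0])
print("churn probability (risky vs loyal):", risky, loyal)
```

In practice you would more likely reach for a library such as scikit-learn or statsmodels, which also report standard errors and handle regularization; the hand-rolled loop here is only meant to show the mechanics.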

The Showdown: Spark vs DuckDB vs Polars in Microsoft Fabric

Miles Cole puts together a benchmark:

There’s been a lot of excitement lately about single-machine compute engines like DuckDB and Polars. With the recent release of pure Python Notebooks in Microsoft Fabric, the excitement about these lightweight native engines has risen to a new high. Out with Spark and in with the new and cool animal-themed engines. Is it time to finally migrate your small and medium workloads off of Spark?

Before writing this blog post, honestly, I couldn’t have answered with anything besides a gut feeling largely based on having a confirmation bias towards Spark. With recent folks in the community posting their own benchmarks highlighting the power of these lightweight engines, I felt it was finally time to pull up my sleeves and explore whether or not I should abandon everything I know and become a DuckDB and/or Polars convert.

Read on for the method and results from several thoughtful tests.
