Press "Enter" to skip to content

Curated SQL Posts

Adding Patterns to ggplot2 Plots

Zhenguo Zhang adds some patterns:

Adding patterns to plots is a great way to improve accessibility (making plots colorblind-friendly) and to add an extra dimension of information. The ggpattern package provides a rich set of tools to achieve this in ggplot2.

I’m personally not the biggest fan of patterns. I see them as a point of necessity when dealing with grayscale circumstances, such as printing out a chart in an academic journal. But it’s very easy to overdo patterns and end up making a mess of the visual.

But one side note about color vision deficiency and plots: make sure that your plots are monochrome-friendly because somebody probably will try to print out your chart or view it on a grayscale-only device. Or might actually be monochromatic.

Leave a Comment

The Disappointment of Parameter-Sensitive Plan Optimization

Hugo Kornelis is back with another video:

Bad hair day? Try having an almost-no-hair month!

Jokes aside. It has been almost six weeks since my last video blog. Not really the schedule I had planned. But I believe I have good reasons.

Anyway, I do have a new video ready now. As promised in my last video, I now cover Parameter Sensitive Plan Optimization (PSPO), a new feature, introduced in SQL Server 2022, that is supposed to alleviate the pain of bad parameter sniffing.

Read on for the promise and letdown of PSPO.

Leave a Comment

An Overview of Power BI File Formats

Reza Rad covers a multitude of formats:

Power BI files come in multiple formats. If you have been working with Power BI Desktop for a while, you probably know the PBIX format well. But there are other formats too: PBIT, PBIP, and PBIR. The last two are more recent additions, and there is also a language called TMDL (Tabular Model Definition Language) that comes with its own file type.

In this blog post, I am going to explain the differences between all of these formats, why each one exists, and which one you should use for your scenarios. I will also show you, through a practical demo, what PBIP looks like and how you can use it to make your development process significantly better.

Click through for the video and article.

Leave a Comment

Foreign Tables and Materialized Views in PostgreSQL

Richard Yen provides a write-back system for analysts:

I recently wrote a post about WAL log shipping and how a standby built on log shipping is a great way to give data analysts production data without putting the primary at risk. Having access to the production data in this way is great, but it’s read-only. How can we create views of this data for better analytics work? I want to make the case today that Foreign Data Wrappers and Materialized Views can make a great solution – not only in accessing production Postgres data, but also working with other data sources.

Click through for an architectural discussion of how they can work together.

Leave a Comment

An FAQ-Based Introduction to Data Factories in Azure

Koen Verbeeck answers some questions:

Is Microsoft Fabric replacing Azure Data Factory?

Officially, no. Or maybe not yet. At the time of writing, ADF still remains a separate product but it’s noticeable that more new features are added to Fabric than to ADF. There are still many customers using ADF, so Microsoft might keep the service around for a while. There’s also still a bit of a feature gap between the two services, but this becomes more narrower each month. Microsoft is offering migration scenarios from ADF to Fabric.

I picked this question because of how much the concept annoys me. There are three separate Data Factory code bases in Azure with overlapping but not matching functionality (which is how you can tell it’s multiple code bases and not just one code base reskinned). This can lead to a scenario where Person A says, “Oh, do this thing in Data Factory.” Person B then says, “But I can’t do that in Data Factory.” Person A’s response: “Oh, that’s weird, because I can do it in Data Factory.” This leads to necessary but somewhat absurd clarifications around how you need to use Microsoft Fabric Data Factory, not Azure Data Factory because, even though Microsoft Fabric Data Factory is hosted in Azure, it’s a different product.

And don’t get me started on the wide variety of KQL platforms, all of which are subtly different.

Leave a Comment

Concurrent Evaluation with Microsoft Fabric Dataflows Gen2

Chris Webb runs multiple jobs at once:

Did you know that if your Fabric Dataflows Gen2 contains several queries then you can control how many of them are evaluated in parallel when your dataflow refreshes? In this series I’ll look at how how you can do this and how it may result in better performance – at least in some cases.

Let’s start with the basics. I created a Dataflow Gen2 with ten queries which each returned a table of one row and one column after one minute. I used the #table function to generate the table without connecting to a data source, code from this post to add the delay and the trick in this post to make sure the delay was only applied when the dataflow refreshed. The output of each query was loaded to a Fabric Warehouse.

Click through for a demonstration.

Leave a Comment

Estimating Probabilities from Unevenly Collected Data

Nina Zumel answers an important question:

In this article, we look at the problem of estimating and comparing probabilities about a population of subjects from unevenly collected observations. Some examples might include:

  • The perceived quality of a movie (how often is a movie positively reviewed) when some movies have far more reviews than others.
  • The effectiveness of various ad campaigns, when some compaigns have had more exposure than others.
  • The efficacy of a certain medical procedure by hospital, when some hospitals have had more cases than others.

For our specific task, we’ll try to estimate the “innate” batting ability (the probability of making a hit when at bat) of major league baseball players in 2023. For the sake of this article, we will take this single season of data as everything that we know about these players and their batting statistics.

It’s an interesting problem because she’s looking at 2023 data as an estimation of the player’s entire career, with the goal of estimating how a player will perform overall given a fairly reasonably sized sample of information collected from one relatively short period of that player’s career. H/T John Mount.

Leave a Comment

A Challenge of Visualizing Game Statistics

Kieran Healy scratches his head:

I just finished driving a very long way up the side of the country, so I’m kind of tired. But even allowing for that, boy, this way of representing things really is quite confusing. Not being an Apple Sports user I had to look at it for a bit to understand what was happening. But, now that it has given me a headache, I can kind of see why whoever designed this ended up in the undoubtedly bad place they did.

Before I get to why I have some sympathy for the designer, why did I find this representation of these numbers so disorienting? It’s not just just because I’ve been driving for nine hours. John is right to call the picture a “Zero Sum” representation. The design strongly suggests to the viewer that, within each row, we’re looking at each team’s share of a total. Each pair of black and blue lines seem to be vying for control of their whole row, with the longest line being the “winner” in each case.

Click through for the challenge, as well as a trio of attempts to improve the results. The tornado chart at the end is probably what I’d go with if I needed to include all of this on a single chart. H/T R-Bloggers.

Leave a Comment

A Look at Tabular Foundation Models

Michael Mayer tries out a neural network model:

Tabular data has had a comfortable life for years. Gradient boosting showed up, got very good at its job, and then quietly became the default answer to almost everything with rows and columns.

In very recent years, a new player has arrived: the tabular foundation model or prior fitted neural network, and suddenly tabular data is sounding a lot less sleepy…

I’ve done a bit with TabPFN and come away fairly impressed. I’ll have to give this a go as well. There are definite limitations to data sizes before things fall over, but for moderate sizes (50k or fewer rows), TabPFN at least worked pretty well.

Leave a Comment

Dimensional Testing in Kafka

Jack Vanlightly announces a new tool:

Most of my career in distributed systems has been as a tester, performance engineer and formal verification specialist. I’ve written performance benchmarking tools in the past, for RabbitMQ and Apache Pulsar but in recent years I’ve used OpenMessagingBenchmark (OMB) to run benchmarks against Apache Kafka and other messaging systems. But OMB is hard to deploy and has several limitations compared to more sophisticated benchmarking systems I’ve developed in the past. With Claude becoming so much better since Christmas I decided to write a Kafka-centric performance benchmarking tool, with a lot of inspiration from OMB. I took the bits I like about OMB and the things I like about the tooling I’ve built in the past, to make a performance testing tool for testing Apache Kafka.

Click through for an overview of the tool and how it works.

Leave a Comment