Press "Enter" to skip to content

Curated SQL Posts

Data Temperature in Microsoft Fabric

Marc Lelijveld breaks out the thermometer:

As part of Microsoft Fabric, a new storage mode to connect from Power BI to data in OneLake has been introduced. Direct Lake it makes to possible to use your data from OneLake in Power BI without taking an additional copy of the data. Where Direct Lake promises to deliver the performance of Import-mode with the real-time capabilities of Direct query, it is time to have a closer look how data gets loaded into memory and delving into the concept of data dictionary temperature.

In this blog I will explain when data gets loaded into memory, elaborate on how you can measure the dictionary temperature of your data and the effect of queries on the temperature.

Click through to see what affects this measure and how.

Comments closed

The Risk of Changing MaxDOP

Erik Darling recommends caution:

Like in yesterday’s post about Cost Threshold For Parallelism, changing MAXDOP settings will have a universal effect on the workload.

This is true whether you change it at the server level for all databases, or at the database level using a database scoped configuration for a single database.

It is a guardrail to prevent unwanted conditions as a whole, like excessive concurrent parallel queries causing worker thread starvation (THREADPOOL waits), or just pushing CPU to 100% for extended periods of time.

Read on to see what Erik recommends you think about after any MaxDOP change.

Comments closed

Faceted Images in ggplot2

Steven Sanderson shows multiple plots on one image:

Data visualization is a crucial tool in the data scientist’s toolkit. It allows us to explore and communicate complex patterns and insights effectively. In the world of R programming, one of the most powerful and versatile packages for data visualization is ggplot2. Among its many features, ggplot2 offers the facet_grid() function, which enables you to create multiple plots arranged in a grid, making it easier to visualize different groups of data simultaneously.

In this blog post, we’ll dive into the fascinating world of facet_grid() using a practical example. We’ll generate some synthetic data, split it into multiple groups, and then use facet_grid() to create a visually appealing grid of plots.

Read on for the demo script. The text talks about facet_grid() and the demo is facet_wrap(). The two behave very similarly, though they have slightly different use cases.

Comments closed

Building a Retry Mechanism for sqlcmd in Bash

Jose Manuel Jurado Diaz won’t let failure get him down:

Introduction:

Efficiently managing temporary failures and timeouts is crucial in production environments when connecting to databases. In this article, we’ll explore how to implement a retry mechanism with sqlcmd in a Bash script, dynamically increasing timeouts with each failed attempt.

Problem Statement:

Operations can fail due to network issues, overloaded servers, or other temporary problems when interacting with databases. Implementing a retry mechanism helps address these temporary issues without manual intervention.

Read on for the solution script. You could also adapt this to Powershell fairly easily, I think, though if you do go down that road, I’d recommend taking a look at Polly and PsPolly.

Comments closed

Thoughts on Cost Threshold for Parallelism

Erik Darling has some thoughts:

First, I’m not suggesting that anyone should be using the default value for Cost Threshold For Parallelism. It’s old and moldy and not a good fit for most workloads functioning on modern hardware.

My apologies to Azure SQLDB users who can’t change this setting and leave it up to Microsoft to maybe manage it for them based on ???

Some people out there really like fiddling with settings in a usually ill-informed reaction to Some Script They Found On The Internet, without reading the fine print.

Erik’s thoughts are reasonable overall. My recommendation is to use Michael J. Swart’s technique for tuning cost threshold for parallelism as a starting point, as it gives you a basis for what the net effect of your changes are.

Comments closed

Visualizing Data in R with ggplot2

Adrian Tam continues a series on R:

One of the most popular plotting libraries in R is not the plotting function in R base, but the ggplot2 library. People use that because it is flexible. This library also works using the philosophy of “grammar of graphics”, which is not to generate a visualization upon a function call, but to define what should be in the plot, and you can refine it further before setting it into a picture. In this post, you will learn about ggplot2 and see some examples. In particular, you will learn:

  • How to make use of ggplot2 to create a plot from a dataset
  • How to create various charts and graphics with multiple facades using ggplot2

It takes a little while to understand the grammar of graphics approach that ggplot2 takes, but once you do, you realize just how good this library is for generating static images.

Comments closed

Using DVC to Store Data Science Artifacts in Azure

I have a new video up:

In this video, we introduce DVC, a tool for version control management of data science and machine learning artifacts. We learn why Git isn’t the best place to store those large data files, how DVC integrates with Git, and how you can save your files in Azure Blob Storage.

Click through for the video, as well as a variety of links which helped me put it together.

Comments closed

SparkSQL CONCAT vs T-SQL CONCAT

Bill Fellows has a public service announcement:

The concat function is super handy in the database world but be aware that the SQL Server one is way better because it solves two problems. It combines everything into a string and it does not require NULL checking. In the before times, one had to down cast to a n/var/char type as well as check for NULL before appending strings via the plus sign.

The point of difference is so important that Bill busted out the marquee HTML tag. Which now leads me to wonder, was marquee or blink the bigger evil in the mid-to-late ’90s web?

Comments closed