Press "Enter" to skip to content

Curated SQL Posts

Taking a Billion Taxi Rides with DuckDB

Mark Litwintschik tries out DuckDB:

DuckDB is an in-process database. Rather than relying on a server of its own, it’s used as a client. The client can work with data in memory, within DuckDB’s internal file format, in database servers from other software developers, and in cloud storage services such as AWS S3.

This choice to not centralise DuckDB’s data within its own server, paired with being distributed as a single binary, makes installing and working with DuckDB much less complex than, say, standing up a Hadoop cluster.

The project isn’t aimed at very large datasets. Despite this, its ergonomics are enticing enough and it does so much to reduce engineering time that workarounds are worth considering. The rising popularity of analysis-ready, cloud-optimised Parquet files is removing the need for substantial hardware when dealing with datasets in the 100s of GBs or larger.

Read on to learn more about DuckDB, how it differs from SQLite, and a bit of nuttiness around how far you can push an in-memory database.
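
To give a sense of the ergonomics, here is a minimal sketch of the kind of query involved: DuckDB reading Parquet files straight out of object storage. The bucket path is made up, and you would still need to point the httpfs extension at your S3 credentials.

    -- Query Parquet files in place; no load step required.
    -- The path below is hypothetical.
    INSTALL httpfs;
    LOAD httpfs;

    SELECT passenger_count,
           count(*)          AS trips,
           avg(total_amount) AS avg_fare
    FROM read_parquet('s3://my-bucket/taxi/*.parquet')
    GROUP BY passenger_count
    ORDER BY trips DESC;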


Restorable Dropped Databases Naming in Azure SQL DB

Tanayankar Chakraborty asks, what’s in a name?:

An issue was reported recently where the customer complained that in their cost analysis report of their Azure SQL DBs, the db name appears appended with a comma (,) and a number. While they agreed with the DB name in the report, they didn’t understand the number after the comma and its significance. This is what the cost analysis report looks like:

Click through for a redacted version of the report, showing an example of the database in question, as well as an explanation of what this number means.


Creating Dynamic Moving Averages with Visual Calcs and Numeric Parameters

Erik Svensen builds a dynamic moving average:

The introduction of visual calculations in the February 2024 release of Power BI Desktop (https://powerbi.microsoft.com/en-us/blog/visual-calculations-preview/) gives us some new possibilities to add calculations on the individual visual, and some new functions give us some exciting options.

One example could be to use the MOVINGAVERAGE function (link) and combine it with a numeric range parameter to make it dynamic.

Click through for a video and a description of how to do it.
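
If you think in SQL rather than DAX, the underlying idea is the familiar windowed average. Here is a rough T-SQL analogue (table and column names are invented); the Power BI version gets its dynamism from the numeric range parameter, whereas the frame size below is hard-coded.

    -- Three-period moving average over a monthly sales table (hypothetical schema).
    SELECT
        MonthStart,
        SalesAmount,
        AVG(SalesAmount) OVER
        (
            ORDER BY MonthStart
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS MovingAvg3Month
    FROM dbo.MonthlySales
    ORDER BY MonthStart;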


A Dive into Direct Lake

Nikola Ilic digs into Direct Lake:

The most common question I’m hearing these days from clients is – how can we refresh the Direct Lake semantic model? It’s a fair question. Since they have been relying on Import mode for years, and Direct Lake promises an “import mode-like performance”…So, there has to be a similar process in place to keep your data up to date, right?

Read on to learn the answer to this question and quite a few more.


Thoughts on Common Table Expressions

Erik Darling has opinions:

Much like joins and Venn diagrams, anyone who thinks they have some advanced hoodoo to teach you about common table expressions is a charlatan or a simpleton. They are one of the least advanced constructs in T-SQL, and are no better or worse than any other abstraction layer, with the minor exception that common table expressions can be used to build recursive queries.

As I read through the post, I kept wanting to disagree with Erik more than I do. My short form is, I aesthetically prefer common table expressions to subqueries. But that doesn’t make CTEs faster.
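
As a quick illustration of that point, the two queries below are logically identical, and SQL Server expands a non-recursive common table expression just like a derived table; recursion is the one trick reserved for the CTE. Table names here are made up.

    -- A CTE and the equivalent derived table: same plan shape, different packaging.
    WITH RecentOrders AS
    (
        SELECT CustomerID, COUNT(*) AS OrderCount
        FROM dbo.Orders
        WHERE OrderDate >= '20240101'
        GROUP BY CustomerID
    )
    SELECT c.CustomerName, ro.OrderCount
    FROM dbo.Customers AS c
    JOIN RecentOrders AS ro
        ON ro.CustomerID = c.CustomerID;

    SELECT c.CustomerName, ro.OrderCount
    FROM dbo.Customers AS c
    JOIN
    (
        SELECT CustomerID, COUNT(*) AS OrderCount
        FROM dbo.Orders
        WHERE OrderDate >= '20240101'
        GROUP BY CustomerID
    ) AS ro
        ON ro.CustomerID = c.CustomerID;

    -- Recursion is the exception: only a CTE can reference itself.
    WITH Numbers AS
    (
        SELECT 1 AS n
        UNION ALL
        SELECT n + 1 FROM Numbers WHERE n < 10
    )
    SELECT n FROM Numbers;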


Support for Iterations in SQL Server 2022 Password Validation

Michael Howard has an update for us:

We all know that as security threats evolve, we must update our defenses to mitigate newer threats. Over the last few months, some customers have asked us to strengthen the way we secure passwords in SQL Server. The most often-cited reference by customers is to comply with NIST SP 800-63b.

Currently supported versions of SQL Server and Azure SQL DB use a SHA-512 hash with a 32-bit random and unique salt. It is statistically infeasible for an attacker to deduce the password knowing just the hash and the salt. It is considerably easier for an attacker to hunt for insecure storage of database connection strings that contain database credentials than it is to break the password verifier (also called a password authenticator) used by SQL Server and Azure SQL DB. But that’s a discussion for another day and is the main reason we highly recommend using Entra ID authentication rather than using uid/pwd-based connections because Entra ID authentication manages credentials and supports access policies.

Michael gives you an idea as well of why iterations are valuable and how you can tell if your SQL-authenticated accounts are using this new setup.

I’m curious how much of an effect this will have on login time. Typically, using PBKDF (or PBKDF2) does increase login time by a couple hundred milliseconds—which is the entire point of the exercise, ensuring that an attacker cannot feasibly brute force passwords in reasonable time—so I would not expect the difference to be too noticeable. For a really busy system, however, I could see some small impact.
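
If you want to poke at what your own instance reports today, LOGINPROPERTY already exposes the hash algorithm behind each login’s password verifier. Whether and how the new iterated scheme surfaces there is the sort of detail Michael’s post covers, so treat this as a starting point rather than a definitive check.

    -- SQL-authenticated logins and the hash algorithm behind their password verifiers.
    -- Per the LOGINPROPERTY documentation: 0 = NONE, 1 = SQL7.0, 2 = SHA-1, 3 = SHA-2.
    SELECT
        name,
        LOGINPROPERTY(name, 'PasswordHashAlgorithm') AS password_hash_algorithm,
        LOGINPROPERTY(name, 'PasswordLastSetTime')   AS password_last_set
    FROM sys.sql_logins
    ORDER BY name;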


Plotting Training and Testing Results with tidyAML

Steven Sanderson builds a plot:

In the realm of machine learning, visualizing model predictions is essential for understanding the performance and behavior of our algorithms. When it comes to regression tasks, plotting predictions alongside actual values provides valuable insights into how well our model is capturing the underlying patterns in the data. With the plot_regression_predictions() function in tidyAML, this process becomes seamless and informative.

Read on to see how the function works and the kind of result you can expect from it.


Feature Engineering with Azure ML and Microsoft Fabric

Siliang Jiao, et al, talk architecture:

Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. The extracted features are used for training the models that can predict values for relevant business scenarios. A feature engineering system provides the tools, processes, and techniques used to perform feature engineering consistently and efficiently. 

This article elaborates on how to build a feature engineering system based on Azure Machine Learning managed feature store and Microsoft Fabric. 

Click through to see how the pieces fit together.


Copilot in Microsoft Fabric Dataflows Gen2

Reza Rad shows off a capability:

There has been a lot of hype recently about Generative AI and Copilot in Microsoft. Microsoft Fabric incorporates many of those features, and one of the areas it has been added to is the Dataflow Gen2 in Microsoft Fabric, or we can also call it Power Query in Power BI Service Dataflows. In this article and video, I will describe how the Copilot works with Data Factory Dataflow Gen2, its requirements, and its examples.

Click through for the video and the article. The thing that I believe will keep many people from using this is that you need a Microsoft Fabric capacity of F64 or greater to get access to Copilot. That’s a pretty hefty requirement.
