Press "Enter" to skip to content

Curated SQL Posts

Verti-Parquet and DirectLake in Fabric

Jordan Witcombe provides an explanation:

The VertiPaq engine cleverly uses columnar storage for efficient querying and processing. It employs multiple compression techniques, including Run-Length Encoding (RLE) and Dictionary Encoding, to minimise storage space. Through finding optimal sort orders and value encoding, it achieves maximum space efficiency and performance. VertiPaq also utilises ‘In-Memory Column Store’ for fast query performance, ‘Predicate Pushdown’ to eliminate unnecessary data at query time, and ‘Block Decompression’ to only decompress relevant data blocks, making it a powerhouse for data management and retrieval.

Now, because of these ingenious tricks, we wave goodbye to traditional file formats like JSON or CSV. Instead, all data stored within the managed area of Fabric and OneLake uses either Parquet or Delta. It’s time to embrace these efficient, high-performing formats that bring the best out of VertiPaq’s compressive power. Let’s explore these further in the next section.
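
To make those two compression ideas concrete, here is a toy sketch in Python. This is emphatically not VertiPaq's actual implementation, just the intuition behind dictionary encoding and run-length encoding:

```python
# A toy illustration of dictionary encoding and run-length encoding (RLE),
# not how VertiPaq actually implements them.
from itertools import groupby

column = ["red", "red", "red", "blue", "blue", "red"]

# Dictionary encoding: map each distinct value to a small integer ID
dictionary = {value: i for i, value in enumerate(dict.fromkeys(column))}
encoded = [dictionary[value] for value in column]   # [0, 0, 0, 1, 1, 0]

# RLE: collapse consecutive repeats of an ID into (ID, run length) pairs
rle = [(value, sum(1 for _ in run)) for value, run in groupby(encoded)]
print(rle)  # [(0, 3), (1, 2), (0, 1)]
```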

Read on for some comparisons in file size between Fabric and Databricks, as well as how they perform in Power BI.


Code Is a Liability

Nate McMaster (indirectly) talks dollars and cents:

Early in my software engineering career, a senior engineer at Microsoft told me “the best solution is one that requires no new code.” At the time, I thought this was nonsense. Is it not my role as a software engineer to write code? Why would writing less or no code be better? More code means more bug fixes, more features, more services, and more tools. So why is more not always better?

Fast forward to 2023 – now I am the most senior engineer on a team, and I give the same guidance. Prefer solutions that require less or no code.

What led to this shift in perspective? 

Read on for Nate’s answer, which is well-written and makes a lot of sense. It’s also close to a topic I’ve written about in the past.


Union in KQL

Robert Cain combines some tables:

In today’s post we will look at the union operator. A union will create a result set that combines data from two or more tables into a single result set.

Unlike the join, which was covered in my previous post Fun With KQL – Join, the union does not combine the columns from each table into single rows. Rather, it returns rows from the first table, then rows from the second table, then, if supplied, from the third, fourth, and so on.
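
As a quick illustration, here is a minimal KQL sketch; the tables and columns are hypothetical, built inline with datatable:

```kusto
// Two hypothetical tables with the same schema
let Sales2022 = datatable(Product: string, Amount: long)
[
    "Widget", 100,
    "Gadget", 250
];
let Sales2023 = datatable(Product: string, Amount: long)
[
    "Widget", 120,
    "Sprocket", 80
];
// union appends rows: first Sales2022's rows, then Sales2023's
Sales2022
| union Sales2023
```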

Read on to see how union works as an operator and for several examples.


Ranking with DAX

Marco Russo and Alberto Ferrari break the ties:

So far, DAX has offered two ranking functions: RANK.EQ and RANKX. While RANK.EQ is very seldom used, RANKX has been the primary ranking function for a long time. We wrote several articles about RANKX: one of them deals with ranking on multiple columns, a scenario that requires some DAX acrobatics to solve, as described in RANKX on multiple columns with DAX and Power BI.

The new RANK function makes ranking on multiple columns much easier because, as a member of the family of window functions, it offers sorting by multiple columns as a native feature. Thanks to the concept of the current row in window functions, RANK also helps with a subtle issue in RANKX which, despite being rare, might affect ranking on decimal numbers, as described in Use of RANKX with decimal numbers in DAX.
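
As a hedged sketch of what that native multi-column sorting looks like (the table, column, and measure names here are hypothetical):

```dax
-- Rank products by Sales Amount, breaking ties by Product Name
Rank by Sales =
RANK (
    DENSE,
    ALLSELECTED ( 'Product' ),
    ORDERBY ( [Sales Amount], DESC, 'Product'[Product Name], ASC )
)
```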

Read on to see how you can use RANK() and how it compares to RANKX().


Creating an Azure DevOps YAML Pipeline for SQL Server Deploys

Olivier Van Steenlandt updates to the new Azure DevOps model:

In one of my previous blog posts, I used the SQL Server database deploy task to deploy my DACPAC to SQL Server. Unfortunately, this task became deprecated in Release Pipelines. In this blog post, I would like to share the alternative.

Additionally, we will be moving from a Classic Release pipeline to a YAML pipeline. The YAML pipeline will be responsible for building and deploying our Database Projects.
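
For flavor, here is a minimal YAML sketch of that shape: build the database project, then publish the DACPAC with the SqlPackage CLI. The pool, paths, task choice, and the $(SqlConnectionString) variable are all assumptions, not necessarily what Olivier uses:

```yaml
trigger:
  - main

pool:
  vmImage: 'windows-latest'

steps:
  # Build the .sqlproj to produce a DACPAC (project name is hypothetical)
  - task: VSBuild@1
    inputs:
      solution: 'MyDatabase.sqlproj'
      configuration: 'Release'

  # Publish the DACPAC with SqlPackage; the connection string comes
  # from a pipeline variable or variable group
  - script: >
      SqlPackage /Action:Publish
      /SourceFile:"$(Build.SourcesDirectory)\bin\Release\MyDatabase.dacpac"
      /TargetConnectionString:"$(SqlConnectionString)"
    displayName: 'Deploy DACPAC'
```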

Click through for the walkthrough.


Commits, Auto versus Manual

Chen Hirsh switches platforms and hits a nasty surprise:

Anyway, I altered some views and stored procedures in the development environment, and after getting approval from the testers, tried to make the same changes in the production environment.

And then something strange occurred: I ran an alter view command, and DBeaver told me the command ran successfully. But when I ran a select from that view, it still had the old logic from before the change. Right-clicking the view and getting its code showed that it really hadn’t changed.

I solved that mystery quickly, though I’ve also worked with auto-commit platforms (SQL Server) and platforms where the default commit is manual (Oracle, Postgres) and have been burned the same way in the past.
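
The moral, if you find yourself on a manual-commit platform, is to commit explicitly. A minimal Postgres-style sketch with hypothetical object names:

```sql
BEGIN;  -- explicit transaction; DDL is transactional in Postgres

CREATE OR REPLACE VIEW sales_summary AS
SELECT product_id, SUM(amount) AS total_amount
FROM sales
GROUP BY product_id;

COMMIT;  -- until this runs, other sessions still see the old definition
```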


Finding Memory Grant Details in sp_WhoIsActive

Erik Darling has a video for us. There’s no graf that I can include here, so I’m stuck having to come up with my own explanation…

This is an interesting video covering a fairly new feature in sp_WhoIsActive. It also gives us some good information about how the underlying numbers represent pages rather than kilobytes or megabytes, even though the memory_info column reports its results in kilobytes.

Also, be sure to grab version 12 of sp_WhoIsActive.
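
If you want to try it yourself, the relevant parameter (added in version 12) is @get_memory_info:

```sql
-- Returns the memory_info column with memory grant details (values in KB)
EXEC dbo.sp_WhoIsActive
    @get_memory_info = 1;
```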


SSMS 19.1 Changes and Roadmap

Erin Stellato has one more post for us:

In SQL Server 2022, support for backup to and restore from S3-compatible object storage was introduced, and in SSMS 19.1 we have added support for restoring from S3 for both Azure SQL Managed Instance and on-premises databases. The Select backup devices dialog now includes S3 URL in the Backup media type drop-down.
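
The corresponding T-SQL looks roughly like this, sketched with a hypothetical S3-compatible endpoint, bucket, and credential:

```sql
-- Credential for an S3-compatible endpoint (names and keys are hypothetical)
CREATE CREDENTIAL [s3://s3.example.com:9000/sqlbackups]
WITH IDENTITY = 'S3 Access Key',
     SECRET   = 'AccessKeyID:SecretAccessKey';

-- Restore from the S3 URL (SQL Server 2022 and later)
RESTORE DATABASE SalesDB
FROM URL = 's3://s3.example.com:9000/sqlbackups/SalesDB.bak'
WITH RECOVERY;
```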

This is something I like. But do click through to learn more about SSMS (spoiler: it’s not going away and there will be an SSMS 20).


Calculating Time Series Differences

Steven Sanderson notices the difference:

The diff() function in R calculates the differences between consecutive elements in a vector or a time series. It takes a single argument, which is the input vector, and returns a new vector with the differences. This function is particularly useful for analyzing the rate of change, identifying patterns, and detecting anomalies in your data. 
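
A quick sketch of the behavior:

```r
x <- c(10, 12, 15, 11, 18)

diff(x)                   # 2  3 -4  7   (consecutive differences)
diff(x, lag = 2)          # 5 -1  3      (differences two steps apart)
diff(x, differences = 2)  # 1 -7 11      (difference of the differences)
```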

Read on to see how you can use it, as well as some examples of usage.


Auto-Pausing Synapse Dedicated SQL Pools

Mark Broadbent saves some money via pool auto-pausing:

This capability is neither earth-shatteringly new nor unexpected, and it is something that Databricks has provided for some time. Of the two Data Exploration & Data Warehousing Pool types, Synapse Serverless Pool (otherwise known as the built-in Pool) by its very definition does not incur compute charges when it is not running.

Therefore this leaves us with only dedicated SQL Pool to worry about and this is where our problems begin.

Click through for the scripts to pause and resume a dedicated SQL pool, and Mark promises a part 2 in which we see the automation.
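
As a taste of what such scripts can look like, the Az.Synapse PowerShell module has suspend and resume cmdlets. The workspace and pool names here are hypothetical, and this is not necessarily Mark's approach:

```powershell
# Pause (suspend) a dedicated SQL pool to stop compute billing
Suspend-AzSynapseSqlPool -WorkspaceName 'myworkspace' -Name 'mydedicatedpool'

# Resume it when you need compute again
Resume-AzSynapseSqlPool -WorkspaceName 'myworkspace' -Name 'mydedicatedpool'
```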
