Curated SQL – Page 109 – A Fine Slice Of SQL Server

Emitting Data to a Single CSV in Spark

Published 2025-01-28 by Kevin Feasel

To write and read data faster, Spark splits the work between nodes in a cluster, each reading\writing part of the data. That’s why, in the screenshot above, there are 3 CSV files (That’s the files starting with “Part”, with a CSV extension), instead of 1. Note that this can also occur when working with a single node cluster since Spark splits the work into tasks.

This behavior is great if you intend to keep working with the CSV files in Databricks since reading will be faster. But if you want to share this file with someone outside of Databricks, this may be inconvenient.

Read on for two ways of doing this, as well as the price you pay to get it done.

Comments closed

Dropping a Role in PostgreSQL

Published 2025-01-28 by Kevin Feasel

Josephine Bush drops a role:

You can’t just exec DROP ROLE your_role_name; if it’s granted perms or other roles are granted to it. I had to go fishing to find all the grants to revoke them. Note: if you are worried about re-granting later, you can always fiddle with this to output the grants for these perms as a rollback.

Read on for a few scripts to help out with finding what that role owns, revoking rights, and reassigning ownership.

Comments closed

Truncating All Tables while Preserving Foreign Keys in T-SQL

Published 2025-01-28 by Kevin Feasel

Ronald Kraijesteijn builds a script:

When testing a data warehouse, a common challenge is managing large datasets effectively. Often, you need to reset tables to a clean state, ensuring consistent testing environments. The most efficient way to clear a table is using the SQL command TRUNCATE TABLE. However, this command is not straightforward when foreign key constraints are present. In this article, we’ll explore a solution that temporarily disables constraints, allows truncation, and then restores the constraints—keeping your data model intact.

Click through for the script, which saves a record of all of the foreign key constraints, truncates each table, and then re-creates the foreign keys.

Comments closed

Automatically Refreshing a Power BI Semantic Model after Dataflow Loads

Published 2025-01-28 by Kevin Feasel

Reza Rad refreshes a model:

Although this seems to be a simple thing to do, it is not a function that you can turn on or off. If you have a Dataflow that does the ETL and transforms and prepares the data, then to get the most up-to-date data into the report, you will need to refresh the Power BI semantic model after that, only upon successful refresh of both dataflow and semantic model is when you will have the up-to-date data into the report. Fortunately, in Fabric, this is a straightforward setup. In this article and video, I’ll explain how this is possible.

Click through for the video and the blog post. Granted, this feature is in preview, but using it is pretty straightforward.

Comments closed

Calculating the Distance between Points via T-SQL

Published 2025-01-28 by Kevin Feasel

Sebastiao Pereira gets out the measuring tape:

How do you calculate the distance between two different points on a sphere using TSQL in SQL Server? In this article, we look at how this can be done to calculate different points on the globe.

Sebastiao first shows the raw calculation, then uses the GEOGRAPHY data type to simplify the job. Note that these are “as the crow flies” distances and not travel distances, as there’s no information on roads.

Also, these are calculations specific to Earth. Which sounds like the setup for a joke, but it’s really not. If you have a smaller sphere (or oblong spheroid, if you will) and you want to calculate the distance, use the GEOMETRY data type instead.

Comments closed

Power BI Embedded and Direct Lake Mode

Published 2025-01-28 by Kevin Feasel

Hasan Abo Shally announces a new preview feature:

As we step into 2025, we’re excited to announce the preview of Power BI Embedded with Direct Lake Mode, a new feature designed to enhance how developers and Independent Software Vendors (ISVs) provide embedded analytics in their applications. This capability, available in preview starting Q1 2025, leverages the power of Direct Lake Mode to allow an even enhanced performance and experience for embedded analytics.

This probably affects a small audience, though the functionality is good to see.

Comments closed

Drawbacks of Version Store in SQL Server

Published 2025-01-28 by Kevin Feasel

Haripriya Naidu points out that nothing comes for free:

With every feature comes both advantages and disadvantages and it is important to be aware of both. In regards to version store, we had discussed how to understand version store better and how version store is enabled.

Today, we will talk about the trade-offs and implications that come along when version store is enabled.

Click through for the downsides. With storage prices and speeds where they are, the tradeoffs of using the version store for RCSI or Snapshot isolation are almost always worth it. But it’s still good to know what those trade-offs are and how you can end up in a degenerate state.

Comments closed

An Overview of HyperLogLog

Published 2025-01-27 by Kevin Feasel

Bhala Ranganathan talks about a powerful algorithm:

Cardinality is the number of distinct items in a dataset. Whether it’s counting the number of unique users on a website or estimating the number of distinct search queries, estimating cardinality becomes challenging when dealing with massive datasets. That’s where the HyperLogLog algorithm comes into the picture. In this article, we will explore the key concepts behind HyperLogLog and its applications.

HyperLogLog is the algorithm that SQL Server users in the APPROX_COUNT_DISTINCT() function to make it so much faster than a regular COUNT(DISTINCT) while still providing correctness guarantees within a fixed percentage error: they guarantee a 2% or lower error rate with a 97% probability.

Comments closed

Partitioned Tables and Indexes in PostgreSQL

Published 2025-01-27 by Kevin Feasel

Hettie Dombrovskaya runs into an error:

Here is a story. When anyone gives a talk about partitions, they always bring up an example of archiving: there is a partitioned table, and you only keep “current” partitions, whatever “current” means in that context, and after two weeks or a month, or whatever interval works for you, you detach the oldest partition from the “current” table and attach it to the “archived” table, so that the data is still available when you need it, but it does not slow down your “current” queries.

So here is Hettie confidently suggesting that a customer implement this technique to avoid querying a terabyte-plus-size table. A customer happily agrees, and life is great until one day, an archiving job reports an error of a “name already exists” for an index name.

Read on to learn why.

Comments closed

MDX vs DAX for Bulk Data Extraction from Power BI

Published 2025-01-27 by Kevin Feasel

Chris Webb performs a test but gives us a warning first:

This is a post I’ve avoided writing for many years, and before I carry on let me make one thing clear:

Doing bulk extracts of data from a Power BI semantic model is a **really** bad idea

My colleague Matthew Roche wrote a great post on this topic a couple of years ago that is still relevant: using Power BI (or Analysis Services) as a data source for other systems, including other Power BI Import mode semantic models, is an anti-pattern. Power BI is optimised for small, analytical queries that return the amount of data that can be visualised on a single page. It is not optimised for queries that return millions of rows.

After the warning, Chris still gives a performance breakdown for extracting data from a semantic model in Excel, using automated MDX and DAX.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Curated SQL Posts