Curated SQL – Page 277 – A Fine Slice Of SQL Server

Parquet Files in Pandas

Published 2024-07-08 by Kevin Feasel

Apache Parquet has become one of the de facto standards in modern data architecture. This open-source, columnar data format serves as the backbone of many high-powered analytics and machine learning pipelines, supported by many of the world's most sophisticated platforms and services. AWS, Azure, and Google Cloud all offer built-in support for Parquet, while big data tools like Hadoop, Spark, Hive, and Databricks natively support Parquet, allowing seamless data processing and analytics. Parquet is also foundational in data lakehouse formats like Delta Lake, Iceberg, and Hudi, where its features are further enhanced.

Parquet is efficient and has broad industry support. In this post, I will showcase a few simple techniques to demonstrate working with Parquet and leveraging its special features using Pandas.

Pandas does make this rather easy, as Chris shows.

Have a Recovery Strategy

Published 2024-07-08 by Kevin Feasel

Aaron Bertrand has a public service announcement:

I’ve talked about it before; you shouldn’t have a backup strategy, you should have a recovery strategy. I can’t possibly care if my backups succeed if I’m not bothering to test that they can be restored. And if they can’t be restored then, both technically and practically, I don’t have backups.

In one of the systems I manage, they built a very simple “test restore” process long before I became involved. Every night, it would pull the full backup for each database, restore it on a test system, and run DBCC CHECKDB against it. It would alert on any failure, of course, but the primary purpose was to always be confident that the backups could, in fact, be restored.
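
The nightly loop described above boils down to two statements per database: a restore and an integrity check. The sketch below only builds the T-SQL text and does not connect to anything. The function name, the `_restore_test` suffix, and the backup path are hypothetical, and this is not Aaron's actual process.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(value: str) -> str:
    """Build an N'...' string literal, doubling any embedded single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def build_test_restore(
    database: str, backup_path: str, suffix: str = "_restore_test"
) -> List[str]:
    """Return the restore statement and the integrity check for one database."""
    target = quote_ident(database + suffix)
    return [
        f"RESTORE DATABASE {target} FROM DISK = {quote_literal(backup_path)} "
        "WITH REPLACE, RECOVERY;",
        f"DBCC CHECKDB ({target}) WITH NO_INFOMSGS, ALL_ERRORMSGS;",
    ]


# Hypothetical database name and backup location.
statements = build_test_restore("Sales", r"D:\Backups\Sales_full.bak")
for statement in statements:
    print(statement)
```

A real job on a separate test server would usually also need MOVE clauses to relocate the data and log files, plus alerting when either statement fails.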

Aaron now has a much more robust version of this in place, which you can see in the article.

Cross-Workspace Data Transfer in Microsoft Fabric

Published 2024-07-08 by Kevin Feasel

Reitse Eskens moves some data around:

When you open Fabric, the first thing you need to do is choose a so-called workspace. This serves as a container for all your Fabric items. You can have one or more workspaces and the design is entirely up to you, ranging from one workspace to rule them all to one workspace for each set of items (Lakehouse, Warehouse, Semantic model and Report, Pipeline, Notebook, etc.). Until yesterday (the day this blogpost came online) it was impossible to use a pipeline to get data across different workspaces.

You could work around it with tricks like shortcuts, but it feels more natural (or maybe I’m just old ;)) to be able to read data from workspace 1 and write it into workspace 2.
So let’s see how this works and where capacity is used!

Click through to see it in action.

Merge Join vs Hash Join in Postgres

Published 2024-07-08 by Kevin Feasel

Andrei Lepikhov compares two physical join operators:

Today’s post is sparked by a puzzling observation: users, especially those who use an abstraction layer like REST or ORM library to interact with databases, frequently disable the MergeJoin option across the entire database instance. They justify this action by citing numerous instances of performance degradation.

Considering how many interesting execution paths MergeJoin adds to the system by making use of IncrementalSort or sort orderings derived from an underlying IndexScan, this looks strange: is it one more bug of skewed cost balance inside the PostgreSQL cost model?

This is an interesting peek into how complex the query optimizers in database engines are, as well as how small amounts of information (via statistics or indexes) can matter to a query.
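
For readers who want a refresher on the two operators, here is a textbook-style sketch of each over lists of `(key, value)` pairs. It illustrates the general algorithms only and makes no claim about how PostgreSQL implements or costs them. The function names and sample rows are made up.

```python
from collections import defaultdict


def hash_join(left, right):
    """Build a hash table on the right input, then probe it with each left row."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in table.get(key, [])]


def merge_join(left, right):
    """Join two inputs that are ALREADY sorted by key, advancing in lockstep."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            for li in range(i, i_end):
                for rj in range(j, j_end):
                    result.append((left_key, left[li][1], right[rj][1]))
            i, j = i_end, j_end
    return result


# Hypothetical rows, sorted by key so that both operators can run on them.
left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right_rows = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]

print(hash_join(left_rows, right_rows))
print(merge_join(left_rows, right_rows))
```

The merge join needs sorted input, which is cheap when an index already provides the order and expensive when an explicit sort is required. That dependency is one reason a planner's cost estimate for it can swing so much with small changes in available information.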

Antipattern: DAX Measures Never Returning Blank

Published 2024-07-08 by Kevin Feasel

Chris Webb explains the value of BLANK:

Following on from my earlier post on the Query Memory Limit in Power BI, and as a companion to last week’s post on how a DAX antipattern using Calculate() and Filter() can lead to excessive memory consumption by queries (and therefore lead to you hitting the Query Memory Limit), in this post I want to look at the effects of another DAX antipattern on performance and memory usage: measures that can never return a blank value.

Read on to see how much of a difference using DAX to fill a grid with 0’s can make.

Limiting Jobs to the Primary Replica of an AG

Published 2024-07-08 by Kevin Feasel

Chad Callihan doesn’t want jobs running willy-nilly:

Transitioning from a failover cluster configuration to an Availability Group configuration brings with it all kinds of “fun” challenges. One such challenge that you may not have considered is the handling of jobs on whatever server is Primary, along with secondary servers. Let’s briefly discuss a potential challenge and an option to address it.

Click through for the example and a solution. Eitan Blumin has another solution in the comments, so check that one as well.
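
One common way to handle this, and not necessarily the approach Chad or Eitan takes, is to wrap each job step in a check of `sys.fn_hadr_is_primary_replica`, which returns 1 when the current instance hosts the primary replica of the named database. The sketch below only builds the guarded T-SQL text. The helper name, database name, and procedure name are hypothetical.

```python
def guard_job_step(database: str, step_sql: str) -> str:
    """Wrap a job step so that it only runs on the primary replica."""
    db_literal = "N'" + database.replace("'", "''") + "'"
    return (
        f"IF sys.fn_hadr_is_primary_replica({db_literal}) = 1\n"
        "BEGIN\n"
        f"    {step_sql}\n"
        "END;"
    )


# Hypothetical database and stored procedure names.
guarded = guard_job_step("Sales", "EXEC dbo.NightlyRollup;")
print(guarded)
```

With a guard like this, the same job can be deployed to every replica and simply does nothing on the secondaries.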

Modifying Column Return Order in sp_QuickieStore

Published 2024-07-05 by Kevin Feasel

Josephine Bush demands order:

I love QuickieStore, but I wanted some columns to be at the front end of the results returned. Namely, I wanted top_waits, query_sql_text, and query_plan right after database name. This way I don’t have to scroll over to see those values.

Unfortunately, it would appear that there’s no advanced functionality for column ordering like we have for sp_whoisactive. But that didn’t deter Josephine, and you can grab a copy of an updated script that includes columns in this different arrangement.

Dealing with Query Store in Error State

Published 2024-07-05 by Kevin Feasel

David Fowler turns it off then back on again, like a true IT professional:

I recently received a complaint that Query Store for a particular database was turned off, which was strange as that particular database has seen quite a few performance issues and I know that I’d ensured Query Store was enabled in the past.

No problem, I flicked the switch and Query Store was enabled again.

Half an hour or so later and I’m being told that Query Store is again disabled. What’s going on?

Read on to learn what to do if you get stuck with this problem.

Real-Time Intelligence in Microsoft Fabric

Published 2024-07-05 by Kevin Feasel

Dennes Torres takes a peek at a service with a new name:

When everyone starts to announce Real-Time Intelligence in Microsoft Fabric as something new, I need to double check what’s happening: Am I crazy or is everyone else? Wasn’t this already there?

Finally, I realize that Real-Time Intelligence is a new name for Real-Time Analytics, and they are doing this so fast we don’t even have time to notice the difference.

What’s Real-Time Intelligence and what’s the difference from Real-Time Analytics?

Read on for those answers.

Performance Testing Microsoft Fabric Dataflow Gen2

Published 2024-07-05 by Kevin Feasel

Reitse Eskens hammers away:

In my previous blogs, I’ve been hammering Fabric with data from some different angles: with the Copy dataflows, notebooks, Pipelines, Data Warehouse SQL scripts, or in Power BI.
This time, I’m going to make the dataflow Gen2 work for its money.

Reitse tries the normal mechanism for Dataflows Gen2, but then also tries out a preview feature for fast copy and sees a marked difference.
