
Day: July 8, 2024

Parquet Files in Pandas

Chris LaGreca works with Parquet files:

Apache Parquet has become one of the de facto standards in modern data architecture. This open-source, columnar data format serves as the backbone of many high-powered analytics and machine learning pipelines, supported by many of the world’s most sophisticated platforms and services. AWS, Azure, and Google Cloud all offer built-in support for Parquet, while big data tools like Hadoop, Spark, Hive, and Databricks natively support Parquet, allowing seamless data processing and analytics. Parquet is also foundational in data lakehouse formats like Delta Lake, Iceberg, and Hudi, where its features are further enhanced.

Parquet is efficient and has broad industry support. In this post, I will showcase a few simple techniques to demonstrate working with Parquet and leveraging its special features using Pandas.

Pandas does make this rather easy, as Chris shows.
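As a rough illustration of the kind of thing the post covers (this is a minimal sketch, not Chris's code), the snippet below writes a DataFrame to Parquet and reads back only selected columns, which is where the columnar layout pays off. The `sales` frame, the `round_trip` helper, and the column names are made up for the example, and it assumes a Parquet engine such as pyarrow or fastparquet is installed alongside Pandas.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
sales = pd.DataFrame(
    {
        "region": ["East", "West", "East"],
        "amount": [100.0, 250.5, 75.25],
        "units": [1, 2, 3],
    }
)


def round_trip(df, path, columns):
    """Write df to a Parquet file, then read back only the given columns."""
    # to_parquet needs a Parquet engine (pyarrow or fastparquet) to be installed.
    df.to_parquet(path, index=False)
    # Because Parquet is columnar, columns that are not requested are not loaded.
    return pd.read_parquet(path, columns=columns)
```

Calling `round_trip(sales, "sales.parquet", ["region", "amount"])` should return a frame with just those two columns, leaving `units` unread on disk.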


Have a Recovery Strategy

Aaron Bertrand has a public service announcement:

I’ve talked about it before; you shouldn’t have a backup strategy, you should have a recovery strategy. I can’t possibly care if my backups succeed if I’m not bothering to test that they can be restored. And if they can’t be restored then, both technically and practically, I don’t have backups.

In one of the systems I manage, they built a very simple “test restore” process long before I became involved. Every night, it would pull the full backup for each database, restore it on a test system, and run DBCC CHECKDB against it. It would alert on any failure, of course, but the primary purpose was to always be confident that the backups could, in fact, be restored.

Aaron now has a much more robust version of this in place, which you can see in the article.
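For orientation, the simple nightly process described in the excerpt (restore the full backup on a test system, then run DBCC CHECKDB) might be sketched as below. This is not Aaron's implementation: the `build_test_restore_batch` helper, the `_restore_test` suffix, and the example path are assumptions made for illustration, and a real restore onto a different server would usually also need `MOVE` clauses for the data and log files, which are left out here.

```python
def build_test_restore_batch(database, backup_path):
    """Return the T-SQL statements for one test restore plus integrity check.

    Hypothetical helper: it only builds strings and does not connect to a server.
    """
    # Restore under a different name so the check never touches a live database.
    restored = f"{database}_restore_test"
    # Escape delimiters so unusual names or paths do not break the statements.
    safe_name = restored.replace("]", "]]")
    safe_path = backup_path.replace("'", "''")
    return [
        f"RESTORE DATABASE [{safe_name}] FROM DISK = N'{safe_path}' WITH REPLACE;",
        f"DBCC CHECKDB ([{safe_name}]) WITH NO_INFOMSGS, ALL_ERRORMSGS;",
    ]
```

A scheduler would run these two statements per database each night and alert if either one raises an error, for example with `build_test_restore_batch("Sales", r"D:\backups\Sales_full.bak")`.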


Cross-Workspace Data Transfer in Microsoft Fabric

Reitse Eskens moves some data around:

When you open Fabric, the first thing you need to do is choose a so-called workspace. This serves as a container for all your Fabric items. You can have one or more workspaces and the design is entirely up to you, from one workspace to rule them all to one workspace for each set of items (Lakehouse, Warehouse, Semantic model and Report, Pipeline, Notebook, etc.). Until yesterday (the day this blog post came online) it was impossible to use a pipeline to get data across different workspaces.

You could work around it with tricks like shortcuts, but it feels more natural (or maybe I’m just old ;)) to be able to read data from workspace 1 and write it into workspace 2. So let’s see how this works and where capacity is used!

Click through to see it in action.


Merge Join vs Hash Join in Postgres

Andrei Lepikhov compares two physical join operators:

Today’s post is sparked by a puzzling observation: users, especially those who use an abstraction layer like a REST API or an ORM library to interact with databases, frequently disable the MergeJoin option across the entire database instance. They justify this action by citing numerous instances of performance degradation.

Considering how many interesting execution paths MergeJoin adds to the system, such as those built on IncrementalSort or on sort orderings derived from an underlying IndexScan, this looks strange: is it one more bug of skewed cost balance inside the PostgreSQL cost model?

This is an interesting peek into how complex the query optimizers in database engines are, as well as how small amounts of information (via statistics or indexes) can matter to a query.
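To make the two operators concrete, here is a minimal sketch of each strategy over in-memory lists of `(key, value)` pairs. It is a textbook illustration, not PostgreSQL's implementation, and the function names and sample rows are invented for the example. The merge join assumes both inputs arrive already sorted by key, which is why an index scan or an existing sort order can make it cheap, while the hash join needs no ordering but must build a table over one input first.

```python
from collections import defaultdict


def hash_join(left, right):
    """Inner join: build a hash table on right, then probe it with each left row."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in table.get(key, [])]


def merge_join(left, right):
    """Inner join of two inputs that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row carrying this key with the whole right run.
            while i < len(left) and left[i][0] == left_key:
                for k in range(j, j_end):
                    out.append((left_key, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out


# Hypothetical sample rows, both already sorted by key.
orders = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
payments = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
```

On these sorted inputs both functions return the same rows; the difference a planner has to price is the cost of sorting (or of already having sorted input) versus the cost of building and probing the hash table.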


Antipattern: DAX Measures Never Returning Blank

Chris Webb explains the value of BLANK:

Following on from my earlier post on the Query Memory Limit in Power BI, and as a companion to last week’s post on how a DAX antipattern using Calculate() and Filter() can lead to excessive memory consumption by queries (and therefore lead to you hitting the Query Memory Limit), in this post I want to look at the effects of another DAX antipattern on performance and memory usage: measures that can never return a blank value.

Read on to see how much of a difference using DAX to fill a grid with 0’s can make.


Limiting Jobs to the Primary Replica of an AG

Chad Callihan doesn’t want jobs running willy-nilly:

Transitioning from a failover cluster configuration to an Availability Group configuration brings with it all kinds of “fun” challenges. One such challenge that you may not have considered is the handling of jobs on whatever server is Primary, along with secondary servers. Let’s briefly discuss a potential challenge and an option to address it.

Click through for the example and a solution. Eitan Blumin has another solution in the comments, so check that one as well.
