Author: Kevin Feasel

Migrating from Apache Airflow 2 to 3 on Amazon MWAA

Anurag Srivastava, et al., perform a migration:

Apache Airflow 3.x on Amazon MWAA introduces architectural improvements such as API-based task execution that provides enhanced security and isolation. Other major updates include a redesigned UI for better user experience, scheduler-based backfills for improved performance, and support for Python 3.12. Unlike in-place minor Airflow version upgrades in Amazon MWAA, upgrading to Airflow 3 from Airflow 2 requires careful planning and execution through a migration approach due to fundamental breaking changes.

This migration presents an opportunity to embrace next-generation workflow orchestration capabilities while providing business continuity. However, it’s more than a simple upgrade. Organizations migrating to Airflow 3.x on Amazon MWAA must understand key breaking changes, including the removal of direct metadata database access from workers, deprecation of SubDAGs, changes to default scheduling behavior, and library dependency updates. This post provides best practices and a streamlined approach to successfully navigate this critical migration, providing minimal disruption to your mission-critical data pipelines while maximizing the enhanced capabilities of Airflow 3.

Read on to see what has changed between these two major versions of Airflow, recommendations on what to look out for, and a step-by-step migration guide.
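
One way to start sizing the work is a rough static scan of DAG files for two of the breaking changes named above: SubDAGs and direct metadata database access. This is a minimal sketch, assuming the listed patterns are useful heuristics. The pattern list and the `scan_dag_source` helper are illustrative only and are not part of Airflow or Amazon MWAA tooling; a match only marks a line worth reviewing.

```python
import re

# Heuristic patterns (assumptions for illustration, not an exhaustive list).
PATTERNS = {
    "SubDAG usage": re.compile(r"\bSubDagOperator\b"),
    "direct metadata DB access": re.compile(
        r"\b(provide_session|create_session)\b|\bairflow\.settings\b.*\bSession\b"
    ),
}


def scan_dag_source(source: str) -> list[tuple[int, str]]:
    """Return (line_number, finding) pairs for lines that match a pattern."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


if __name__ == "__main__":
    sample = """\
from airflow.operators.subdag import SubDagOperator
from airflow.utils.session import provide_session
from airflow.operators.python import PythonOperator
"""
    for number, label in scan_dag_source(sample):
        print(f"line {number}: {label}")
```

A scan like this cannot see dynamic imports or indirect database access, so it complements the migration guide's checklist rather than replacing it.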

Resolving Write Conflicts in Microsoft Fabric Data Warehouse

Twinkle Cyril has a conflict:

Fabric Data Warehouse (DW) supports ACID-compliant transactions using standard T-SQL (BEGIN TRANSACTION, COMMIT, ROLLBACK) and uses Snapshot Isolation (SI) as its exclusive concurrency control model. All operations within a transaction are treated atomically: either all succeed or all fail. This ensures that each transaction operates on a consistent snapshot of the data as it existed at the start of the transaction, which means...

Read on to see what this means, as well as what happens when multiple writers interfere with one another and how to avoid these sorts of issues. My Kimball-coded brain says that, if you have a data warehouse, you should have one data loading process. In that case, it’s not easy for the single data loading process to get tripped up on its own.
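
To make the write-conflict idea concrete, here is a toy in-memory model of snapshot isolation with a first-committer-wins rule. It is a sketch under stated assumptions: the `Store` and `Transaction` classes are hypothetical, conflicts are tracked per key, and nothing here reflects how Fabric Data Warehouse actually detects conflicts or at what granularity.

```python
class WriteConflict(Exception):
    """Raised when another transaction committed a conflicting write first."""


class Store:
    def __init__(self):
        self.data = {}       # key -> value
        self.versions = {}   # key -> commit counter of the last write
        self.commit_counter = 0

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.start = store.commit_counter   # snapshot point
        self.snapshot = dict(store.data)    # reads see data as of start
        self.writes = {}

    def read(self, key):
        return self.writes.get(key, self.snapshot.get(key))

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Fail if any key we wrote was committed by someone else after we started.
        for key in self.writes:
            if self.store.versions.get(key, 0) > self.start:
                raise WriteConflict(key)
        self.store.commit_counter += 1
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.versions[key] = self.store.commit_counter


if __name__ == "__main__":
    store = Store()
    t1, t2 = store.begin(), store.begin()
    t1.write("row", "from t1")
    t2.write("row", "from t2")
    t1.commit()
    try:
        t2.commit()
    except WriteConflict as exc:
        print(f"t2 rolled back: conflict on {exc}")
```

With a single loading process there is only ever one open writer, so in this model the conflict branch in `commit` is never reached.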

Installing DBeaver and Connecting to Postgres

Garry Bargsley tries out DBeaver:

Whether you’re a seasoned DBA or just exploring database tools, DBeaver offers a powerful, cross-platform GUI for interacting with PostgreSQL and many other databases. As a continuation of the previous blog post on installing PostgreSQL, this guide will walk through installing DBeaver and setting up a connection to the PostgreSQL instance we created.

My biggest takeaway the last time I used DBeaver was that SQL Server has a great thing going with SSMS. In fairness, that was a while ago, and things could very well have improved in the meantime. Also, if you have to connect to a variety of data platforms, DBeaver is a pretty solid choice.

Explaining Totals in Power BI

Soheil Bakhshi performs a comparison:

The long-running debate around how Power BI calculates totals in tables and matrices has been part of the community conversation for years. Greg Deckler has kept the topic alive through his ongoing “broken totals” posts on social media, often suggesting that Power BI should include a simple toggle to make totals behave more like Excel. His continued campaign prompted a detailed reply from Daniel Otykier in his article No More Measure Totals Shenanigans, and earlier, Diego Scalioni explored how DAX evaluates totals internally in his post Cache me if you can: DAX Totals behind the scenes.

This blog brings all those perspectives together from a scientific and comparative angle. It looks at how totals are calculated in Power BI and compares that behaviour with Tableau, Excel, Paginated Reports, and even T-SQL. The goal is not to take sides, but to clear up the confusion around what is happening under the hood.

This is a very detailed and dispassionate explanation that helps make sense of the debate.
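
A small illustration of why a total is not always the sum of the visible rows: a measure such as a distinct count is evaluated again over all rows for the total line rather than added up. The sketch below uses plain Python with made-up sales data. It mimics the idea only and is not DAX or any tool's actual engine.

```python
from collections import defaultdict

# Made-up data: (region, customer)
sales = [
    ("East", "alice"),
    ("East", "bob"),
    ("West", "alice"),
    ("West", "carol"),
]


def distinct_customers(rows):
    """The 'measure': count distinct customers in whatever rows are in scope."""
    return len({customer for _, customer in rows})


by_region = defaultdict(list)
for region, customer in sales:
    by_region[region].append((region, customer))

per_row = {region: distinct_customers(rows) for region, rows in by_region.items()}
sum_of_rows = sum(per_row.values())        # adds the visible row values
measure_total = distinct_customers(sales)  # re-evaluates the measure over all rows

print(per_row, sum_of_rows, measure_total)
```

Here each region has two distinct customers, yet the measure total is three rather than four because one customer buys in both regions. Which of the two numbers counts as "right" is exactly what the debate is about.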

Static and Dynamic Bulk Insert into SQL Server

Rick Dobson inserts some data:

There are numerous use cases for multi-file imports of CSV files into a SQL Server table:

  • Dynamic SQL Server bulk insert loads are especially appropriate for tasks that extract content from multiple files to a SQL Server table where the source file names change between successive import jobs.
  • Static bulk insert loads target scenarios where the source file names do not change between successive import jobs.

Read on for examples of how to implement each. Admittedly, bulk insert has rarely worked all that well in my experience, whether due to permissions mishaps, poor data integrity, or sudden changes in data types between file runs. But it does tend to work a lot better if you have a specified data interchange format and a standardized process to prepare the data and make it available on disk for insertion.
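
The difference between the two approaches can be sketched as a statement generator: the static case hard-codes a file path, while the dynamic case discovers the files at run time and builds one BULK INSERT per file. This is a minimal sketch, not the article's code. The table name, folder, and the `build_bulk_insert` and `build_dynamic_load` helpers are hypothetical, and the WITH options assume comma-delimited files with a header row.

```python
from pathlib import Path


def build_bulk_insert(table: str, csv_path: str) -> str:
    """Build a T-SQL BULK INSERT statement for one CSV file.

    The table name is interpolated as-is, so it must come from trusted code.
    """
    escaped = csv_path.replace("'", "''")  # escape single quotes for T-SQL
    return (
        f"BULK INSERT {table} FROM '{escaped}' "
        "WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');"
    )


def build_dynamic_load(table: str, folder: str) -> list[str]:
    """One statement per CSV file currently in the folder, in name order."""
    return [
        build_bulk_insert(table, str(path))
        for path in sorted(Path(folder).glob("*.csv"))
    ]


if __name__ == "__main__":
    # Static: the file name is known ahead of time (hypothetical path).
    print(build_bulk_insert("dbo.Prices", r"C:\data\prices.csv"))
    # Dynamic: whatever files are present when the job runs.
    for statement in build_dynamic_load("dbo.Prices", "."):
        print(statement)
```

Keep in mind that BULK INSERT resolves the path on the SQL Server side, so the folder scanned here would need to be one the server can also reach under the same path.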

Creating a Python Package via Poetry

Osheen MacOscar builds a package:

In this blog series (this and the next blog) I am going to demonstrate how to use Poetry to create a Python package, set up testing infrastructure and install it. I am going to be creating a wrapper around the Fantasy Premier League API and creating a function which can create a weekly league table.

This is a straightforward example of how to create a new Python package and add a function call to it.
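
As a rough idea of the kind of function such a package might expose, here is a sketch of a weekly league table builder with a test in the style pytest would collect. Everything here is assumed for illustration: the `weekly_league_table` name, the input shape, and the field names are hypothetical and are not taken from the post or from the Fantasy Premier League API.

```python
def weekly_league_table(entries: list[dict]) -> list[dict]:
    """Rank entries by points for one gameweek, highest first.

    Each entry is a dict with hypothetical keys 'manager' and 'points'.
    Entries with equal points share the rank of the first tied entry.
    """
    ordered = sorted(entries, key=lambda e: e["points"], reverse=True)
    table = []
    for position, entry in enumerate(ordered, start=1):
        if table and table[-1]["points"] == entry["points"]:
            rank = table[-1]["rank"]
        else:
            rank = position
        table.append({"rank": rank, **entry})
    return table


def test_weekly_league_table_orders_by_points():
    table = weekly_league_table(
        [
            {"manager": "A", "points": 50},
            {"manager": "B", "points": 72},
            {"manager": "C", "points": 50},
        ]
    )
    assert [row["manager"] for row in table] == ["B", "A", "C"]
    assert [row["rank"] for row in table] == [1, 2, 2]
```

In a Poetry project with pytest added as a development dependency, a test like this would typically be run with `poetry run pytest`.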

Fun with SQL Firewall in Oracle

Brendan Tierney follows up on a SQL Firewall post:

In a previous post, we’ve explored some of the core functionality of SQL Firewall in Oracle 23ai. In this post I’ll explore some of the other functionality that I’ve had to use as we’ve deployed SQL Firewall over the past few weeks.

Sometimes, when querying the DBA_SQL_FIREWALL_VIOLATIONS view, you might not get the current up-to-date violations, or if you are running it for the first time you might get no rows or violations being returned from the view. This is a slight timing issue, as the violations log/cache might not have been persisted to the data dictionary. If you end up in this kind of situation you might need to flush the logs to the data dictionary. To do this, run the following.

Click through for that command, as well as a few other scenarios and commands that may be of interest.

Locks in Microsoft Fabric Data Warehouse

Twinkle Cyril enumerates the lock types in Fabric Data Warehouse:

Fabric DW supports ACID-compliant transactions using standard T-SQL (BEGIN TRANSACTION, COMMIT, ROLLBACK) and enforces snapshot isolation across all operations. Locks in Fabric Data Warehouse are used to manage concurrent access to metadata and data, especially during DDL operations. Here’s how locking works:

Click through for a chart. The locking policy is a lot simpler than what we see in SQL Server, and the post describes the pros and cons of that simpler approach.

The Intricacies of COUNT()

Louis Davidson can easily get to 20:

I was reading LinkedIn posts the other day when I saw this blog about what was apparently an interview question about some forms of a COUNT aggregate function.

This was apparently asked in an interview. What will each of these constructs do in a SQL statement:

COUNT(*) = ?
COUNT(1) = ?
COUNT(-1) = ?
COUNT(column) = ?
COUNT(NULL) = ?
COUNT() = ?

There’s one tricky bit in this set. Louis then takes it a bit further with CASE expressions and variables, so check out the post for the answers as well as those additional examples in T-SQL.
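
For readers who want to experiment before reading the answers, several of these forms can be tried against an in-memory SQLite database from Python. This is a sketch with caveats: SQLite is not SQL Server, the table and data are made up, and results (and even whether a form is valid) can differ by engine. `COUNT()` with no argument is deliberately left out for that reason.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")
# Three rows, one of which holds NULL in the counted column.
conn.executemany("INSERT INTO t (col) VALUES (?)", [(1,), (None,), (3,)])

expressions = ["COUNT(*)", "COUNT(1)", "COUNT(-1)", "COUNT(col)", "COUNT(NULL)"]
results = {
    expr: conn.execute(f"SELECT {expr} FROM t").fetchone()[0]
    for expr in expressions
}
for expr, value in results.items():
    print(f"{expr} = {value}")
conn.close()
```

Comparing this output with what SQL Server does for the same expressions is a good way to find the tricky bit.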
