Press "Enter" to skip to content

Curated SQL Posts

Common Warehouse Load Patterns

Ben Johnston continues a series on warehouse load patterns:

This continues and finishes my two-part series on warehouse load patterns. There are many methods to transfer rows between systems from a basic design perspective. This isn’t specific to any ETL tool but rather covers the basic patterns for moving data. The most difficult part in designing a pattern is efficiency. It has to be accurate and not adversely impact the source system, but this is all intertwined with and dependent on efficiency. You only want to move the rows that have changed or been added since the previous ETL execution: the deltas. This reduces the network load, the source system load (I/O, CPU, locking, etc.), and the destination system load. Being efficient also improves speed and, as a direct result, increases the potential frequency of each ETL run, which has a direct impact on business value.

The pattern you select depends on many things. The previous part of the series covers generic design patterns and considerations for warehouse loads that can be applied to most of the ETL designs presented below. This section covers patterns I have used in various projects. I’m sure there are some patterns I have missed, but these cover the most used types that I have seen. These are not specific to any data engine or ETL tool, but the examples use SQL Server as a base for functionality considerations. Design considerations, columns available, administrative support, DevOps practices, reliability of systems, and cleanliness of data all come into consideration when choosing your actual ETL pattern.

Click through for a compendium of common patterns you can use to indicate that a row should go into a warehouse.
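
For a sense of what the simplest of these patterns looks like in practice, here is a minimal T-SQL sketch of a watermark-based delta load. It is not taken from Ben's post, and every table and column name here is hypothetical:

```sql
-- Hypothetical objects throughout: dbo.SourceOrders, dbo.StagingOrders, dbo.EtlWatermark.
-- One common delta pattern: pull only rows modified since the last successful load,
-- tracked by a high-water mark stored in a control table.

DECLARE @LastLoaded  datetime2(3),
        @CurrentLoad datetime2(3) = SYSUTCDATETIME();

SELECT @LastLoaded = LastLoadedAt
FROM dbo.EtlWatermark
WHERE TableName = N'dbo.SourceOrders';

-- Stage only the delta: rows added or changed since the previous run.
INSERT INTO dbo.StagingOrders (OrderId, CustomerId, Amount, ModifiedAt)
SELECT OrderId, CustomerId, Amount, ModifiedAt
FROM dbo.SourceOrders
WHERE ModifiedAt >  @LastLoaded
  AND ModifiedAt <= @CurrentLoad;

-- Advance the watermark only after the load succeeds.
UPDATE dbo.EtlWatermark
SET LastLoadedAt = @CurrentLoad
WHERE TableName = N'dbo.SourceOrders';
```

Ben's post covers several variations on this idea (and patterns for when you have no reliable modified-date column at all), so treat the above as the bare-bones case.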


A Focus on TRY_PARSE()

Andy Brownsword takes a closer look at TRY_PARSE():

In the previous post we looked at the functions TRY_CAST, TRY_CONVERT, and TRY_PARSE and how they compared. I wrapped up and said that my preference for new developments would be to use TRY_PARSE due to the tighter control which it provides us.

As with everything in SQL Server, however, there is no ‘best’ approach; it depends. I therefore wanted a separate post to look into the specifics of TRY_PARSE and the areas where it may work more or less effectively.

Andy hits both the good and the bad of TRY_PARSE() and I recommend checking out this post. It’s great for parsing one row or a small number of rows (call it 5-10K or so), and really bad at parsing large numbers of rows.
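
To make the “tighter control” point concrete, here is a small illustration (date strings and culture chosen purely for demonstration) of what TRY_PARSE() can do that TRY_CAST() cannot:

```sql
-- TRY_PARSE lets you name a culture explicitly, which TRY_CAST/TRY_CONVERT cannot do.
SELECT
    TRY_CAST('19/12/2023' AS date)                 AS via_try_cast,   -- NULL under default us_english settings
    TRY_PARSE('19/12/2023' AS date USING 'en-GB')  AS via_try_parse,  -- 2023-12-19
    TRY_PARSE('not a date' AS date USING 'en-GB')  AS bad_input;      -- NULL instead of an error
```

That flexibility comes from .NET parsing under the hood, which is the commonly cited reason TRY_PARSE() struggles once rowcounts climb well beyond the small batches Andy describes.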


Fabric Data Pipeline for Blob Storage CSV into Azure SQL DB

Andy Leonard loads some data:

In November 2023, I shared how to start learning Microsoft Fabric in a post titled Start a Fabric Free Trial. In December 2023, I shared how to Create a Workspace in Fabric. In this post, I document one way to create a pipeline to load data from a CSV file stored in Azure Blob Storage to Azure SQL Database in your new Fabric workspace.

Click through for some key assumptions, as well as the process.
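
If you want to follow along, the Copy activity needs somewhere to land the rows. Here is a minimal sketch of a hypothetical Azure SQL Database destination table and a post-load sanity check; the table and column names are my own, not Andy's, so adjust them to match your CSV:

```sql
-- Hypothetical destination table for the CSV load; columns should mirror your file.
CREATE TABLE dbo.StagedCsvRows
(
    Id       int           NOT NULL,
    [Name]   nvarchar(100) NULL,
    LoadedAt datetime2(3)  NOT NULL DEFAULT SYSUTCDATETIME()
);

-- After the pipeline runs, confirm rows actually arrived.
SELECT COUNT(*) AS RowsLoaded
FROM dbo.StagedCsvRows;
```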


Thoughts on Temp Tables in Stored Procedures

Erik Darling has preferences:

I probably overuse temp tables. I’m the first one to admit it. But often when I choose to use one, it’s a choice between:

  • Spending a long time trying to understand business logic for a company I don’t work for
  • Tinkering with query syntax and using really weird tricks and hints to get different plan shapes
  • Waiting a long time to create or tweak indexes on huge tables (usually on “dev” hardware)

Erik provides a good example of where a temp table fits well, contrasting it to a common table expression that fares poorly. My bias tends to come in the opposite direction from Erik’s: I’m susceptible to playing query golf and getting everything I need back in one optimized call. But part of craftsmanship is knowing how and when to use each tool.
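
As a rough illustration of the trade-off (not Erik's example; all object names here are invented), compare inlining the logic as a common table expression with materializing it into a temp table that can carry statistics and an index:

```sql
-- CTE version: the expression is inlined into the outer query, so the optimizer
-- re-derives it in place and can misestimate the size of the intermediate result.
WITH RecentOrders AS
(
    SELECT CustomerId, SUM(Amount) AS Total
    FROM dbo.Orders
    WHERE OrderDate >= DATEADD(DAY, -30, GETDATE())
    GROUP BY CustomerId
)
SELECT c.CustomerName, r.Total
FROM RecentOrders AS r
JOIN dbo.Customers AS c ON c.CustomerId = r.CustomerId;

-- Temp table version: materialize once, get statistics, optionally index it,
-- then join against a result of known size.
SELECT CustomerId, SUM(Amount) AS Total
INTO #RecentOrders
FROM dbo.Orders
WHERE OrderDate >= DATEADD(DAY, -30, GETDATE())
GROUP BY CustomerId;

CREATE CLUSTERED INDEX cx_RecentOrders ON #RecentOrders (CustomerId);

SELECT c.CustomerName, r.Total
FROM #RecentOrders AS r
JOIN dbo.Customers AS c ON c.CustomerId = r.CustomerId;
```

The temp table version costs an extra write, which is exactly the judgment call Erik is describing.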


Plotting Time Series in R

Steven Sanderson builds some charts:

Our Flight Plan:

  1. Loading Up with Data: Grabbing our trusty dataset, AirPassengers.
  2. Taking Off with Base R: Creating a basic time series plot using base R functions.
  3. Soaring with ggplot2: Crafting a visually stunning time series plot using the ggplot2 library.
  4. Navigating Date Formatting: Customizing axis labels with scale_x_date() for clarity.
  5. Landing with Your Own Exploration: Encouraging you to take the controls and create your own time series plots!

Click through to see each of these steps in action.


Advent of Code Day 8 T-SQL Solution

Kevin Wilkie continues a series on this year’s advent of code. Day 8:

To make life a little simpler for all of us, I used a variable that I just pasted my instructions into. In my case, I called it @FollowThis, allowing it to be up to 300 characters in length. Hopefully no one’s instruction list is over that length; if so, you’ll need to make that adjustment.

I also created a variable called @BackUpOfFollowThis that is strictly a copy of the original @FollowThis variable. (No muss, no fuss)

Once I place all of my inputs into the dbo.AOCDay8 table, we can easily start slicing and dicing.

And the more complex form for day 8:

First, we find out that we’re now starting with multiple positions – anything that ends with an A – and finishing anywhere that ends in a Z.

To make this happen, we’re going to want to create another table that will have all of our positions that end with the letter A.

Click through for Kevin’s solutions.
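
Based on Kevin's description, the setup amounts to something like the sketch below; the column layout of dbo.AOCDay8 is my guess at the puzzle's node map, not Kevin's actual schema:

```sql
-- The instruction string in a variable, a backup copy, and the node map in a table,
-- as Kevin describes. Column names are illustrative only.
DECLARE @FollowThis varchar(300) = 'LRLRLRR';   -- paste your puzzle's L/R instructions here
DECLARE @BackUpOfFollowThis varchar(300) = @FollowThis;

CREATE TABLE dbo.AOCDay8
(
    Position     char(3) NOT NULL,   -- e.g. AAA
    LeftElement  char(3) NOT NULL,
    RightElement char(3) NOT NULL
);
```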


The Value of Data Lineage

Chisom Kanu explains why data lineage matters:

Data lineage is a component of modern data management that helps organizations understand the origins, transformations, and movement of their data. It is like a road map that shows us where our data has been, how it has changed, and where it is going, much like tracking the journey of a package: from the person who sent it (the source), through the places it passes, and finally to the person who receives it.

The concept of data lineage has been around for many years, but it has become increasingly important in recent years due to the growth of big data and the increasing complexity of data processing systems.

Read on to learn more about data lineage.


DAX in Review: 2023

Marco Russo gives us a review:

Despite the large number of new functions driven by the new INFO functions, we did not see big changes in DAX, but rather a consolidation of the new window functions.

After three years, in 2023 composite models reached general availability! More importantly, the official name is now “composite models”, dropping the previous “DirectQuery for Power BI datasets and Analysis Services” name used in preview. Add to this that “datasets” have been renamed to “semantic models” (a change we fully support at SQLBI), and it seems this year we made peace with the Microsoft naming departments (there should be more than one).

Click through for more on DAX, as well as what Marco and Alberto have been up to this year.


Monitoring Checkpoints and the Background Writer in Postgres

Muhammad Ali keeps an eye on things:

In PostgreSQL, a checkpoint is an operation during which the database flushes/syncs all pending modifications (dirty buffers) in memory to the actual data files on disk.

This is important for two primary reasons. Firstly, it guarantees that all committed transactions are permanently stored, thereby safeguarding against data loss in the event of a system failure. Secondly, it works closely with the database recovery mechanism. If a crash occurs, PostgreSQL begins processing WAL logs starting from the last successful checkpoint (it gets this information from the pg_control file located in the PG data directory) during recovery. Additionally, this process allows for the fine-tuning of performance through a variety of parameters, adaptable to specific workload requirements, which are discussed below.

Read on to learn more about how checkpoints work in Postgres, how the background writer works, and things to keep in mind.
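
If you want to watch these counters yourself, the pg_stat_bgwriter view is the usual starting point (on PostgreSQL 16 and earlier; PostgreSQL 17 moves the checkpoint columns into pg_stat_checkpointer):

```sql
-- Cumulative checkpoint and background writer activity since the last stats reset.
SELECT checkpoints_timed,    -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,      -- checkpoints forced early, e.g. by WAL volume
       buffers_checkpoint,   -- dirty buffers written by the checkpointer
       buffers_clean,        -- dirty buffers written by the background writer
       maxwritten_clean,     -- times the bgwriter stopped at bgwriter_lru_maxpages
       buffers_backend,      -- dirty buffers backends had to write themselves
       stats_reset
FROM   pg_stat_bgwriter;
```

A checkpoints_req count that dwarfs checkpoints_timed is commonly read as a sign that max_wal_size is too small for the write workload, which is one of the tuning points Muhammad walks through.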


Dynamic SQL in Stored Procedures

Erik Darling gets serious:

I’ve spent a lot of time on this blog telling you different reasons why you should, when you should, and how you should use dynamic SQL.

In this post, I’m going to go into some of the finer points of how I approach dynamic SQL to avoid issues — not performance issues — more procedural issues.

Things you should use to protect yourself from wonky object names, string truncation, object identification, and more.

Read on for Erik’s tips for handling dynamic SQL.
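
To give a flavor of the defensive habits Erik is talking about, here is a minimal sketch (procedure, table, and column names are mine, not his): nvarchar(max) to dodge string truncation, OBJECT_ID() plus QUOTENAME() to vet and quote object names, and sp_executesql for value parameters.

```sql
CREATE OR ALTER PROCEDURE dbo.SearchTable
    @SchemaName sysname,
    @TableName  sysname,
    @MinimumId  int
AS
BEGIN
    -- nvarchar(max) avoids silent truncation as the statement grows.
    DECLARE @sql nvarchar(max);

    -- Refuse names that do not resolve to a real object.
    IF OBJECT_ID(QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName)) IS NULL
    BEGIN
        RAISERROR(N'Unknown table.', 16, 1);
        RETURN;
    END;

    -- QUOTENAME protects against wonky object names; the Id column is illustrative only.
    SET @sql = N'SELECT * FROM ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName)
             + N' WHERE Id >= @MinimumId;';

    -- Values go in as real parameters, never concatenated into the string.
    EXECUTE sys.sp_executesql @sql, N'@MinimumId int', @MinimumId = @MinimumId;
END;
```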
