Press "Enter" to skip to content

Category: Warehousing

Landing Zone Layouts for Modern Data Warehouses

Paul Hernandez builds out a landing zone for a warehouse:

In this article I want to discuss some different layout options for a landing zone in a modern cloud data warehouse architecture. With landing zone, I mean a storage account where raw data lands directly from its source system (not to be confused with a landing zone to move a system or application into the cloud).

One of the things I appreciate a lot about this post is that it covers the history, showing us how we got to where we are. Paul’s well-versed in each step along the way and lays things out clearly.

Comments closed

End of Month in Snowflake and SQL Server

Kevin Wilkie is ready for that end-of-month paycheck:

When you work with data, you’ll probably need to work with dates at least once a month. That is the nature of the beast. Today, let’s compare working with them in SQL Server and Snowflake. I want to focus only on adding and subtracting months when provided with a specific day.

Along the way, I would also push for a calendar table, so that you can remove some of the more difficult (or even most common) date calculations.

Comments closed

Object Tagging in Snowflake

Warner Chaves tags a table:

A tag is a user-defined label that can be attached to a Snowflake object, such as a database, table, or column. Tags can categorize objects based on any criteria that you choose, such as sensitivity, business unit, project, or owner. Once tags have been applied, you can use them to control access to the tagged objects, track usage and costs, and apply policies and rules.

Now let’s apply tagging to a specific use case: identifying sensitive customer data. For example, let’s assume that you have a table in Snowflake called “customers” that contains customer information, including their addresses. We want to categorize the “address” column as sensitive so that we can apply data protection policies and controls.

Click through for a few examples of how to create tags, apply tags to database objects, and review tagged objects.

Comments closed

Type 2 Dimension Loading in Redshift

Vaidy Kalpathy takes us through a bit of dimensional modeling in AWS Redshift:

Populating an SCD dimension table involves merging data from multiple source tables, which are usually normalized. SCD tables contain a pair of date columns (effective and expiry dates) that represent the record’s validity date range. Changes are inserted as new active records effective from the date of data loading, while simultaneously expiring the current active record on a previous day. During each data load, incoming change records are matched against existing active records, comparing each attribute value to determine whether existing records have changed or were deleted or are new records coming in.

Click through for the article.

Comments closed

Data Mesh Q&A Round 2

Jean-Georges Perrin didn’t hear no bell:

How does the Data Mesh concept differ from similar efforts in the past, like EDM (Enterprise Data Model) or MDM (Master Data Model)?
Data Mesh will help us achieve those goals more quickly as those EDM and MDM projects are usually slow, and the ROI starts showing only after deployment. The product approach of Data Mesh for its data products enables a product lifecycle mentality that will help get from a current state to an (end?) state like EDM through versioning. It also allows EDM to be versioned more efficiently and reduces time to market.

Read on for a series of questions and answers around the topic of data mesh architecture.

Comments closed

Lateral Joins in Snowflake

Kevin Wilkie makes a lateral move:

In Snowflake, you can do something akin to an APPLY, but not. That, my friends, is a LATERAL join. According to Snowflake documentation, this type of join “allows an inline view to reference columns from the table expression that precedes the inline view.”

Yes, that’s gibberish. Let’s show what you’re doing.

I’m such an APPLY apologist that the description made sense to me.

Comments closed

Sun Modeling and SunBeam

Shannon Bloye takes us through a new analytics systems modeling technique:

 Sun Modelling was a technique initially developed and taught by Mark Whitehorn as a professor of analytics at the University of Dundee. Which is where our own Terry McCann encountered the approach whilst studying for his MSc. He does a great talk on the topic in this video.

A core aim of the method is to offer a simplicity that makes it accessible to end users as well as the usual technical professionals. The approach is a high-level visual means to model data around a business process.

This feels a bit like a Kimball model but where you’re explicitly diagramming hierarchies and common slicers.

Comments closed

Splitting Strings to a Table in Snowflake

Kevin Wilkie puts on the flannel and grabs his database lumberjack axe to split some strings:

But, sometimes, you want a small list of data inherent to a query in SnowFlake. And that’s what I want to talk about today.

In SQL Server, you would create a temp table and then insert the data into it. But in Snowflake, there may be a better / easier way.

Let’s use the function SPLIT_TO_TABLE. Shockingly, it does what’s on the label – it split data and puts it into a table.

Click through for an example. Also check out the Snowflake documentation, where they make use of the lateral operator (the ANSI version of APPLY()) to generate results for multiple strings and make use of the SEQ column.

Comments closed

Amazon Redshift 2022 in Review

Manan Goel lists what’s been going on with Amazon Redshift:

In 2021, we launched Amazon Redshift Query Editor V2, which is a free web-based tool for data analysts, data scientists, and developers to explore, analyze, and collaborate on data in Amazon Redshift data warehouses and data lakes. In 2022, Query Editor V2 got additional enhancements such as notebook support for improved collaboration to author, organize, and annotate queries; user access through identity provider (IdP) credentials for single sign-on; and the ability to run multiple queries concurrently to improve developer productivity.

Read on for the rest of the highlights.

Comments closed