Press "Enter" to skip to content

Category: Warehousing

Type 2 Dimension Loading in Redshift

Vaidy Kalpathy takes us through a bit of dimensional modeling in AWS Redshift:

Populating an SCD dimension table involves merging data from multiple source tables, which are usually normalized. SCD tables contain a pair of date columns (effective and expiry dates) that represent the record’s validity date range. Changes are inserted as new active records effective from the date of data loading, while simultaneously expiring the current active record as of the previous day. During each data load, incoming change records are matched against existing active records, comparing each attribute value to determine whether existing records have changed, were deleted, or are new records coming in.

Click through for the article.
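To make that pattern a bit more concrete, here is a minimal sketch of the expire-and-insert load in Redshift SQL. The dim_customer and stg_customer tables and their columns are hypothetical stand-ins, not taken from the article:

    -- Hypothetical tables: dim_customer is the SCD2 dimension,
    -- stg_customer holds today's incoming change records.
    BEGIN;

    -- Expire the currently active rows whose attributes have changed,
    -- closing their validity range on the day before the load date.
    UPDATE dim_customer
    SET expiry_date = CURRENT_DATE - 1,
        is_active   = FALSE
    FROM stg_customer s
    WHERE dim_customer.customer_id = s.customer_id
      AND dim_customer.is_active = TRUE
      AND (dim_customer.name <> s.name OR dim_customer.city <> s.city);

    -- Insert incoming rows that no longer have an active match:
    -- brand-new customers plus the versions expired just above.
    INSERT INTO dim_customer
        (customer_id, name, city, effective_date, expiry_date, is_active)
    SELECT s.customer_id, s.name, s.city,
           CURRENT_DATE, DATE '9999-12-31', TRUE
    FROM stg_customer s
    LEFT JOIN dim_customer d
      ON d.customer_id = s.customer_id
     AND d.is_active = TRUE
    WHERE d.customer_id IS NULL;

    COMMIT;

Handling deletes (expiring active rows with no incoming match) would be another pass of the same expire step; the article covers that case as well.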

Data Mesh Q&A Round 2

Jean-Georges Perrin didn’t hear no bell:

How does the Data Mesh concept differ from similar efforts in the past, like EDM (Enterprise Data Model) or MDM (Master Data Model)?
Data Mesh will help us achieve those goals more quickly as those EDM and MDM projects are usually slow, and the ROI starts showing only after deployment. The product approach of Data Mesh for its data products enables a product lifecycle mentality that will help get from a current state to an (end?) state like EDM through versioning. It also allows EDM to be versioned more efficiently and reduces time to market.

Read on for a series of questions and answers around the topic of data mesh architecture.

Sun Modeling and SunBeam

Shannon Bloye takes us through a new analytics systems modeling technique:

Sun Modelling was a technique initially developed and taught by Mark Whitehorn as a professor of analytics at the University of Dundee, which is where our own Terry McCann encountered the approach whilst studying for his MSc. He does a great talk on the topic in this video.

A core aim of the method is to offer a simplicity that makes it accessible to end users as well as the usual technical professionals. The approach is a high-level visual means to model data around a business process.

This feels a bit like a Kimball model but where you’re explicitly diagramming hierarchies and common slicers.

Splitting Strings to a Table in Snowflake

Kevin Wilkie puts on the flannel and grabs his database lumberjack axe to split some strings:

But, sometimes, you want a small list of data inherent to a query in Snowflake. And that’s what I want to talk about today.

In SQL Server, you would create a temp table and then insert the data into it. But in Snowflake, there may be a better / easier way.

Let’s use the function SPLIT_TO_TABLE. Shockingly, it does what’s on the label – it splits data and puts it into a table.

Click through for an example. Also check out the Snowflake documentation, where they use the lateral operator (the ANSI version of APPLY()) to generate results for multiple strings and make use of the SEQ column.
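For flavor, here’s a minimal sketch of both forms; the literal strings and aliases are made up, and SEQ, INDEX, and VALUE are the columns the table function returns:

    -- Split a single delimited string into one row per element.
    SELECT t.value
    FROM TABLE(SPLIT_TO_TABLE('apple,banana,cherry', ',')) AS t;

    -- Split one string per source row via LATERAL (the ANSI cousin of APPLY);
    -- SEQ ties each output row back to the input row it came from.
    SELECT s.seq, s.index, s.value
    FROM (SELECT 'red,green,blue' AS colors
          UNION ALL
          SELECT 'cyan,magenta' AS colors) src,
         LATERAL SPLIT_TO_TABLE(src.colors, ',') AS s;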

Amazon Redshift 2022 in Review

Manan Goel lists what’s been going on with Amazon Redshift:

In 2021, we launched Amazon Redshift Query Editor V2, which is a free web-based tool for data analysts, data scientists, and developers to explore, analyze, and collaborate on data in Amazon Redshift data warehouses and data lakes. In 2022, Query Editor V2 got additional enhancements such as notebook support for improved collaboration to author, organize, and annotate queries; user access through identity provider (IdP) credentials for single sign-on; and the ability to run multiple queries concurrently to improve developer productivity.

Read on for the rest of the highlights.

Managed Self-Service BI in Power BI

Gogula Aryalingam has started a series on managed self-service BI. Part 1 provides an overview of the topic:

When putting together a business intelligence strategy using Power BI, Microsoft recommends three primary strategies that an organization can adopt. Out of these, the one that I tend to go with is managed self-service BI, which brings forth the concept of discipline at the core, flexibility at the edge. This concept is the dominant strategy used for BI at Microsoft itself, explained very nicely in this article. It’s my personal favorite because I find it an effective means of onboarding customers once the core platform is built with the required standards (discipline), and then helping them adopt the solution from the edge, thus providing them with the best of both worlds.

Part 2 takes us to the edge:

Now, what happens when an analyst, for instance, has a set of sales target spreadsheets and wants to compare the figures with sales metrics so that salespeople’s performances can be measured? It certainly needs a new dataset. However, flexibility at the edge has to prevail in the right way. This post will look at how we can go about this while keeping to discipline at the core, flexibility at the edge.

Note: The analyst’s requirement is currently local to their group or department. It has not yet been made an organizational requirement. That’s how most requirements start out: a requirement at the departmental level; then, when enough people start reaping the benefits within and outside of the department, it can get absorbed into the core.

Part 3 returns to the core:

One problem that we may have overlooked when building a bunch of core datasets in that post is that certain dimensions tend to duplicate across the datasets. Imagine a scenario where the single master data source of a managed self-service setup is a data warehouse, which sources all the required dimensions. When you have, for example, core reseller sales, internet sales, and finance datasets, each one will have a calendar dimension and a few others created within it. This is not ideal if you think about the extent of the duplication and effort that is required.

This is where, once again, using DirectQuery for Power BI datasets and Analysis Services comes into play, where you could draw up a layered core dataset architecture. If we take the example of AdventureWorks’ fact tables in the data warehouse (the single master data source), you can figure out what the business processes are.

Read on for Gogula’s thoughts. I think there’s a lot going for this particular strategy, especially in a large organization with hundreds (or thousands) of people actively using Power BI. At that point, doing everything through a central IT organization doesn’t scale very well.
