Press "Enter" to skip to content

Category: Business Intelligence

The Basics of Fact-Dimensional Modeling

Nikola Ilic gives us a primer on Kimball-style fact and dimensional modeling:

Before we get to why dimensional modelling is named the way it is, let’s first take a brief tour through some history. In 1996, Ralph Kimball published “The Data Warehouse Toolkit”, a book which is still considered the dimensional modelling “Bible”. In it, Kimball introduced a completely new approach to modelling data for analytical workloads, the so-called “bottom-up” approach. The focus is on identifying key business processes within the organization and modelling these first, before introducing additional business processes.
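
To make the fact/dimension split concrete, here is a minimal sketch in Python with pandas (my own illustration, not from Nikola’s post; the table and column names are made up): a fact table records business events and holds the measures plus foreign keys, while small dimension tables hold the descriptive attributes you slice by.

    import pandas as pd

    # Dimension tables: descriptive attributes, one row per member (made-up data).
    dim_date = pd.DataFrame({
        "date_key": [20240101, 20240102],
        "calendar_date": ["2024-01-01", "2024-01-02"],
        "month_name": ["January", "January"],
    })
    dim_product = pd.DataFrame({
        "product_key": [1, 2],
        "product_name": ["Widget", "Gadget"],
        "category": ["Hardware", "Hardware"],
    })

    # Fact table: one row per business event (a sale), holding measures
    # plus foreign keys pointing at the dimensions.
    fact_sales = pd.DataFrame({
        "date_key": [20240101, 20240101, 20240102],
        "product_key": [1, 2, 1],
        "quantity": [3, 1, 5],
        "sales_amount": [30.0, 25.0, 50.0],
    })

    # A typical analytical query: join facts to dimensions, then aggregate.
    report = (
        fact_sales
        .merge(dim_product, on="product_key")
        .merge(dim_date, on="date_key")
        .groupby(["month_name", "category"], as_index=False)["sales_amount"]
        .sum()
    )
    print(report)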

This is a really good overview of the topic, though I’m saddened that “dimensional bus matrix” didn’t make the cut of things to discuss. Mostly because I like the name “dimensional bus matrix.”

Comments closed

Defining an Analytics Engineer

Ust Oldfield defines a term:

Analytics Engineering, along with Data Engineering and Report Engineering, is a specialised subset of skills that would previously be the preserve of a Business Intelligence (BI) Developer. The BI Developer was once a generalist data developer, whose overall responsibilities have been split out and shared among specialist developers as the prevalence of data across organisations has increased and the tools and technologies used to ingest, transform, and serve data have become more specialised and loosely integrated.

In the same way that Data Engineering borrowed and took inspiration from Software Engineering for applying repeatable and scalable patterns and techniques to the pipelines that ingest and cleanse data, as well as the rigorous testing of those pipelines, Analytics Engineering has borrowed and taken inspiration from Software Engineering too.
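
As a loose illustration of that borrowing (my sketch, not Ust’s), an analytics engineer might treat even a small cleansing step as versioned, testable code with a repeatable test alongside it; the function and rules below are entirely hypothetical.

    import pandas as pd

    def cleanse_customers(raw: pd.DataFrame) -> pd.DataFrame:
        """Standardise a raw customer extract: trim names, lower-case emails,
        and drop duplicate customer ids (illustrative rules only)."""
        out = raw.copy()
        out["name"] = out["name"].str.strip()
        out["email"] = out["email"].str.lower()
        return out.drop_duplicates(subset="customer_id")

    def test_cleanse_customers():
        # A small, repeatable test in the software-engineering style described above.
        raw = pd.DataFrame({
            "customer_id": [1, 1, 2],
            "name": [" Ada ", " Ada ", "Grace"],
            "email": ["Ada@Example.com", "Ada@Example.com", "grace@example.com"],
        })
        clean = cleanse_customers(raw)
        assert len(clean) == 2
        assert clean["email"].str.islower().all()

    test_cleanse_customers()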

Click through for the specifics of what an Analytics Engineer does.

Comments closed

Seven Principles for BI Skill Development

Brett Powell thinks about first principles:

Data and analytics languages should be prioritized far beyond graphical interface tools/software/services and should form a solid foundation of a skillset. Unlike software applications and various user interface controls which change frequently, the essential concepts and semantics of data languages such as SQL and DAX don’t change nearly as frequently and thus languages offer a much greater return on the time invested to learn them. For example, the fundamental PowerShell scripting knowledge I built up years ago using the Windows PowerShell ISE can still be applied today in many different tools, apps, and services that weren’t around back then such as Azure Function Apps and Visual Studio Code.

In almost every BI project I can remember, even projects that were explicitly intended to use low-code or no-code tools, it was the combination of different languages such as SQL, DAX, Kusto (KQL), Power Fx, and others that delivered the most value or which made the difference between project success and failure. Similarly, even in projects in which my role was intended to exclusively focus on the data model layer with DAX, I’ve almost always found myself also writing SQL, Power Query (M) and using other languages and code either in the data warehouse or on the reporting layer.

Brett has put a lot of thought into this and I think many of the principles apply outside of business intelligence work as well.

Comments closed

Interchangeability between ADF and Synapse Integration Pipelines

Paul Andrew makes a discovery:

Inspired by an earlier blog where we looked at ‘How Interchangeable Delta Tables Are Between Databricks and Synapse’, I decided to do a similar exercise, but this time with the integration pipeline components taking centre stage.

As I said in my previous blog post, the question in the heading of this blog should be incredibly pertinent to all solution/technical leads delivering an Azure-based data platform solution, so to answer it directly:

Read on to learn the answer.

Comments closed

Diving into Data Mesh

Ust Oldfield has a series on data mesh. First up is a primer on the topic:

What is a Data Mesh? Data Mesh is new and was introduced in a few articles by Zhamak Dehghani, starting in May 2019. The first article, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, and the second article, Data Mesh Principles and Logical Architecture, form the foundational thought pieces on data meshes. Very simply, and crudely, a Data Mesh aims to overcome the deficiencies of previous generations of analytical architectures by decentralising the ownership and production of analytical data to the teams who own the data domain. It is a convergence of Distributed Domain Driven Architecture, Self-serve Platform Design, and Product Thinking with Data.

Essentially, this applies the learnings from operational systems, where domain-driven design means ownership of data is domain-oriented. For example, a CRM team will own the Customer domain and all the data within it. Instead of data from all domains flowing into a central data repository, the individual domains curate and serve their datasets in an easily consumable way. The result is a much closer alignment between operational and analytical data.
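
A rough sketch of that idea in Python (the DataProduct shape and all names are hypothetical, not drawn from Dehghani’s articles): each domain team publishes its analytical dataset as a product with an explicit, consumable interface, instead of shipping raw tables into a central store.

    from dataclasses import dataclass
    from typing import Callable

    import pandas as pd

    @dataclass
    class DataProduct:
        """A domain-owned analytical dataset exposed through an explicit output port."""
        domain: str                         # e.g. "Customer", owned by the CRM team
        name: str
        owner_team: str
        read: Callable[[], pd.DataFrame]    # the consumable interface

    def _customer_snapshot() -> pd.DataFrame:
        # The CRM team curates and serves this; consumers never touch raw CRM tables.
        return pd.DataFrame({
            "customer_id": [1, 2],
            "segment": ["Retail", "Enterprise"],
        })

    customer_product = DataProduct(
        domain="Customer",
        name="customer_snapshot",
        owner_team="CRM",
        read=_customer_snapshot,
    )

    # A consuming domain (say, Finance) reads the product through its interface.
    print(customer_product.read())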

Then Ust has a deep dive:

A foundational principle of data mesh is the decentralisation of ownership of data to those closest to it. Those closest to it are those in the business, often using the operational systems and also using analytical data. Responsibility and ownership of the data is devolved from a central function to business units and domains. Therefore any change to how a business domain organises itself is limited to the domain itself rather than impacting the entire organisation. This is referred to as the bounded domain context.

Teams responsible for the business operation, e.g. CRM team responsible for Customers, are also responsible for the ownership and serving of the analytical data relating to their domain.

I probably need to spend more time thinking about it, but this feels like a rickety ladder of incentive compatibility problems. I agree with Ust’s characterization of warehouses as staid and that relatively little has happened since the Kimball model came to prominence, but this sounds like taking the problems of the Kimball model (slow to change, an iceberg of code & ETL, etc.) and effectively saying “Hey, here are slightly different views of the source systems, have fun” instead. That way, everybody gets to experience the joy of non-conforming dimensions, disagreeing facts, working with data in different grains, and seven terms for the same thing.

Comments closed

Fun with Excel Cube Functions

Chris Webb has a new series on Excel cube functions combined with Lambda helper functions. First up, Chris collects items from a set:

This example shows the fundamental problem that has always existed with CubeRankedMember though: in order to show all the items in a set you need to know how many items there are in advance, and populate as many cells with CubeRankedMember formulas as there are items. In this case see how the range B4:B6 contains the numbers 1, 2 and 3; these numbers are used in the formulas in the range C4:C6 to get the first, second and third items in the set.

If a fourth product was added to the table, however, it would not appear automatically – you would have to add another cell with another CubeRankedMember formula in it manually. I’ve seen some workarounds but they’re a bit hacky and require you to know what the maximum possible number of items in a set could ever be. Indeed that’s always been one of the key differences between cube functions and PivotTables: cube functions are static whereas PivotTables can grow and shrink dynamically when the data changes.

Read on for a better answer. Then, check out part 2, which covers generating a dynamic table using Excel cube functions and Lambda helpers:

In the first post in this series I showed how to use the new Excel Lambda helper functions to return an array containing all the items in a set. That isn’t very useful on its own, so in this post I’ll show you how to generate an entire dynamic table using Excel cube functions and Lambda helper functions.

In this post I’ll be using the same source data as in my previous post: a table containing sales data with just two columns.

Comments closed

Data Mesh and Ownership Strategies

James Serra aims to clear up some confusion:

I have done a ton of research lately on Data Mesh (see the excellent Building a successful Data Mesh – More than just a technology initiative for more details), and have some concerns about the paradigm shift it requires. My last blog tackled the one about Centralized vs decentralized data architecture. In this one I want to talk about centralized ownership vs decentralized ownership, along with another paradigm shift (or core principle) closely related to it, siloed data engineering teams vs cross-functional data domain teams.

First, I wanted to mention that there is a Data Mesh Learning Slack channel that I have spent a lot of time reading, and what is apparent is that there is a lot of confusion on exactly what a data mesh is and how to build it. I see this as a major problem: the more difficult it is to explain a concept, the more difficult it will be for companies to successfully build that concept, so the promise of a data mesh improving the failure rates for big data projects will be difficult to achieve if we can’t all agree exactly what a data mesh is. What’s more, the core principles of the data mesh sound great in theory but will be challenging to implement, hence my thoughts in this blog on centralized ownership vs decentralized ownership.

Read on for James’s take on the matter.

Comments closed

Historical Dimensions in a Kimball-Style Model

Vince Iacoboni takes a stab at improving the Kimball model:

We owe a lot to Ralph Kimball and friends. His practical warehouse design and conformed-dimension bus architecture are the industry standard. Business users can understand and query these warehouses directly and gain valuable insights into the business. Kimball’s practical approach focuses squarely on clarity and ease of use for the business users of the warehouse. Kudos to you and yours, Mr. Kimball.

That said, can the mainstay Type 2 slowly changing dimension be improved? I here present the concept of historical dimensions as a way to solve some issues with the basic Type 2 slowly changing dimension promoted by Kimball. As we will see, clearly distinguishing between current and past dimension values pays off in clarity of design, flexibility of presentation, and ease of ETL maintenance.

As I was reading this, I was thinking “This sounds like a type 4 SCD” and Vince walks us through the differences between the two ideas. I’m not absolutely sold on the idea, but it is certainly interesting.
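
For readers who want to see the distinction on the page, here is a small pandas sketch of the general pattern (mine, not Vince’s exact design): a classic Type 2 dimension keeps current and expired rows together with effective dates, while the split approach keeps a slim current-values dimension and moves prior versions into a separate historical table.

    import pandas as pd

    # Classic Type 2 SCD: current and expired versions live in one table,
    # distinguished by effective/end dates and a current-row flag.
    dim_customer_type2 = pd.DataFrame({
        "customer_sk": [101, 102],
        "customer_id": [1, 1],
        "city": ["London", "Leeds"],
        "effective_date": ["2023-01-01", "2024-06-01"],
        "end_date": ["2024-05-31", "9999-12-31"],
        "is_current": [False, True],
    })

    # Split approach (in the spirit of a Type 4-style design): a slim dimension
    # holding only current values, plus a separate table for prior versions.
    is_current = dim_customer_type2["is_current"]
    dim_customer_current = dim_customer_type2.loc[is_current, ["customer_id", "city"]]
    dim_customer_history = dim_customer_type2.loc[
        ~is_current, ["customer_id", "city", "effective_date", "end_date"]
    ]

    print(dim_customer_current)
    print(dim_customer_history)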

Comments closed

Mistakes to Avoid in a BI Platform Migration

Chris Webb covers five things to consider when migrating your BI platform, using Power BI as an example:

Every report has a data source and getting source data in the right format for your BI platform is a substantial task – so much so that you might be tempted to put Power BI on top of the data sources you have created for your previous BI platform with no changes. However, different BI platforms need their data in different formats. Many BI platforms like their data munged together in one big table, sometimes even with data at different granularities in the same table. Power BI, on the other hand, likes its source data modelled as a star schema (you can find out what a star schema is and why it’s important here). If you don’t model your data as a star schema you may find that you see incorrect values in your reports, that report performance is poor, and that it’s a lot harder to write the DAX calculations that you need.
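
To see why the “one big table” habit can produce incorrect values, here is a contrived pandas sketch (my example, not Chris’s): an order-level value flattened onto line-level rows gets over-counted by a simple sum, whereas keeping one fact table per grain aggregates cleanly.

    import pandas as pd

    # One big table: line-level rows with an order-level shipping fee
    # repeated on every line (a common flattening).
    big_table = pd.DataFrame({
        "order_id":     [1, 1, 1],
        "product":      ["Widget", "Gadget", "Cable"],
        "line_amount":  [30.0, 20.0, 10.0],
        "shipping_fee": [5.0, 5.0, 5.0],   # order-level value duplicated per line
    })

    # Naive aggregation triple-counts the fee: 15.0 instead of the 5.0 actually charged.
    print(big_table["shipping_fee"].sum())

    # Star-schema style: keep each measure in a fact table at its own grain.
    fact_order_lines = big_table[["order_id", "product", "line_amount"]]
    fact_orders = big_table[["order_id", "shipping_fee"]].drop_duplicates()

    print(fact_order_lines["line_amount"].sum())   # 60.0, line revenue
    print(fact_orders["shipping_fee"].sum())       # 5.0, the real shipping fee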

Four out of the five fit just as well with any other data platform technology.

Comments closed