Category: Data Modeling

Defining a Data Contract

Buck Woody becomes accountable:

A businessperson pulls a report from a data warehouse, runs the same query they’ve used for two years, and gets a number that doesn’t match what the finance team presented at yesterday’s board meeting. Nobody changed the report. Nobody changed the dashboard. But somewhere upstream, an engineering team renamed a field, shifted a column type, or quietly altered the logic in a pipeline, and nobody thought to mention it because there was no mechanism to mention it.

While we think of this as an engineering failure, it’s more of an implied contract failure. More precisely, it’s the absence of a formal contract. Data contracts are one of the most practical tools a data organization can adopt, and one of the most underused. The idea is not complicated: a data contract is a formal, enforceable agreement between the team that produces data and the team that consumes it. It defines what the data looks like, what quality standards it must meet, who owns it, and what happens when something changes. Think of it as the API layer for your data, the same guarantee a software engineer expects from a well-documented endpoint, applied to the datasets and pipelines your business depends on. This post is about why that matters at the CDO level and how to get them put in place.

Click through to learn more about what data contracts are and why they are useful. This post stays at the architectural level rather than the practitioner level, but lays out why it’s important to think about these sorts of things.
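To make the "API layer for your data" idea concrete, here is a minimal sketch in plain Python. It is not from the linked post: the `ColumnSpec` and `DataContract` classes, the `sales.orders` dataset, and the owner name are all hypothetical, and a real contract would also cover quality thresholds and change notification. It only shows how a renamed field or a shifted column type becomes a reported violation instead of a silent surprise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """One column the producer promises to deliver (hypothetical shape)."""
    name: str
    dtype: type
    nullable: bool = False


@dataclass(frozen=True)
class DataContract:
    """A toy contract: dataset name, owning team, and the agreed columns."""
    dataset: str
    owner: str
    columns: tuple

    def violations(self, rows):
        """Return human-readable problems; an empty list means the rows conform."""
        expected = {col.name for col in self.columns}
        problems = []
        for i, row in enumerate(rows):
            present = set(row)
            # Renamed or dropped fields show up as missing plus unexpected.
            for name in sorted(expected - present):
                problems.append(f"row {i}: missing column {name!r}")
            for name in sorted(present - expected):
                problems.append(f"row {i}: unexpected column {name!r}")
            for col in self.columns:
                if col.name not in row:
                    continue
                value = row[col.name]
                if value is None:
                    if not col.nullable:
                        problems.append(f"row {i}: {col.name!r} must not be null")
                elif not isinstance(value, col.dtype):
                    problems.append(
                        f"row {i}: {col.name!r} expected {col.dtype.__name__}, "
                        f"got {type(value).__name__}"
                    )
        return problems


# Hypothetical contract for an orders feed.
orders_contract = DataContract(
    dataset="sales.orders",
    owner="orders-engineering",
    columns=(
        ColumnSpec("order_id", int),
        ColumnSpec("amount", float),
        ColumnSpec("coupon_code", str, nullable=True),
    ),
)

good = [{"order_id": 1, "amount": 19.99, "coupon_code": None}]
renamed = [{"order_id": 2, "total": 5.0, "coupon_code": None}]    # field renamed upstream
retyped = [{"order_id": "3", "amount": 5.0, "coupon_code": None}]  # column type shifted

for batch in (good, renamed, retyped):
    print(orders_contract.violations(batch))
```

Running a check like this in the producer's pipeline, before data is published, is what turns the agreement from documentation into something enforceable.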

An Overview of Fabric IQ

Brian Bonk talks ontologies:

If you followed along with the announcements from Microsoft Ignite, you might have stumbled upon the new Fabric IQ service.

For many people, this new service can seem a bit strange to see the point in, so in this blogpost I will try to help you understand the usage and business value of the new service.

Ontologies aren’t new, and building one is mostly a metadata management exercise, but several companies (Palantir, for example) are pushing them hard in their tools, and Microsoft is working that market segment. Instead of using all of this metadata management for data quality or master data management reasons, though, the aim here is to feed language models.

Thoughts on Data Modeling

Steve Jones has a two-fer. First up, he asks an opinion question about data modeling:

Recently, I had a few questions on database modeling. One was posted in the SQL Server Central forums, and a customer asked about ERD tooling on the same day. This came shortly after Redgate acquired Vertabelo (now Redgate Data Modeler). This stood out to me as very rarely in the last few years have I found people consulting and updating a diagram while performing database development.

Second, he takes a peek at a tool Redgate purchased:

Redgate acquired a data modeling tool from Vertabelo recently and I wanted to explore how it works. This is a short look at this tool and how it might be useful in working with databases.

My experience with data modeling has been that only the really large companies did a lot of work with upfront data modeling and keeping logical models up to date. It’s still quite useful for data warehouses, and that’s where the people I know who do a lot of data modeling make their living. But I find it’s too much of a hassle in fast-paced environments, especially when I can keep most or all of the data model in my head and I’m the person managing it all.

Essentially, data models are useful to the extent that they’re approximately true. But because they quickly get out of sync with reality, they quickly go from “quite useful” to “dirty lies.”

What-If Analysis in Power BI

Ben Richardson takes us through a what-if analysis:

What If Analysis is a modelling technique used to evaluate different outcomes by changing key input variables.

In Power BI, it uses What If parameters and dynamic DAX measures that recalculate outputs based on user input. Users can ask questions like:

  • “What if sales increase by 10%?”
  • “What if production costs drop by 5%?”

The parameters are created in the Modelling tab, where you define value ranges. Power BI automatically generates a slicer and a measure, which can then be used in DAX calculations to dynamically adjust metrics like revenue, cost, or profit.

Read on to see how it works, with the understanding that you have to supply the formulas describing how your metrics respond to the parameter. In other words, if your what-if parameter is around the unit price of some product, there is no built-in concept of price elasticity for the product. That’s something you’d have to implement yourself.
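As a rough illustration of what "implement yourself" means, here is a sketch in Python rather than DAX. The constant-elasticity demand curve is my assumption, not something from Ben's post or from Power BI, and the `what_if` function and its parameter names are hypothetical. The price change plays the role of the what-if parameter; everything after that is behavior you have to model.

```python
def what_if(base_price, base_units, price_change_pct, elasticity):
    """Project price, units, and revenue after a percentage price change.

    Assumes constant price elasticity of demand:
        new_units = base_units * (new_price / base_price) ** elasticity
    An elasticity of 0 means demand ignores price; -1 means revenue stays flat.
    """
    new_price = base_price * (1 + price_change_pct / 100)
    new_units = base_units * (new_price / base_price) ** elasticity
    return {
        "price": new_price,
        "units": new_units,
        "revenue": new_price * new_units,
    }


# Same 10% price increase under three assumed demand responses.
for elasticity in (0.0, -1.0, -2.0):
    result = what_if(
        base_price=10.0, base_units=1000, price_change_pct=10, elasticity=elasticity
    )
    print(elasticity, round(result["revenue"], 2))
```

The slicer only hands you the 10%. Whether revenue goes up, stays flat, or drops depends entirely on the elasticity you choose to encode.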

Thoughts on Data Integrity

Deborah Melkin shares some thoughts:

The first way to think of data integrity is a very small and literal interpretation. This is making sure that our data in the database is good. In many ways, these are easy to enforce – you add constraints. Primary Keys ensure that you know what makes each row unique. Unique constraints represent what would make each record unique if the primary key constraint, which is often a surrogate key these days, didn’t exist or offer different options. 

Read on for more about database design, default constraints, and a dive into data modeling.
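For a quick feel of the constraints Deborah mentions, here is a small sketch using Python's built-in `sqlite3` module. The `customer` table and its columns are hypothetical, and SQLite stands in for whatever database engine you actually run. It shows a surrogate primary key, a unique constraint on the natural key, and a default constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,            -- surrogate key
        email       TEXT NOT NULL UNIQUE,           -- what really makes a row unique
        status      TEXT NOT NULL DEFAULT 'active'  -- default constraint
    )
    """
)


def try_insert(email):
    """Insert a customer; return False if a constraint rejects the row."""
    try:
        conn.execute("INSERT INTO customer (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("a@example.com"))  # first insert of this email
print(try_insert("a@example.com"))  # duplicate natural key
print(conn.execute("SELECT customer_id, email, status FROM customer").fetchall())
```

Without the unique constraint, the surrogate key alone would happily accept the same customer twice under two different IDs.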

Custom SCD2 with PySpark

Abhishek Trehan creates a type-2 slowly changing dimension:

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

SCD2 is a dimension that stores and manages current and historical data over time in a data warehouse. The purpose of an SCD2 is to preserve the history of changes. If a customer changes their address, for example, or any other attribute, an SCD2 allows analysts to link facts back to the customer and their attributes in the state they were at the time of the fact event.

Read on for an implementation in Python.
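If you want the gist before clicking through, here is a minimal sketch of the type-2 idea in plain Python. It is not Abhishek's PySpark implementation: the `CustomerVersion` record, the `valid_from`/`valid_to` column names, and both helper functions are hypothetical, and it tracks a single attribute.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CustomerVersion:
    customer_id: int
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history, customer_id, new_address, effective):
    """Return a new history with the change applied as a type-2 update."""
    current = None
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            current = row
    # No real change: keep history as-is rather than adding a duplicate version.
    if current is not None and current.address == new_address:
        return list(history)
    # Close out the old version, then append the new current version.
    updated = [replace(r, valid_to=effective) if r is current else r for r in history]
    updated.append(CustomerVersion(customer_id, new_address, effective))
    return updated


def address_as_of(history, customer_id, when):
    """Find the address that was in effect on a given date (valid_to is exclusive)."""
    for row in history:
        if (
            row.customer_id == customer_id
            and row.valid_from <= when
            and (row.valid_to is None or when < row.valid_to)
        ):
            return row.address
    return None


history = apply_change([], 1, "1 Old Road", date(2023, 1, 1))
history = apply_change(history, 1, "2 New Street", date(2024, 6, 1))

# A fact dated before the move still resolves to the old address.
print(address_as_of(history, 1, date(2023, 12, 31)))
print(address_as_of(history, 1, date(2024, 6, 1)))
```

The `address_as_of` lookup is the part that matters to analysts: it links a fact back to the attributes as they stood at the time of the event.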

Implementing Role-Playing Dimensions in Power BI

Teo Lachev puts on a mask:

Role-playing dimensions are a popular business requirement but yet challenging to implement in Power BI (and Tabular) due to a long-standing limitation that two tables can’t be joined multiple times with active relationships. Declarative relationships are both a blessing and a curse and, in this case, we are confronted with their limitations. Had Power BI allowed multiple relationships, the user must be prompted which path to take. Interestingly, a long time ago Microsoft considered a user interface for the prompting but dropped the idea for unknown reasons.

Given the existing technology limitations, you have two implementation choices for implementing subsequent role-playing dimensions: duplicating the dimension table (either in DW or semantic model) or denormalizing the dimension fields into the fact table. The following table presents pros and cons of each option:

Click through for that table, as well as some thoughts on viable approaches, including an edge case.

Tips for Optimizing Power BI Semantic Models

Koen Verbeeck shares some tips:

Power BI is designed to be user-friendly. With just a few clicks, you can import data from various sources, combine them together in one data model and start analyzing it using powerful data visualizations. This sometimes leads to a scenario where people are just importing data into the tool without giving it too much thought. When you’re working on a solo project on a small dataset, there probably won’t be too many issues. But what if your report is successful and you want to share it with your colleagues and maybe other departments? Or more data is loaded into the model, but refreshes are taking more and more time? Even other data sources are added into your model, but writing DAX formulas has become hard, and reports are slowing down.

In this article, we’ll cover a couple of tricks that will help you make your Power BI models smaller, faster and easier to maintain. In the immortal words of Daft Punk: “Harder. Better. Faster. Stronger”.

Click through for those tricks and tips.

Microsoft Purview Classifications and Sensitivity Labels

James Serra labels the data:

I see a lot of confusion on how classifications and sensitivity labels work in Microsoft Purview. This blog will help to clear that up, but I first must address the confusion with Purview now that multiple products have been renamed to Microsoft Purview. I decided to use a question-and-answer format that will hopefully clear up the confusion (I was very confused too!):

Purview is a fantastic product. I just wish it cost about 10% as much as it does; then I could heartily recommend it to people.

Microsoft Fabric and Semantic Models

Kurt Buhler has a choose-your-own-adventure story:

Semantic models are integral to Microsoft Fabric. They use and are used by many of the different workloads. In Fabric, there’s more items that can connect to and consume your model—such as semantic link in notebooks. Because of these new options and tools, your model is exposed to additional types of users who will use it in different ways. As such, it’s important that you make good models that you manage well throughout their entire lifecycle.

Read on for more information and three separate scenarios.