Press "Enter" to skip to content

Category: Data Modeling

Microsoft Fabric and Semantic Models

Kurt Buhler has a choose-your-own-adventure story:

Semantic models are integral to Microsoft Fabric. They use and are used by many of the different workloads. In Fabric, there’s more items that can connect to and consume your model—such as semantic link in notebooks. Because of these new options and tools, your model is exposed to additional types of users who will use it in different ways. As such, it’s important that you make good models that you manage well throughout their entire lifecycle.

Read on for more information and three separate scenarios

Comments closed

Using Schema Registry for Data Quality in Apache Kafka

Kai Waehner talks data quality:

Good data quality is one of the most critical requirements in decoupled architectures, like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry enforces message structures. This blog post looks at enhancements to leverage data contracts for policies and rules to enforce good data quality on field-level and advanced use cases like routing malicious messages to a dead letter queue.

Click through to learn more about the topic. This focuses a lot on the “why” and “what” but does have an example of “how” in there as well.

Comments closed

The Value of Data Lineage

Chisom Kanu explains why data lineage matters:

Data lineage is a component of modern data management that helps organizations understand the origins, transformations, and movement of their data. It is like a road map that shows us where our data has been, how it has changed, and where it is going, just like tracking the journey of a package: from the person who sent it (the source) to the places it passes through, and finally to the person who receives it.

The concept of data lineage has been around for many years, but it has become increasingly important in recent years due to the growth of big data and the increasing complexity of data processing systems.

Read on to learn more about data lineage.

Comments closed

Building a Multi-Tenant Database

Adron Hall looks at multi-tenancy within Postgres:

Music has always been a significant part of my life. From the melodies that accompany my daily routines to the anthems of my most memorable moments, it’s been a constant. As my collection grew, I realized I needed a better way to organize it. That’s when I stumbled upon the concept of multi-tenancy databases and decided to give it a shot with PostgreSQL. Here’s my experience.

Multi-tenancy is one case in which I’m much more relaxed about including the tenant ID on tables where it is not absolutely necessary in order to prevent a series of joins to get the appropriate tenant ID. We can quibble about whether that’s reasonable denormalization or appropriate use of a superkey—especially because, in SQL Server, tenant ID ends up being part of the clustered index and likely part of the primary key anyhow—but it’s extremely useful nonetheless.

Comments closed

String Regularization and Tokenization in SQL Server

Aaron Bertrand saves some space:

The Stack Exchange network logs a lot of web traffic – even compressed, we average well over a terabyte per month. And that is just a summarized cross-section of our overall raw log data, which we load into a database for downstream security and analytical purposes. Every month has its own table, allowing for partitioning-like sliding windows and selective indexes without the additional restrictions and management overhead. (Taryn Pratt talks about these tables in great detail in her post, Migrating a 40TB SQL Server Database.)

It’s no surprise that our log data is massive, but could it be smaller? Let’s take a look at a few typical rows. While these are not all of the columns or the exact column names, they should give an idea why 50 million visitors a month on Stack Overflow alone can add up quickly and punish our storage:

Click through for one technique Aaron has to tighten things up a bit.

Comments closed

Building a Report Model from Agile User Stories

Kelly Broekstra digs into a story:

Are you a model designer or BI developer tasked with building a data model and/or report from a series of user requirements or Agile User Stories? Do you know where to start, or what to try first? Here are some practical tips and techniques that I have used to design a great report model from Agile user stories.

Read on for questions to ask and how to translate that into a star schema model.

Comments closed

The Concept of Schema in Relational Databases

Adron Hall explains how different relational database management systems describe schemas:

From the viewpoint of someone familiar with the general idea of a schema, it can indeed seem unusual that databases like SQL Server, Oracle, MariaDB/MySQL, and PostgreSQL each interpret and implement schemas in slightly (or sometimes, vastly) different ways. While the core idea behind a schema as a structured container or namespace for database objects remains somewhat consistent, the exact nature, utility, and behavior of schemas vary across these systems.

Read on for an overview of these for four products, as well as what the ANSI standard indicates.

Comments closed

The Medallion Architecture in Data Modeling

Nikola Ilic gets the gold:

The most common pattern for modeling the data in the lakehouse is called a medallion. I love this name – it’s really easy to remember. But, why medallion? Tag along and you’ll soon find out why.

The same as for the lakehouse concept, credits for being pioneers in the medallion approach goes to Databricks.

What I’ve found interesting is the number of people who have taken to disliking the medallion architecture terms because Databricks pushed it so hard that their clients automatically assumed “medallion = using Databricks.”

Comments closed

A Story of Bad Data Modeling

Kendra Little unravels a puzzle:

I double-checked my queries. Had I goofed in my sql? Nope. Next, I looked into if some of the data was in an inconsistent state.

What I found was worse than what I’d imagined. As a data person, it made me feel sad and icky.

That’s because it’s usually not too hard to clean up bad data. It’s almost always much harder to fix a badly designed data model which is already established in production.

Read on for a tale as old as time: the clarion call of expediency now causing pain later.

Comments closed

An Overview of Data Modeling

Nikola Ilic provides an overview of data modeling:

In recent years, I’ve done dozens of training on various data platform topics, for all kinds of audiences. When teaching various data platform concepts and techniques, I find one of the concepts particularly intimidating for many business analysts, especially those who are just starting their journey. And, that is the concept of data modeling.

This is a good introduction and does a particularly good job of explaining why we have logical and physical data models. I have one medium-sized quibble with an otherwise-great article: 3rd Normal Form is nowhere near sufficient for a logical data model, and I’d make the strong case (in fact, I do make that case) that 5th Normal Form should be the standard and that 3NF is an anachronism which you should entirely replace with Boyce-Codd Normal Form.

Comments closed