Press "Enter" to skip to content

Category: Data Modeling

Finding Unused Columns in Power BI Data Models

Matt Allington wants to trim the fat:

I have a saying in Power BI. Load every column you need, and nothing that you don’t need. The reason for this advice is that columns can make your data model bigger and less performant. You will of course need some columns in your data model for different purposes. Some are used for defining measures and some are used for slicing, dicing and summarising your data in the various visuals. But it is very common for people to load everything from the source, meaning that some of the columns are likely to be loaded but not used. Once the data model is ready and the reporting is done, it can be beneficial to remove the columns that are not being used and are not likely to be used for ad hoc reporting in the near future. The question is – how do you find the columns not being used? This is where Imke’s Power BI Cleaner tool comes in; I will show you how to use it below.

Read on for Seven Minute Abs for your Power BI data model.

Comments closed

Validating Data Model Results

Paul Turley continues a discussion on Power BI data model validation:

We often have users of a business intelligence solution tell us that they have found a discrepancy between the numbers in a Power BI report and a report produced by their line-of-business (LOB) system, which they believe to be the correct information.

Using the LOB reports as a data source for Power BI is usually not ideal because at best, we would only reproduce the same results in a different report. We typically connect to raw data sources and transform that detail data, along with other data sources with historical information to analyze trends, comparisons and ratios to produce more insightful reports.

However, if the LOB reports are really the north star for data validation, these can provide an effective means to certify that a BI semantic model and analytic reports are correct and reliable.

Click through for more details.

Comments closed

An Introduction to Data Vault

Tino Zishiri walks us through the basics of the Data Vault modeling technique:

The Data Vault methodology also addresses a common limitation that relates to the dimensional model approach. There are many good things to say about dimensional modelling, it’s a perfect fit for doing analytics, it’s easy for business analysts to understand, it’s performant over large sets of data, the list goes on.

That said, the data vault methodology addresses the limitations of having a “fixed” model. Dimensional modelling’s resilience to change or “graceful extensibility”, as some would say, is well documented. It’s capable of handling changing data relationships which can be implemented without affecting existing BI apps or query results. For example, facts consistent with the grain of an existing fact table can be added by creating new columns. Moreover, dimensions can be added to an existing fact table by creating new foreign key columns, presuming they don’t alter the fact table’s grain.

The most interesting thing to me about Data Vault is that it’s very popular in Europe and almost unheard-of in North America. That’s the impression I get, at least.

Comments closed

When Date Tables Go Bad

Brent Ozar walks through a scenario in which a calendar table (AKA, date dimension) makes a query perform quite a bit worse:

So why did the date table not perform as well as the old-school way?

SQL Server doesn’t understand the relationship between these two tables. It simply doesn’t know that all of the rows in the Users table will match up with rows in the calendar table. It assumes that the calendar table is doing some kind of filtering, especially given our CAST on the date. It doesn’t expect all of the rows to match.

My reaction was pretty much the same as Koen Verbeeck’s in the comments. Put in clearer terms, calendar tables work best when you’re joining a DATE type to a DATE type. Once you introduce times into the mix, the optimizer has to behave differently, not least because you have to do things like CAST() to coerce data types.

Comments closed

Building Queues in the Database

Erik Darling has created a series on queue tables in the database. Part one builds out a table:

A little explanation of the table: this is good for a reusable, ordered queue, that’ll give us some feedback on how long things took. It could also be used is there were a pending element, but we’d probably wanna change the indexing so that we could find either the last start time, or the last end time efficiently.

The thing that probably needs the most explanation here is the indexing and constraints. Because I know you, and you’re staring at the way I have my primary key, and you’re getting an itch and a twitch. That’s okay, it’s you’ve been conditioned for years to put the most selective column first. Unfortunately, that won’t work here.

Part two takes us through querying the queue:

For the table, our main priorities were indexing to make sure we can find work easily, and not allowing duplicate items.

For our worker proc, the main goals are going to be

– Looping until we run out of work
– Finding and reserving work with minimal locking
– Making that process atomic

Thankfully, this is a lot easier than it sounds. The table design does a lot of the work for us.

Erik’s design is not one I typically reach for, though my constraints are a bit different from his—typically, I’m using queue tables to run on periodic schedules and grab batches of records which finish processing before the next timed batch begins, and processes are idempotent, so if a queued item re-runs on occasion, it’s okay. But this is a really good technique if you need a more robust solution.

Comments closed

Delta Lake Schema Enforcement

Burak Yavuz, et al, explain the concept of schema enforcement with Databricks Delta Lake:

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks to see whether each column in data inserted into the table is on its list of expected columns (in other words, whether each one has a “reservation”), and rejects any writes with columns that aren’t on the list.

Something something “relational database” something something. They also walk us through some examples in a Databricks notebook, so check that out.

Comments closed

Data Layout in R with cdata

John Mount takes us through a few sample problems and how to reshape data with cdata:

This may seem like a lot of steps, but it is only because we are taking the problems very slowly. The important point is that we want to minimize additional problem solving when applying the cdata methodology. Usually when you need to transform data you are in the middle of some other more important task, so you want to delegate the details of how the layout transform is implemented. With cdata the user is not asked to perform additional puzzle solving to guess a sequence of operators that may implement the desired data layout transform. The cdata solution pattern is always the same, which can help in mastering it.

With cdata, record layout transforms are simple R objects with detailed print() methods- so they are convenient to alter, save, and re-use later. The record layout transform also documents the expected columns and constants of the incoming data.

Check it out.

Comments closed

Data Modeling: Not Everything Is VARCHAR(8000)

Bert Wagner gives us several reasons why VARCHAR(8000) can be a bad fit for data types:

When first learning SQL Server, I had the brilliant idea of defining every column on my table as VARCHAR(8000). After all, the VARCHAR datatype only consumes as much storage space as the string itself (plus two bytes), so even if I defined the max length of a column as 8000, I wouldn’t be wasting storage space if my data was smaller than that.

My development life would be easier too since I wouldn’t have to think about the kind of data I was storing in my columns; I could define everything as VARCHAR(8000) once and never have to go back to make any changes. Brilliant!

While I was correct about not wasting storage space, it turns out the idea of making every column VARCHAR(8000) is a terrible idea.

Click through to learn why.

Comments closed

Flattening Dimensional Models

Reza Rad explains why it makes sense to build flat dimensional models, particularly for Power BI:

The article that I wrote earlier this week about the shared dimension had a lot of interest, and I’m glad it helped many of you. So I thought better to write about the basics of modeling even more. In this article, I will be focusing on a scenario that you have all faced, however, took different approaches. Is it good to have too many dimension tables? can you combine some of those tables together to build one flatten dimension table? how much should you flatten it? should you end up with one huge table including everything? In this article, I’m answering all of these questions and explaining the scenarios of combining dimensions, as usual, I explain the model in Power BI. However, the concepts are applicable to any other tools. If you like to learn more about Power BI; read Power BI book from Rookie to Rock Star.

Given how closely the ideal Power BI data model matches the Kimball model, Reza’s advice makes perfect sense.

Comments closed

Building Data Dictionaries

Caitlin Hudon shares some great advice on building data dictionaries:

The best defense I’ve found against relying on an oral history is creating a written one.

Enter the data dictionary. A data dictionary is a “centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format”, and provides us with a framework to store and share all of the institutional knowledge we have about our data.

As part of my role as a lead data scientist as a start-up, building a data dictionary was one of the first tasks I took on (started during my first week on the job). Learning about our data is a crucial part of onboarding for data-focused roles, and documenting that journey in the form of a data dictionary provides a useful data asset for the company (which helps to preserve institutional knowledge) and simultaneously provides a good resource for analyzing the data.

Read the whole thing.

Comments closed