Press "Enter" to skip to content

Category: Data Modeling

Computed Columns in Snowflake

Kevin Wilkie does the math:

Sometimes to make our lives easier, we, as database engineers, can create a table that automatically tells us the answer as we need it – or at least how we tell it we want it. In SQL Server, we create what is called “Computed Columns.”

Read on to see how to create one of these in Snowflake.

Comments closed

DirectQuery Data Modeling

Jason Cockington share some advice:

From my experience, most people who have reports built on a DirectQuery connection into their data source did so because of a lack of understanding of what the DirectQuery connection was designed to achieve.  For the vast majority of reports, Import mode is the best solution for working with data in Power BI.  DirectQuery should really only ever be applied when you are trying to solve one of the following challenges.

  1. Real-time Data – you need to see the latest available data from the source
  2. Huge Datasets – you have many billions of rows of data (more than 10Gb) so you just can’t import it into Power BI
  3. Regulatory Compliance – the data must stay in the source for data security/privacy reasons

Click through for more information.

Comments closed

PayPal’s Data Contract Template Open Sourced

Jean-Georges Perrin makes an announcement:

A data contract is a binding agreement between the consumers and producers of data. You can see it as a data schema on steroids or data schema++. The goal of the contract is to set expectations between the parties. It can be built as fit-for-purpose where the consumers and producer agree on what it should contain or can serve as a brochure for any consumer willing to access the data offered by this (data) product.

Click through to learn more about data contracts and then check out the contract template itself on PayPal’s GitHub repo.

Comments closed

Route Planning in Postgres

Mark Litwintschik plans a journey:

I recently came across a transit route feed aggregator called Transitland. They list feeds from 2,500 operators in 55+ countries around the world. Among these feeds is one for FlixBus, a 12-year-old coach service provider. Below is a route map of their European destinations.

In this post, I’ll import their feed into PostgreSQL, build visualisations of their routes and plan a bus trip from Vienna to Oslo.

Read on for the process.

Comments closed

Recommendations for Dedicated SQL Pool Data Modeling

Bhaskar Sharma has some advice:

In this article, I will discuss how to physically model an Azure Synapse Analytics data warehouse while migrating from an existing on-premises MPP (Massive Parallel Processing) data warehouse solution like Teradata and Netezza. The approach and methodologies discussed in this article are purely based on the knowledge and insight I have gained while migrating these data warehouses to Azure Synapse dedicated SQL pool. 

Dedicated SQL pools are close enough to regular SQL Server that we make a lot of assumptions about it, some of which may be wrong.

Comments closed

Degenerate Dimensions and Power BI DirectQuery

Chris Webb provides a warning:

A few weeks ago my colleague Dany Hoter wrote a post on the Azure Data Explorer blog about how using columns on fact tables as dimensions in DirectQuery mode can lead to errors in Power BI. You can read it here:

https://techcommunity.microsoft.com/t5/azure-data-explorer-blog/to-star-or-not-to-star-more-about-data-modeling-in-power-bi/ba-p/3689933

In the post he mentioned that he could reproduce the same behaviour in SQL Server, so I thought it would be good to show an example of this to raise awareness of the issue because I think it’s one that anyone using DirectQuery mode on any data source is likely to run into.

Read on to understand what might lead to (expected) errors and what you can do about it.

Comments closed

Bitemporal Modeling and Running Totals

John Mount solves a running total problem in Python:

An example of this is wanting to know any many reservations for a San Francisco Symphony concert scheduled for December 4th 2022 are known to have been made by October 22nd 2022. This could be used as part of an attendance demand model that is evaluated on October 22nd 2022. The “fifty-cent word” for this is “bitemporal” modeling or data.

As I read through the solution, my initial thought is that, if the data is in a relational database, a running total operation SUM(reservation_count) OVER (PARTITION BY target_date ORDER BY action_date ROWS BETWEEN UNBOUNDED PRECEDING TO CURRENT ROW) would form the basis of a solution. Still, this is an interesting exercise in translating a SQL operation into equivalent Python and just how much we get to take for granted.

Comments closed

Tuning a Range Query

Grant Fritchey gets forum-sniped:

Recently I was looking through DBA.StackExchange when I saw a pretty simple question that I decided to answer. I went off, set up a test database and some test tables and quickly wrote a query to answer the question. I got it all formatted pretty and was on my way to post it when I saw that another answer was already there.

Yeah. Identical to mine. Almost line for line.

Well, nuts.

I know. I’ll write a blog post.

In thinking about the problem, the thing that caught my mind was Grant’s comment about poor design. This got me thinking about one of my favorite topics: orthogonal design for relational excellence. The idea of a BETWEEN table of [ MinValue : MaxValue ] is the first thing people think of but is also the worst because you have two big problems: gaps and overlap.

The second solution is to use MinValue and calculate MaxValue (if we actually need it) as LEAD(MinValue) OVER (ORDER BY MinValue) - e, where e represents the smallest reasonable increment we’d need. Queries would find, for each Value in the main table, the largest MinValue below Value. That removes gaps and overlap but might be a performance concern as the main table’s data size grows.

The big-brain solution, which generally works best when you have a discrete number of options, could be a tally table. In Grant’s example, we see values from 1 to 1000, with a rank for each. If it’s really as simple as that, we would create a new lookup table with Value + RankDesc and simply join the main table’s Value to the lookup table’s Value to get the appropriate RankDesc. Yeah, you have 1000 rows instead of 3 but queries are trivial at that point. The downside is that this approach doesn’t work for continuous variables (e.g., give me the exact amount of your household income for the prior tax year) and the utility of this solution probably breaks down once you get past tens of thousands of rows.

In the case of a continuous variable or an enormous discrete variable, we have the simplest option of all: ignore something. If you care about the range, use the table from the second solution and use that ID on the main table. If you care about the value but not the range, just have the value and no lookup table.

Comments closed

Have One Data Model per Business Area

James McGillivray offers us an important piece of advice:

I cannot stress this enough. If people are consuming your data in multiple places, the data needs to come from the same data model. That can be an Enterprise Data Warehouse, a Data Mart, a Power BI Model, or any other data source, but at some point you need to be able to track the data back to a single place. If you don’t do this, you will spend THE REST OF YOUR DAYS explaining the differences between the data models to business and customers, and reconciling the differences over and over again.

Read on to learn why this is so important.

Comments closed

Data Modeling with Spark–Breaking Data into Multiple Tables

Landon Robinson tokenizes data:

The result of joining the 2 DataFrames – pets and colorsdisplays the nicknamecolor and age of the pets. We went from a normalized dataset where common & recurring values weresubstituted for numeric representation s— to a slightly more denormalized dataset. Let’s keep going!

This is an interesting example of a useful technique but I strongly disagree with Landon about whether this is normalization. Translating a natural key to a surrogate key is not normalizing the data and translating a surrogate key to a natural key (which is what the example does) is not denormalizing the data. A really simplified explanation of the process is that normalization is ensuring that like things are grouped together, not that we build key-value lookup tables for everything. That’s why Landon’s “denormalized” example is just as normalized as the original: each of those attributes describes a unique thing about the pet identified by its (unique) nickname. This would be different if we included things like owner’s name (which could still be on that table), owner’s age, owner’s height, a list of visits to the vet for each pet, when the veterinarians received their licenses, etc.

Comments closed