Press "Enter" to skip to content

Category: Data Modeling

Generating Realistic-Looking Data With Markov Chains

Phil Factor shows how to use Markov chain generation in T-SQL to generate realistic-looking country names:

How did we do this? We started with a table that took each word, added two spaces at the beginning and a |, followed by two subsequent spaces, at the end. This allowed us to map the frequency of each three-letter combination in a collection of words. Any language is made up of common combinations of characters with a few wild exceptions. For words to look right, they must follow this distribution. This distribution will change in various parts of a word, so you need all this information.

So what would happen if, instead of feeding the name of countries into the batch, we do the names of people?

My favorite name from the list was Kuwatian Samoa.

Comments closed

Closure Tables: Graph Data In Relational Form

Phil Factor shows how to use the concept of closure tables to represent graph-style data in a relational database:

Closure tables are plain ordinary relational tables that are designed to work easily with relational operations. It is true that useful extensions are provided for SQL Server to deal with hierarchies. The HIERARCHYID data type and the common language runtime (CLR) SqlHierarchyId class are provided to support the Path Enumeration method of representing hierarchies and are intended to make tree structures represented by self-referencing tables more efficient, but they are likely to be appropriate for some but not all the practical real-life hierarchies or directories. As well as path enumerations, there are also the well-known design patterns of Nested Sets and Adjacency Lists. In this article, we’ll concentrate on closure tables.

A directed acyclic graph (DAG) is a more general version of a closure table. You can use a closure table for a tree structure where there is only one trunk, because a branch or leaf can only have one trunk. We just have a table that has the nodes (e.g. staff member or directory ‘folder’) and edges (the relationships). We are representing an acyclic (no loops allowed) connected graph where the edges must all be unique, and where there is reflexive closure. (each node has an edge pointing to itself)

Take the time to read this one carefully, as I think this model is applicable much more often than it’d appear at first blush.

Comments closed

SQL Server’s Referential Integrity Operator

Joe Obbish explains the purpose of the referential integrity operator in SQL Server 2016:

What would happen if a parent table was referenced by hundreds of child tables, such as for a date dimension table? Deleting or updating a row in the parent table would create a query plan with at least one join per incoming foreign key reference. Creating a query plan for that statement is equivalent to creating a query plan for a query containing hundreds or even thousands of joins. That query plan could take a long time to compile or could even time out. For example, I created a simple query with 2500 joins and it still hadn’t finished compiling after 15 minutes. That’s why I assume a table is limited to 253 incoming foreign key references in SQL Server 2014.

That restriction won’t be hit often but could be pretty inconvenient to work around. The referential integrity operator introduced with compatibility level 130 raises the limit from 253 to 10000. All of the joins are collapsed into a single operator which can reduce compile time and avoid errors.

There’s some really good information in this post, and Joe has mixed feelings on the concept.

Comments closed

Thinking About The Data Lake

James Serra explains at a high level what the data lake metaphor is and how it works:

The data lake introduces a new data analysis paradigm shift:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data.  With the old way, you have to know the questions to ask.  The new way supports situations when you don’t know the questions to ask.

This solves the two biggest reasons why many EDW projects fail:

  • Too much time spent modeling when you don’t know all of the questions your data needs to answer

  • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

There are some good details here.  My addition would be to reiterate the importance of a good data governance policy.

Comments closed

Re-Shaping Data Flows

Maneesh Varshney explains some methods to trim the fat out of analytical data flows:

Big data comes in a variety of shapes. The Extract-Transform-Load (ETL) workflows are more or less stripe-shaped (left panel in the figure above) and produce an output of a similar size to the input. Reporting workflows are funnel-shaped (middle panel in the figure above) and progressively reduce the data size by filtering and aggregating.

However, a wide class of problems in analytics, relevance, and graph processing have a rather curious shape of widening in the middle before slimming down (right panel in the figure above). It gets worse before it gets better.

In this article, we take a deeper dive into this exploding middle shape: understanding why it happens, why it’s a problem, and what can we do about it. We share our experiences of real-life workflows from a spectrum of fields, including Analytics (A/B experimentation), Relevance (user-item feature scoring), and Graph (second degree network/friends-of-friends).

The examples relate directly to Hadoop, but are applicable in other data platforms as well.

Comments closed

Trimming The Fat From Power BI Data Models

Philip Seamark shows how things like getting rid of ID columns can reduce a Power BI data model’s size significantly:

Once saved, the Power BI file size was 289MB!  Is this good for 10 million rows?  It’s certainly better than the 360MB CSV file but not by much.  Certainly not close to the 10:1 compression claimed to be achievable using the SSAS Tabular engine used by Power BI.

I think we can do better than that….

Read on to see the specific optimizations, turning this from a 289 MB data model into a 9 MB data model.

Comments closed

Dimensional Design Tips

Koen Verbeeck provides some helpful hints when designing dimensions in SQL Server Analysis Services Multidimensional models:

Although traditional dimension modeling – as explained by Ralph Kimball – tries to avoid snowflaking, it might help the processing of larger dimensions. For example, suppose you have a large customer dimension with over 10 million members. One attribute is the customer country. Realistically, there should only be a bit over 200 countries, maximum. When SSAS processes the dimension, it sends SELECT DISTINCT commands to SQL Server. Such a query on top of a large dimension might take some time. However, if you would snowflake (aka normalize) the country attribute into another dimension, the SELECT DISTINCT will run much faster. Here, you need to trade-off performance against the simplicity of your design.

There are several good tips here.

Comments closed

Calculated Dimensions

Ginger Grant shows how calculated dimensions can solve the classic role-playing dimension problem in Analysis Services Tabular:

Working with role playing dimensions, which are found when you have say multiple dates in a table and you want to relate them back to a single date table, have always been problematic in SQL Server Analysis Services Tabular. Tabular models only allow one active relationship to a single column at a time. The picture on the left shows how tabular models represent a role playing dimension, and the model on the right is the recommended method for how to model the relationships in Analysis Services Tabular as then users can filter the data on a number of different date tables.

The big downside to this is one has to import the date table into the model multiple times, meaning the same data is imported again and again. At least that was the case until SQL Server 2016 was released. This weeks TSQL topic Fixing Old Problems with Shiny New Toys is really good reason to describe a better way of handling this problem.

Read on for how to implement calculated dimensions.

Comments closed

Parsing Text Fragments

Aaron Bertrand looks at a way of speeding up LIKE %Something% queries and builds a fragment table:

It’s clear that in this specific case – with an address column of nvarchar(60) and a max length of 26 characters – breaking up each address into fragments can bring some relief to otherwise expensive “leading wildcard” searches. The better payoff seems to happen when the search pattern is larger and, as a result, more unique. I’ve also demonstrated why EXISTS is better in scenarios where multiple matches are possible – with a JOIN, you will get redundant output unless you add some “greatest n per group” logic.

Read the whole thing.  If you’re interested in the concept, I recommend reading up on n-grams, like Alan Burstein’s series and this TechNet article on implementing N-Grams in SQL Server.

Comments closed

Testing Nested Sets

Nate Johnson continues his nested set series:

Now the complex.  To validate that our tree is properly structured, the following statements need to be true:

  1. Each node’s Right value is greater than its Left.

  2. More to the point, each node’s Right value is greater than all of its ancestors’ Left values.

  3. Similarly, each node’s Left value is less than all of its descendants’ Left values (and Right values, obviously!)

  4. Leaf nodes have no gaps between Left & Right: Right = Left + 1

  5. Depth is easy to verify because we already wrote the rCTE to calculate it!

  6. And of course, no orphans – all ParentIDs lead to an actual parent node, except of course if they’re NULL (root nodes).

Read on for further explanation of these points.

Comments closed