When first learning SQL Server, I had the brilliant idea of defining every column on my table as VARCHAR(8000). After all, the VARCHAR datatype only consumes as much storage space as the string itself (plus two bytes), so even if I defined the max length of a column as 8000, I wouldn’t be wasting storage space if my data was smaller than that.
My development life would be easier too since I wouldn’t have to think about the kind of data I was storing in my columns; I could define everything as VARCHAR(8000) once and never have to go back to make any changes. Brilliant!
While I was correct about not wasting storage space, it turns out the idea of making every column VARCHAR(8000) is a terrible idea.
Click through to learn why.
The article that I wrote earlier this week about the shared dimension had a lot of interest, and I’m glad it helped many of you. So I thought better to write about the basics of modeling even more. In this article, I will be focusing on a scenario that you have all faced, however, took different approaches. Is it good to have too many dimension tables? can you combine some of those tables together to build one flatten dimension table? how much should you flatten it? should you end up with one huge table including everything? In this article, I’m answering all of these questions and explaining the scenarios of combining dimensions, as usual, I explain the model in Power BI. However, the concepts are applicable to any other tools. If you like to learn more about Power BI; read Power BI book from Rookie to Rock Star.
Given how closely the ideal Power BI data model matches the Kimball model, Reza’s advice makes perfect sense.
The best defense I’ve found against relying on an oral history is creating a written one.
Enter the data dictionary. A data dictionary is a “centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format”, and provides us with a framework to store and share all of the institutional knowledge we have about our data.
As part of my role as a lead data scientist as a start-up, building a data dictionary was one of the first tasks I took on (started during my first week on the job). Learning about our data is a crucial part of onboarding for data-focused roles, and documenting that journey in the form of a data dictionary provides a useful data asset for the company (which helps to preserve institutional knowledge) and simultaneously provides a good resource for analyzing the data.
Read the whole thing.
Before we continue, let me ask you one question, have you heard about Silverlight?
Or in other words, and with a kind of evil voice “DID YOU EVER INSTALLED SILVERLIGHT ON A PRODUCTION SERVER”?.
If you have worked with MDS oh yes, you did! At least in order to check if everything is configured/upgraded correctly and nothing is broke, I will do a wild guess and claim that you did! So am I … :s
Because in order to make things work in MDS correctly, one needs this old, for a very long time deprecated framework, that is supported only in deprecated browser that is called Internet Explorer v.11, and that pain-in-the-neck framework is called Silverlight and if you dare to work with any SQL Server versions before SQL Server 2019, the picture on the left will appear in front of you at the moment you will try to explore the master data in the MDS Explorer – ensuring that unless you install a totally abandoned (and obviously unnecessary product, that represents another risk on your server) is a necessary thing. That is alone is the reason for some people would use some development VM in order to work with MDS, but that is not a good excuse to include that product in SQL Server 2016 or in SQL Server 2017.
The interface still has problems, as Niko points out, but hopefully this is the first step and not the last one.
The July 2018 preview of Power BI Desktop delivers two killer preview features that solidify the Power BI position as the best data modeling tool on the market. First, Microsoft relaxes the Power BI relationship limitations by letting you create M:M joins between two tables. Second, you can now create a composite (hybrid) data model with different storage modes, such as from an SQL Server database configured for DirectQuery and from an imported text file. Without reiterating the documentation, here are some important notes to keep in mind based on my pre-release testing.
But read the whole thing, which includes some limitations around hybrid data models.
How did we do this? We started with a table that took each word, added two spaces at the beginning and a |, followed by two subsequent spaces, at the end. This allowed us to map the frequency of each three-letter combination in a collection of words. Any language is made up of common combinations of characters with a few wild exceptions. For words to look right, they must follow this distribution. This distribution will change in various parts of a word, so you need all this information.
So what would happen if, instead of feeding the name of countries into the batch, we do the names of people?
My favorite name from the list was Kuwatian Samoa.
Closure tables are plain ordinary relational tables that are designed to work easily with relational operations. It is true that useful extensions are provided for SQL Server to deal with hierarchies. The HIERARCHYID data type and the common language runtime (CLR) SqlHierarchyId class are provided to support the Path Enumeration method of representing hierarchies and are intended to make tree structures represented by self-referencing tables more efficient, but they are likely to be appropriate for some but not all the practical real-life hierarchies or directories. As well as path enumerations, there are also the well-known design patterns of Nested Sets and Adjacency Lists. In this article, we’ll concentrate on closure tables.
A directed acyclic graph (DAG) is a more general version of a closure table. You can use a closure table for a tree structure where there is only one trunk, because a branch or leaf can only have one trunk. We just have a table that has the nodes (e.g. staff member or directory ‘folder’) and edges (the relationships). We are representing an acyclic (no loops allowed) connected graph where the edges must all be unique, and where there is reflexive closure. (each node has an edge pointing to itself)
Take the time to read this one carefully, as I think this model is applicable much more often than it’d appear at first blush.
What would happen if a parent table was referenced by hundreds of child tables, such as for a date dimension table? Deleting or updating a row in the parent table would create a query plan with at least one join per incoming foreign key reference. Creating a query plan for that statement is equivalent to creating a query plan for a query containing hundreds or even thousands of joins. That query plan could take a long time to compile or could even time out. For example, I created a simple query with 2500 joins and it still hadn’t finished compiling after 15 minutes. That’s why I assume a table is limited to 253 incoming foreign key references in SQL Server 2014.
That restriction won’t be hit often but could be pretty inconvenient to work around. The referential integrity operator introduced with compatibility level 130 raises the limit from 253 to 10000. All of the joins are collapsed into a single operator which can reduce compile time and avoid errors.
There’s some really good information in this post, and Joe has mixed feelings on the concept.
The data lake introduces a new data analysis paradigm shift:
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
This allows you to avoid a lot of up-front work before you are able to analyze data. With the old way, you have to know the questions to ask. The new way supports situations when you don’t know the questions to ask.
This solves the two biggest reasons why many EDW projects fail:
Too much time spent modeling when you don’t know all of the questions your data needs to answer
Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value
There are some good details here. My addition would be to reiterate the importance of a good data governance policy.
Big data comes in a variety of shapes. The Extract-Transform-Load (ETL) workflows are more or less stripe-shaped (left panel in the figure above) and produce an output of a similar size to the input. Reporting workflows are funnel-shaped (middle panel in the figure above) and progressively reduce the data size by filtering and aggregating.
However, a wide class of problems in analytics, relevance, and graph processing have a rather curious shape of widening in the middle before slimming down (right panel in the figure above). It gets worse before it gets better.
In this article, we take a deeper dive into this exploding middle shape: understanding why it happens, why it’s a problem, and what can we do about it. We share our experiences of real-life workflows from a spectrum of fields, including Analytics (A/B experimentation), Relevance (user-item feature scoring), and Graph (second degree network/friends-of-friends).
The examples relate directly to Hadoop, but are applicable in other data platforms as well.