Press "Enter" to skip to content

Category: Data Modeling

Create and Update Columns on Tables

Kenneth Fisher adds four columns to a table:

One of the easiest ways to collect information about table activity is to add a series of audit columns to the table. The most common set consists of four columns:

– When was the row created?

– Who created it?

– When was the row last updated?

– Who last updated it?

Read on to see how to create these, but definitely read the comments. Joe Celko has a good reason to avoid this style, and there’s another good reason as well: if you update the columns and the updated values are longer than what they replace, you can end up generating a lot of page splits as full pages get broken up to handle those in-page updates.

If you do need something like this, I’m growing more and more fond of an activity log table which is append-only and tracks these operations separately. That removes the page splits, allows you to deal with deletions, and gives a better idea of what happened over time for a third party. It does come at a cost if you frequently need to show the create and last updated dates (or users), but my experience has been that a vast majority of the time, we include those for internal purposes, not to display to users.
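
To make that concrete, here is a minimal sketch of what such an append-only activity log might look like in T-SQL. The table and column names are illustrative only, not from Kenneth's post:

-- Hypothetical append-only activity log; names are illustrative only.
CREATE TABLE dbo.TableActivityLog
(
    ActivityLogID   bigint IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_TableActivityLog PRIMARY KEY CLUSTERED,
    TableName       sysname NOT NULL,        -- which table was touched
    RowKey          nvarchar(128) NOT NULL,  -- key of the affected row
    ActivityType    char(1) NOT NULL         -- I = insert, U = update, D = delete
        CONSTRAINT CK_TableActivityLog_Type CHECK (ActivityType IN ('I', 'U', 'D')),
    ActivityUser    sysname NOT NULL
        CONSTRAINT DF_TableActivityLog_User DEFAULT (ORIGINAL_LOGIN()),
    ActivityTimeUTC datetime2(3) NOT NULL
        CONSTRAINT DF_TableActivityLog_Time DEFAULT (SYSUTCDATETIME())
);

Because rows only ever land at the end of the clustered index and existing rows never change, you avoid the in-place updates (and resulting page splits) that the audit-column approach can cause.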


Type 1 SCDs in Delta Lake

Chris Williams starts a series on slowly changing dimensions in a Delta Lake:

Anyone who has contributed towards a Data Warehouse or a dimensional model in Power BI will know the distinction made between the time-series metrics of a Fact Table and the categorised attributes of a Dimension Table. These dimensions are also affected by the passage of time and require revised descriptions periodically, which is why they are known as Slowly Changing Dimensions (SCDs). See The Data Warehouse Toolkit – Kimball & Ross for more information.

Here is where the Delta Lake comes in. Using its many features, such as support for ACID transactions (Atomicity, Consistency, Isolation and Durability) and schema enforcement, we can create the same durable SCDs. In the past, this may have required a series of complicated SQL statements. I will now discuss a few of the most common SCDs and show how they can be easily achieved using a few Databricks Notebooks, which are available from my GitHub repo so you can download them and have a go:

https://github.com/cwilliams87/Blog-SCDs

Check out the repo, but be sure to read the whole post.
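
As a taste of what Type 1 handling looks like on Delta Lake, here is a minimal sketch using MERGE in Spark SQL. The table and column names are hypothetical and not taken from Chris's notebooks:

-- Hypothetical Type 1 merge: overwrite changed attributes in place and
-- insert rows that do not exist yet. Delta Lake makes the whole merge atomic.
MERGE INTO dim_customer AS tgt
USING stg_customer AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
    UPDATE SET
        customer_name = src.customer_name,
        city          = src.city
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, city)
    VALUES (src.customer_id, src.customer_name, src.city);

Because Delta handles the transaction, a failed merge leaves the dimension untouched rather than half-updated.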


Compressing JSON in SQL Server

Randolph West has a recommendation:

I’ll also pre-emptively note that if this table was simply an append-only archive table, the row size would not really matter. Unfortunately, this table participates in thousands of transactions per day, and as the original developers used Entity Framework and didn’t think much of using NVARCHAR(MAX), the entire row is coming over the wire into the application each time it is queried.

As I’ve written previously about this kind of thing, this is not a good design pattern. Using the VARBINARY(MAX) data type with COMPRESS in the INSERT/UPDATE queries — and DECOMPRESS in the SELECT queries — is a much better design pattern and dramatically reduces the amount of data transferred over the network. Additionally, SQL Server needs significantly less space to store, maintain, and back up this compressed data.

Read on to see the likely benefits from doing this. I’d say that if your main purpose of storing the JSON is just to pass a blob back and forth, then yes, do compress. If you’re frequently shredding these sorts of large documents within SQL Server…well, probably time for a better data model.
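
If you have not seen the pattern before, here is a rough sketch of what it looks like; the table and column names are my own, not Randolph's:

-- Hypothetical table storing GZip-compressed JSON as VARBINARY(MAX).
CREATE TABLE dbo.OrderPayload
(
    OrderID     int            NOT NULL PRIMARY KEY,
    PayloadJson varbinary(max) NOT NULL
);

-- Compress on the way in...
INSERT INTO dbo.OrderPayload (OrderID, PayloadJson)
VALUES (1, COMPRESS(N'{"orderId": 1, "lines": []}'));

-- ...and decompress (plus cast back to NVARCHAR) on the way out.
SELECT o.OrderID,
       CAST(DECOMPRESS(o.PayloadJson) AS nvarchar(max)) AS PayloadJson
FROM dbo.OrderPayload AS o;

COMPRESS and DECOMPRESS use GZip under the covers and have been available since SQL Server 2016.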


Tools for Data Modeling

Rayis Imayev shares information on a few database modeling tools:

In the same way that a picture is worth a thousand words, a visual database schema is better than a database model communicated through a multitude of scripted text objects.

Someone may spend a significant amount of time trying to describe all the database tables' attributes, constraints and relationships between tables with words, while a visual Entity Relationship Diagram (ERD) may only take a few minutes (or seconds) to tell the same story.

There are several tools, different yet similar, available to help you create these visual ERD artifacts. The more visual a tool is, the less effort is required to add new information by typing. The visual drag-and-drop experience prevails; your keyboard-typing skills are only necessary for entering the explicit list of attributes, types, and other details of your data model entities.

Click through for reviews of several tools, ranging from free to expensive.


Comparing Views to Derived Tables and CTEs

Itzik Ben-Gan takes us through views:

As usual when discussing relational theory, we SQL practitioners are often told that the terminology we’re using is wrong. So, in this spirit, right off the bat, I’ll start by saying that when you use the term tables and views, it’s wrong. I’ve learned this from Chris Date.

Recall that a table is SQL’s counterpart to a relation (oversimplifying the discussion around values and variables a bit). A table could be a base table defined as an object in the database, or it could be a table returned by an expression—more specifically, a table expression. That’s similar to the fact that a relation could be one that is returned from a relational expression. A table expression could be a query.

Now, what is a view? It’s a named table expression, much like a CTE is a named table expression. It’s just that, like I said, a view is a reusable named table expression that is created as an object in the database and is accessible to those who have the right permissions. This is all to say, a view is a table. It’s not a base table, but a table nonetheless. So just like saying “a rectangle and a square” or “a whisky and a Lagavulin” would seem strange (unless you had too much Lagavulin!), using “tables and views” is just as improper.

Yeah, if we’re going to push our glasses up the bridges of our noses and get all relational here, we’d have the relvar (which is a name and collection of attributes + data types) and the relation (which you can think of as a relvar at a particular point in time—that’s where we get tuples of data). And as Itzik points out, what the RDBMS world calls a view quite neatly fits the definition of a relvar, as we define a name and collection of attributes + data types.

If all of this could not mean less to you, still read the article for Itzik’s view on views.
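
To make the “named table expression” point concrete, here is a small sketch of the same expression written as a view and as a CTE; the Sales.Orders table is hypothetical:

-- As a view: a named table expression stored as a database object.
CREATE VIEW Sales.RecentOrders
AS
SELECT o.OrderID, o.CustomerID, o.OrderDate
FROM Sales.Orders AS o
WHERE o.OrderDate >= DATEADD(YEAR, -1, CAST(GETDATE() AS date));
GO

-- As a CTE: the same named table expression, scoped to one statement.
WITH RecentOrders AS
(
    SELECT o.OrderID, o.CustomerID, o.OrderDate
    FROM Sales.Orders AS o
    WHERE o.OrderDate >= DATEADD(YEAR, -1, CAST(GETDATE() AS date))
)
SELECT r.CustomerID, COUNT(*) AS OrderCount
FROM RecentOrders AS r
GROUP BY r.CustomerID;

The body of the expression is identical in both cases; the difference is scope and reusability, which is Itzik's point.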


Empty Strings vs NULL

Erik Darling lays out a challenge:

Empty Strings Aren’t Better Than NULLs

Prove me wrong.

Click through for commenters’ perspectives.

I’m not at the C.J. Date level of “NULLs are a mockery of the entire relational system and should be burned to the ground and that ground be salted and that salt be burned, just in case” but I do have my sympathies there.

An empty string has meaning: the value of a given attribute in this tuple is an empty string. NULL has no value and therefore violates first normal form (which, unlike my joking histrionics above, is an argument Date lays out in far too much detail for me to include here).

So what if you don’t know the value of a thing? Then the answer is, don’t store it! If there are certain attributes which may be missing at insertion time and are nonetheless relevant to maintain, use 6th Normal Form for each of those attributes. Then, the existence of a record indicates that it has a value and the lack of existence indicates that there is no value as of this time. This is quite different from NULL, about which we can only say “I don’t know if there is a value at this time, but if you’d like, I can do weird things with some of your queries if you happen to forget about this non-information.”
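
For the curious, here is a minimal sketch of that 6NF-style split, with a purely illustrative optional attribute:

-- Hypothetical 6NF-style design: the optional attribute lives in its own
-- table, so "unknown" is the absence of a row rather than a NULL.
CREATE TABLE dbo.Person
(
    PersonID   int           NOT NULL PRIMARY KEY,
    PersonName nvarchar(100) NOT NULL
);

CREATE TABLE dbo.PersonMiddleName
(
    PersonID   int           NOT NULL PRIMARY KEY
        REFERENCES dbo.Person (PersonID),
    MiddleName nvarchar(100) NOT NULL   -- no NULLs anywhere in the design
);

-- "Which people have a known middle name?" becomes an existence check.
SELECT p.PersonID, p.PersonName, m.MiddleName
FROM dbo.Person AS p
INNER JOIN dbo.PersonMiddleName AS m
    ON m.PersonID = p.PersonID;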


Designing and Managing Large Datasets in Power BI

Paul Turley continues a series on doing Power BI the right way:

I was just talking to a consulting client about the best approach to build a data model and he told me something very interesting about the way they were loading data into Power BI. He said “We don’t use facts and dimensions, we load all of our data into one huge table.” He said that their data model performs well and that it meets their reporting needs. It is a difficult point to argue when something is working at the time, even though the design might not follow the accepted rules. Life is like that, and there are plenty of analogies to make the point that a practice, even a real bad practice, might solve a problem for a period of time and under certain conditions. You can drive a car at excessive speed to get to your destination faster. You might not get caught by the police on that day and you might not crash, but eventually, if you make it a habit, this practice will catch up to you. Data is like that. If you don’t play by the rules, you limit your options. Bending the rules lets you move faster and sometimes with less hassle. But as the project scope expands, and after adding enough data or other complexities to the solution, it will not endure. The data model won’t perform well, won’t load the correct data, or it just won’t be reliable.

This post will explore the realities of best practice design for large data models: some important considerations and trade-off decisions when working with both “big data” and “large data”.

Read on for Paul’s tips.
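
For contrast with the one-huge-table approach, here is a rough, hypothetical sketch of the dimensional shape Paul is advocating, expressed as source tables that Power BI would import and relate in the model:

-- Hypothetical star schema: measures in a fact table, descriptive
-- attributes in dimension tables, instead of one wide flattened table.
CREATE TABLE dbo.DimDate
(
    DateKey      int           NOT NULL PRIMARY KEY,  -- e.g. 20210219
    CalendarDate date          NOT NULL,
    CalendarYear int           NOT NULL,
    MonthName    nvarchar(20)  NOT NULL
);

CREATE TABLE dbo.DimProduct
(
    ProductKey   int           NOT NULL PRIMARY KEY,
    ProductName  nvarchar(100) NOT NULL,
    Category     nvarchar(50)  NOT NULL
);

CREATE TABLE dbo.FactSales
(
    DateKey      int            NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey   int            NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    Quantity     int            NOT NULL,
    SalesAmount  decimal(18, 2) NOT NULL
);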


Data Modeling Essentials in Power BI

Paul Turley continues a series on doing Power BI the right way:

One of the most important lessons I have learned about data modeling over the past 20+ years is that there isn’t one model that fits all business needs. However, there are simple patterns we can follow to model data according to different business needs. Each pattern or schema has a different set of rules. At a very high level, we can think of these schemas in three different categories.

This is the 101 level course, but it’s good to get a refresher on the fundamentals before jumping into the complicated part.


Azure Purview: External Connections and the Starter Kit

Wolfgang Strasser continues a series on Azure Purview. First up is a look at external connections:

During my tests with Azure Purview, I found out that in my demo accounts (I created multiple to test it), the Management menu was missing some items.

Read on to learn why and how you can rectify this. Then, check out Wolfgang’s take on the Starter Kit:

Very often, when I want to test some new services, I am missing the infrastructure and environments I can start up and play with. I am not talking about creating a new Azure Purview account (see my previous blog post – Creating an Azure Purview account) – I am talking about the data infrastructure to analyze, catalog and gain knowledge out of it.

We could start to create such an infrastructure ourselves, BUT: the Purview team created a Starter Kit to quickly build a data estate and configure everything so that you can start with Purview within a few minutes.

Read on for one issue (of the self-inflicted variety) Wolfgang ran into during deployment. But it does look like a great way to get started with Purview and build up a relevant demo environment.


Power BI the Right Way: Separating Data Models and Reports

Paul Turley continues a series on doing Power BI the right way:

Back in the day, when we created BI solutions, reports and data models were separate. If you created a cube or Tabular model with Analysis Services, it was developed with Visual Studio and deployed to a server. Reports could be authored and deployed to the report server, separately. Now, with Power BI Desktop, you author your data model and report in the same development space and then deploy the whole kit and kaboodle to the service at once.

Secretly, Power BI actually separates the report from the data model (called a dataset in the service) and gives them both the same name. This is very convenient for self-service projects because it is quick and easy to make changes to the data model, queries and measures if you need to make a report enhancement. This is all well and good for small, one-developer projects, but what about those larger-scale solutions where the data model and reports are developed by different folks, or when multiple reports are connected to the same data model?

At what point does it make sense to separate the data model and reports into separate files?

Read on and let Paul illuminate.
