Press "Enter" to skip to content

Category: Data Modeling

Unicode and Data Length

Kevin Wilkie lays out an argument:

If you truly need the UNICODE characters in your data, go ahead and use them! If not, though, please make your DBA happy by not using them. Since UNICODE characters take up twice the amount of space that the ASCII versions do, your DBAs will recommend using the ASCII versions if you are not going to be using any UNICODE characters.

Read on for the justification. But I’m still NVARCHAR (Almost) Everywhere.
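To see the size difference Kevin is describing, here's a minimal T-SQL sketch (variable names are my own) comparing DATALENGTH for the same text stored as VARCHAR and NVARCHAR:

-- Same eleven characters, twice the bytes under NVARCHAR
DECLARE @ascii VARCHAR(50)  = 'Curated SQL';
DECLARE @uni   NVARCHAR(50) = N'Curated SQL';

SELECT
    DATALENGTH(@ascii) AS VarcharBytes,   -- 11 bytes (1 byte per character)
    DATALENGTH(@uni)   AS NvarcharBytes;  -- 22 bytes (2 bytes per character under a UTF-16 collation)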


Database Schema Types

Steve Jones explains schema types:

OLTP/Relational

The type of schema that many of us work with is the standard OLTP or relational model. We have lots of transaction tables, most of which should have a PK, though only some of them do. The schema expands to meet different needs and can have lots of entities.

It may just be the time of morning, but "Galaxy schema" sounds dumb specifically because the Kimball style of star schema implicitly includes what the galaxy schema shows. Dimensions are conformed, which means they apply across facts, which implies that there may be multiple fact tables in the schema design. This means that galaxy schemas are necessarily star schemas. For the sake of education, we tend to focus on one fact table, but a star schema with two fact tables is still a star schema.

Anyhow, that’s my minor rant of the day. It’s not Steve’s fault somebody misunderstood the concept of star schemas and began promulgating this unnecessary term.
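To make the point concrete, here's a minimal sketch (the table names are mine, purely illustrative) of two fact tables sharing one conformed dimension, which is all the "galaxy" label really adds to a star schema:

-- One conformed dimension shared by two fact tables: still a star schema
CREATE TABLE dbo.DimDate
(
    DateKey      INT  NOT NULL PRIMARY KEY,
    CalendarDate DATE NOT NULL
);

CREATE TABLE dbo.FactSales
(
    DateKey     INT            NOT NULL REFERENCES dbo.DimDate(DateKey),
    SalesAmount DECIMAL(18, 2) NOT NULL
);

CREATE TABLE dbo.FactInventory
(
    DateKey        INT NOT NULL REFERENCES dbo.DimDate(DateKey),
    QuantityOnHand INT NOT NULL
);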


Swap an Identity Column with a Sequence

Greg Larsen looks at sequences:

Microsoft introduced the sequence number objects starting with SQL Server 2012. A sequence object generates sequence numbers based on starting and increment values, similar to an identity column, but it has additional features. Over time, you might find that the additional benefits of a sequence number have you wanting to replace an identity column with a sequence number. This article demonstrates two options for replacing an identity column with a sequence number.

When it came out, I thought I was going to use sequences quite often. In practice, I've used them a few times, but IDENTITY is so much easier to type and I've rarely needed sequence generators. That said, the times I have needed them, I've definitely appreciated their existence.
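As a rough sketch of the end state (object names here are hypothetical, not Greg's), a sequence bound as a column default behaves much like an identity column but can also be used outside the table:

-- A sequence that mimics IDENTITY(1,1)
CREATE SEQUENCE dbo.OrderID_Seq AS INT START WITH 1 INCREMENT BY 1;

-- Bind it as the default for the surrogate key
CREATE TABLE dbo.Orders
(
    OrderID   INT  NOT NULL
        CONSTRAINT DF_Orders_OrderID DEFAULT (NEXT VALUE FOR dbo.OrderID_Seq)
        CONSTRAINT PK_Orders PRIMARY KEY,
    OrderDate DATE NOT NULL
);

-- Unlike IDENTITY, you can grab values without inserting a row
SELECT NEXT VALUE FOR dbo.OrderID_Seq AS NextOrderID;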


The Importance of Data Dictionaries

John Morehouse takes us through data dictionaries:

Data professionals—whether they’re database administrators (DBAs), developers, or data scientists—work in a wide and varied landscape usually in flux and filled with challenges. These challenges could range from changing business requirements to keeping up with the sheer velocity at which technology evolves.

It’s also critical for these professionals to understand their organization’s data and how it applies to a given application or business unit. Better outcomes usually come from employing data dictionaries throughout the organization. Through many years of experience in IT professions, I’ve seen the utilization of data dictionaries range from “not at all” to “I’m documenting every possible data attribute known to humankind.” In my experience, data-related projects with data dictionaries as part of the process are far more likely to be successful than projects without them (even sparsely populated data dictionaries are more useful than nothing at all). Trust me on this.

Click through to understand why you should trust John on this one.


Sixth Normal Form to Avoid NULLs

I have a response to an article:

I linked to this on Curated SQL, where I’d started to write out a response. After about four paragraphs of that, I decided that maybe it’d make sense to turn this into a full blog post rather than a mini-commentary, as I think it deserves the lengthier treatment. I’m going to assume that you’ve read Aaron’s post first, and it’s a well-done apologia in support of using NULLs pragmatically. I’ll start my response with a point of agreement, but then move to differences and alternatives before laying out where I see additional common ground between Aaron’s and my thoughts on the matter.

One explicit assumption in here is that you don’t really have a large number of nullable (or NULLable, as long-form blogging me wants to write) columns on a given table. Using 6NF-style tables for nullable attributes is a lot less tenable when you have 15 or 20 distinct nullable columns on a table, but at that point I have to ask: is your data model actually correct if you have that many missing attributes?
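For anyone who hasn't seen the pattern, a minimal sketch (table and column names are purely illustrative) of moving a nullable attribute into its own 6NF-style table looks like this:

-- Instead of a nullable TerminationDate column on the base table...
CREATE TABLE dbo.Employee
(
    EmployeeID   INT           NOT NULL PRIMARY KEY,
    EmployeeName NVARCHAR(100) NOT NULL
);

-- ...the optional attribute gets its own table; the absence of a row means "no value"
CREATE TABLE dbo.EmployeeTerminationDate
(
    EmployeeID      INT  NOT NULL PRIMARY KEY
        REFERENCES dbo.Employee(EmployeeID),
    TerminationDate DATE NOT NULL
);

-- A LEFT JOIN reintroduces NULL only at read time, when a query actually needs the attribute
SELECT e.EmployeeID, e.EmployeeName, t.TerminationDate
FROM dbo.Employee AS e
LEFT JOIN dbo.EmployeeTerminationDate AS t
    ON t.EmployeeID = e.EmployeeID;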


Against Reporting Tables

Erik Darling doesn’t like reporting tables:

I’ve seen a lot of crazy ways for people to run reports on live data, but one of the worst ideas is creating a reporting table based on the set of data a user wants, and then letting them query that data.

As usual, Erik says something I want to disagree with, and then I read the post and don’t really disagree with him—or if I do, he’s already laid out the “Yes, I understand X” exception. I’ve used reporting tables to good effect, but the important thing is that they’re general-purpose and designed into the application, not specific to a single user.


Database Deployment with External References

Sebastian Meine and Liz Baron try to untangle the Gordian knot:

Most database developers are dealing with databases that contain external references. Even if the database code is in source control, these external references can make it very difficult to deploy to new environments. In these multi-database environments, tools like SQLCompare and SQL Change Automation do not automatically resolve object-order across databases, resulting in errors during deployment.

One way to tackle this, which works especially well for CI pipelines, is to create facades for all externally referenced databases. A facade in this context is a database with the expected name, with the expected objects, but those objects are hollowed out and do not contain any dependencies. You can compare this concept to an interface in an object-oriented language. Once you have these facades, they can be used in a pre-deployment step, simplifying the rest of the deployment by effectively removing object-order dependencies with these external databases.

This is one of the most painful parts of converting existing databases into model-driven database development, especially once you start having to deal with cross-dependencies and rapidly changing databases.
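As a rough illustration of the facade idea (database and object names here are made up, not from Sebastian and Liz's post), the stub only needs the externally referenced objects' names and shapes, not their logic:

-- Facade standing in for an external database referenced as ExternalDB.dbo.Customer
CREATE DATABASE ExternalDB;
GO
USE ExternalDB;
GO
-- Hollowed-out table: same name and columns, no data, no dependencies
CREATE TABLE dbo.Customer
(
    CustomerID   INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(200) NOT NULL
);
GO
-- A referenced procedure can be stubbed the same way
CREATE PROCEDURE dbo.GetCustomer
    @CustomerID INT
AS
    SET NOCOUNT ON;  -- no real logic; it exists only so cross-database references resolve
GO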


Create and Update Columns on Tables

Kenneth Fisher adds four columns to a table:

One of the easiest ways to collect information about table activity is to add a series of audit columns to the table. The most common set consists of four columns:

– When was the row created?

– Who created it?

– When was the row last updated?

– Who last updated it?

Read on to see how to create these, but definitely read the comments. Joe Celko has a good reason to avoid this style, and there’s another good reason as well: if you update the columns and the updated values are longer than what they replace, you can end up generating a lot of page splits as full pages get broken up to handle those in-page updates.

If you do need something like this, I’m growing more and more fond of an activity log table which is append-only and tracks these operations separately. That removes the page splits, allows you to deal with deletions, and gives a better idea of what happened over time for a third party. It does come at a cost if you frequently need to show the create and last updated dates (or users), but my experience has been that a vast majority of the time, we include those for internal purposes, not to display to users.
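A minimal sketch of that append-only approach (table and column names are my own invention) might look like this:

-- Append-only log: one narrow row per change, so existing rows never widen in place
CREATE TABLE dbo.TableActivityLog
(
    LogID        BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    TableName    SYSNAME      NOT NULL,
    RowKey       INT          NOT NULL,     -- key of the affected row
    ActivityType CHAR(1)      NOT NULL      -- 'I' = insert, 'U' = update, 'D' = delete
        CHECK (ActivityType IN ('I', 'U', 'D')),
    ActivityDate DATETIME2(3) NOT NULL
        CONSTRAINT DF_TableActivityLog_Date DEFAULT (SYSUTCDATETIME()),
    ActivityUser SYSNAME      NOT NULL
        CONSTRAINT DF_TableActivityLog_User DEFAULT (ORIGINAL_LOGIN())
);

-- The creation date/user is the earliest row for a given key; the last update is the latest
INSERT INTO dbo.TableActivityLog (TableName, RowKey, ActivityType)
VALUES (N'Orders', 42, 'I');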


Type 1 SCDs in Delta Lake

Chris Williams starts a series on slowly changing dimensions in a Delta Lake:

Anyone who has contributed towards a Data Warehouse or a dimensional model in Power BI will know the distinction made between the time-series metrics of a Fact Table and the categorised attributes of a Dimension Table. These dimensions are also affected by the passage of time and require revised descriptions periodically, which is why they are known as Slowly Changing Dimensions (SCDs). See The Data Warehouse Toolkit – Kimball & Ross for more information.

Here is where the Delta Lake comes in. Using its many features, such as support for ACID transactions (Atomicity, Consistency, Isolation, and Durability) and schema enforcement, we can create the same durable SCDs. In the past, this may have required a series of complicated SQL statements. I will now discuss a few of the most common SCDs and show how they can be easily achieved using a few Databricks notebooks, which are available from my GitHub repo so you can download them and have a go:

https://github.com/cwilliams87/Blog-SCDs

Check out the repo, but be sure to read the whole post.
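For a flavor of what the Type 1 notebook covers, here's a hedged Spark SQL sketch with made-up table and column names (not necessarily what Chris uses): a Type 1 change simply overwrites the old attribute value via MERGE, keeping no history.

-- Type 1 SCD: overwrite changed attributes in place, insert brand-new members
MERGE INTO dim_customer AS target
USING staged_customer_updates AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED AND target.customer_city <> source.customer_city THEN
    UPDATE SET target.customer_city = source.customer_city
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, customer_city)
    VALUES (source.customer_id, source.customer_name, source.customer_city);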


Compressing JSON in SQL Server

Randolph West has a recommendation:

I’ll also pre-emptively note that if this table was simply an append-only archive table, the row size would not really matter. Unfortunately, this table participates in thousands of transactions per day, and as the original developers used Entity Framework and didn’t think much of using NVARCHAR(MAX), the entire row is coming over the wire into the application each time it is queried.

As I’ve written previously about this kind of thing, this is not a good design pattern. Using the VARBINARY(MAX) data type with COMPRESS in the INSERT/UPDATE queries — and DECOMPRESS in the SELECT queries — is a much better design pattern and dramatically reduces the amount of data transferred over the network. Additionally, SQL Server needs significantly less space to store, maintain, and back up this compressed data.

Read on to see the likely benefits from doing this. I’d say that if your main purpose of storing the JSON is just to pass a blob back and forth, then yes, do compress. If you’re frequently shredding these sorts of large documents within SQL Server…well, probably time for a better data model.
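If you want to see the mechanics, here's a minimal sketch of the pattern (table and column names are illustrative, not Randolph's):

-- Store GZip-compressed JSON in VARBINARY(MAX) instead of raw NVARCHAR(MAX)
CREATE TABLE dbo.OrderPayload
(
    OrderID           INT            NOT NULL PRIMARY KEY,
    PayloadCompressed VARBINARY(MAX) NOT NULL
);

-- COMPRESS on the way in...
INSERT INTO dbo.OrderPayload (OrderID, PayloadCompressed)
VALUES (1, COMPRESS(N'{"orderId": 1, "lines": [{"sku": "ABC", "qty": 2}]}'));

-- ...and DECOMPRESS (cast back to NVARCHAR) on the way out
SELECT
    OrderID,
    CAST(DECOMPRESS(PayloadCompressed) AS NVARCHAR(MAX)) AS PayloadJson
FROM dbo.OrderPayload;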
