Press "Enter" to skip to content

Category: Data Modeling

Tuning a Range Query

Grant Fritchey gets forum-sniped:

Recently I was looking through DBA.StackExchange when I saw a pretty simple question that I decided to answer. I went off, set up a test database and some test tables and quickly wrote a query to answer the question. I got it all formatted pretty and was on my way to post it when I saw that another answer was already there.

Yeah. Identical to mine. Almost line for line.

Well, nuts.

I know. I’ll write a blog post.

In thinking about the problem, the thing that caught my mind was Grant’s comment about poor design. This got me thinking about one of my favorite topics: orthogonal design for relational excellence. The idea of a BETWEEN table of [ MinValue : MaxValue ] is the first thing people think of but is also the worst because you have two big problems: gaps and overlap.

The second solution is to use MinValue and calculate MaxValue (if we actually need it) as LEAD(MinValue) OVER (ORDER BY MinValue) - e, where e represents the smallest reasonable increment we’d need. Queries would find, for each Value in the main table, the largest MinValue below Value. That removes gaps and overlap but might be a performance concern as the main table’s data size grows.

The big-brain solution, which generally works best when you have a discrete number of options, could be a tally table. In Grant’s example, we see values from 1 to 1000, with a rank for each. If it’s really as simple as that, we would create a new lookup table with Value + RankDesc and simply join the main table’s Value to the lookup table’s Value to get the appropriate RankDesc. Yeah, you have 1000 rows instead of 3 but queries are trivial at that point. The downside is that this approach doesn’t work for continuous variables (e.g., give me the exact amount of your household income for the prior tax year) and the utility of this solution probably breaks down once you get past tens of thousands of rows.

In the case of a continuous variable or an enormous discrete variable, we have the simplest option of all: ignore something. If you care about the range, use the table from the second solution and use that ID on the main table. If you care about the value but not the range, just have the value and no lookup table.

Comments closed

Have One Data Model per Business Area

James McGillivray offers us an important piece of advice:

I cannot stress this enough. If people are consuming your data in multiple places, the data needs to come from the same data model. That can be an Enterprise Data Warehouse, a Data Mart, a Power BI Model, or any other data source, but at some point you need to be able to track the data back to a single place. If you don’t do this, you will spend THE REST OF YOUR DAYS explaining the differences between the data models to business and customers, and reconciling the differences over and over again.

Read on to learn why this is so important.

Comments closed

Data Modeling with Spark–Breaking Data into Multiple Tables

Landon Robinson tokenizes data:

The result of joining the 2 DataFrames – pets and colorsdisplays the nicknamecolor and age of the pets. We went from a normalized dataset where common & recurring values weresubstituted for numeric representation s— to a slightly more denormalized dataset. Let’s keep going!

This is an interesting example of a useful technique but I strongly disagree with Landon about whether this is normalization. Translating a natural key to a surrogate key is not normalizing the data and translating a surrogate key to a natural key (which is what the example does) is not denormalizing the data. A really simplified explanation of the process is that normalization is ensuring that like things are grouped together, not that we build key-value lookup tables for everything. That’s why Landon’s “denormalized” example is just as normalized as the original: each of those attributes describes a unique thing about the pet identified by its (unique) nickname. This would be different if we included things like owner’s name (which could still be on that table), owner’s age, owner’s height, a list of visits to the vet for each pet, when the veterinarians received their licenses, etc.

Comments closed

Unicode and Data Length

Kevin Wilkie lays out an argument:

If you truly need the UNICODE characters in your data, go ahead and use them! If not though, please make your DBA happy by not using them. Since UNICODE characters take up twice the amount of space as the ASCII versions do, then your DBAs will recommend to use the ASCII versions if you are not going to be using any UNICODE characters.

Read on for the justification. But I’m still NVARCHAR (Almost) Everywhere.

Comments closed

Database Schema Types

Steve Jones explains schema types:

OLTP/Relational

The type of schema that many of us work with is the standard OLTP or relational model. We have lots of transaction tables, most should have a PK, some of which have PKs. The schema expands to meet different needs and can have lots of entities.

It may just be the time of morning but “Galaxy schema” sounds dumb specifically because the Kimball style of star schema implicitly includes what the galaxy schema shows. Dimensions are conformed, which means they apply across facts, which implies that there may be multiple facts in the schema design. This means that galaxy schemas necessarily star schemas. For the sake of education, we tend to focus on one fact table but a star schema with two fact tables is still a star schema.

Anyhow, that’s my minor rant of the day. It’s not Steve’s fault somebody misunderstood the concept of star schemas and began promulgating this unnecessary term.

Comments closed

Swap an Identity Column with a Sequence

Greg Larsen looks at sequences:

Microsoft introduced the sequence number objects starting with SQL Server 2012. A sequence object generates sequence numbers based on starting and increment values, similar to an identity column, but it has additional features. Over time, you might find that the additional benefits of a sequence number have you wanting to replace an identity column with a sequence number. This article demonstrates two options for replacing an identity column with a sequence number.

When it came out, I thought I was going to use sequences a quite often. In practice, I’ve used it a few times, but IDENTITY is so much easier to type and I’ve rarely needed sequence generators. That said, the times I have needed them, I definitely appreciate their existence.

Comments closed

The Importance of Data Dictionaries

John Morehouse takes us through data dictionaries:

Data professionals—whether they’re database administrators (DBAs), developers, or data scientists—work in a wide and varied landscape usually in flux and filled with challenges. These challenges could range from changing business requirements to keeping up with the sheer velocity at which technology evolves.

It’s also critical for these professionals to understand their organization’s data and how it applies to a given application or business unit. Better outcomes usually come from employing data dictionaries throughout the organization. Through many years of experience in IT professions, I’ve seen the utilization of data dictionaries range from “not at all” to “I’m documenting every possible data attribute known to humankind.” In my experience, data-related projects with data dictionaries as part of the process are far more likely to be successful than projects without them (even extremely populated data dictionaries are more useful than nothing at all). Trust me on this.

Click through to understand why you should trust John on this one.

Comments closed

Sixth Normal Form to Avoid NULLs

I have a response to an article:

I linked to this on Curated SQL, where I’d started to write out a response. After about four paragraphs of that, I decided that maybe it’d make sense to turn this into a full blog post rather than a mini-commentary, as I think it deserves the lengthier treatment. I’m going to assume that you’ve read Aaron’s post first, and it’s a well-done apologia in support of using NULLs pragmatically. I’ll start my response with a point of agreement, but then move to differences and alternatives before laying out where I see additional common ground between Aaron’s and my thoughts on the matter.

One explicit assumption in here is that you don’t really have a large number of nullable (or NULLable, as long-form blogging me wants to write) columns on a given table. 6NF-style tables for nullable attributes is a lot less tenable when you have 15 or 20 distinct nullable columns on a table, but at that point I have to ask, is your data model actually correct if you have that many missing attributes?

Comments closed

Against Reporting Tables

Erik Darling doesn’t like reporting tables:

I’ve seen a lot of crazy ways for people to run reports on live data, but one of the worst ideas is creating a reporting table based on the set of data a user wants, and then letting them query that data.

As usual, Erik says something I want to disagree with, and then I read the post and don’t really disagree with him—or if I do, he’s already laid out the “Yes, I understand X” exception. I’ve used reporting tables to good effect, but the important thing is that they’re general-purpose and designed into the application, not specific to a single user.

Comments closed

Database Deployment with External References

Sebastian Meine and Liz Baron try to untangle the Gordian knot:

Most database developers are dealing with databases that contain external references. Even if the database code is in source control, these external references can make it very difficult to deploy to new environments. In these multi-database environments, tools like SQLCompare and SQL Change Automation do not automatically resolve object-order across databases, resulting in errors during deployment.

One way to tackle this, which works especially well for CI pipelines, is to create facades for all externally referenced databases. A facade in this context is a database with the expected name, with the expected objects, but those objects are hollowed out and do not contain any dependencies. You can compare this concept to an interface in an object-oriented language. Once you have these facades, they can be used in a pre-deployment step, simplifying the rest of the deployment by effectively removing object-order dependencies with these external databases.

This is one of the most painful parts of converting existing databases into model-driven database development. Especially once you start having to deal with cross-dependencies and rapidly-changing databases.

Comments closed