Junk Dimensions

Jesse Seymour talks about junk dimensions in warehousing:

I think one of the single biggest challenges I face as I attempt to warehouse data that originates as a SharePoint list is the handful of miscellaneous descriptive fields, such as approval status, request status, or something similar.  Typically, this fields are setup as Choice fields in the SharePoint list so they have a known range of values, but its still a pain to have to build a dimension for each one.

Enter the junk dimension.  Ever since I learned about this concept it has made my life so much easier.  What the junk dimension does is perform a cross join against the different fields and creates a row for every possible combination of fields.

Junk dimensions are nice for those low-cardinality attributes which are important but don’t really fit anywhere else.  The important thing to remember about a junk dimension is that you don’t want it to be too large:  if you have 5 attributes, each of which has 8 possible values, you have 8^5 (32,768) rows.  That’s not so bad, but make it 10 attributes and now your table has 1,073,741,824 rows, and that’s a lot of rows for a single dimension.  If you find yourself in that scenario, you might want to create two junk dimensions (bringing you back to 2 dimensions with 32K rows), review your design to see if all those attributes are necessary, or review your design to see if your “junk” dimension is hiding a real dimension.

Logical Data Models

Kevin Kline discusses logical data modeling:

In a recent blog post entitled Is Logical Data Modeling Dead?, Karen Lopez (b | t) comments on the trends in the data modeling discipline and shares her own processes and preferences for logical data modeling (LDM). Her key point is that LDMs are on the decline primarily because they (and their creators) have failed to adapt to changing development processes and trends.

I love all things data modeling. I found data models to be a soothing and reassuring roadmap that underpinned the requirements analysis and spec writing of the Dev team, as well as a supremely informative artifact of the Dev process which I would constantly refer to when writing new T-SQL code and performing maintenance. However, as time has passed, I have been surprised by how far it has fallen out of favor.

This is an interesting discussion.  I’m not sure I’ve ever created a true logical data model.  I’ve worked with systems which could potentially take advantage of them, but they never hit the top of the priority list.

Choosing Between Optimistic Concurrency Levels

Kendra Little has a cheat sheet for comparing the two optimistic concurrency levels:

SQL Server offers two flavors of optimistic locking for traditional disk-based tables: Read Committed Snapshot Isolation (RCSI), and Snapshot Isolation. They are each great tools to reduce blocking and make applications faster, particularly in databases that have lots of tiny reads and writes that need to be quick, or mixed-use patterns (lots of little reads and writes + larger reporting queries).

Both isolation levels are available in SQL Server Standard Edition and Web Edition as well as Enterprise. That’s right, they’re cheap and easy. Each of them are controlled by database level settings, and default to disabled when you install SQL Server.

The moral of the story:  both of these are awesome, both have potential drawbacks, and both need testing.  I’ve had good experiences with RCSI, but even then, maybe about 1% of procedures need specific locking hints (either NOLOCK or an explicit lock) to maintain previous application behavior and to deal with the problem Kendra brought up.  Moral of the story:  test, test, test.

DR On The Cheap

Derik Hammer’s final Availability Group architecture post covers disaster recovery on the cheap:

This architecture can be used when your organization does not value their secondary data center the same as the primary. It is a best practice to have matched or similar hardware between your primary and disaster recovery sites but that is not always possible. When costs need to be reduced it is better to have one failover server that you know can handle the work load rather than two servers which are under powered. Under powered hardware can easily become an effective outage if they cause timeouts as soon as a production work load is placed on them.

For many small to medium-sized enterprises, this might be the easiest route to sell to management—it’s hard to get management to have “redundant” servers which normally don’t get used.

Online Database Modelling Services

Robert Sheldon has a look at some web-based database modelling services:

Here I look at five tools that provide online modeling services, ranging from commercial products to free, open-source solutions. They include Vertabelo, GenMyModel, dbDiffo, WWW SQL Designer, and DbDesigner.net. Each one takes a different approach to delivering its services and completing basic tasks, such as adding tables, columns, or relationships. You’ll find that some tools are more feature-rich and user-friendly than others, but each one has its own advantages and charms.

That said, none of the tools provide the level of functionality you get with an advanced on-premises solution such as PowerDesigner or ER/Studio, but not everyone needs such an extensive set of features—or the price tag that goes with them. In fact, all five online tools come either completely free or have a free version available, making it possible to try all of them without having to commit one way or the other. Not surprisingly, the free versions associated with the commercial products have limitations on their use.

The hat I’d throw into the ring is draw.io.  It’s a Visio look-alike, so it’s more useful for high-level strategic diagrams than a true model.

Availability Groups And Failover Cluster Instances

Derik Hammer digs into a more complex architecture for HA+DR:

The Availability Group will handle the data synchronization between the data centers. Asynchronous mode is recommend due to potential network latency hindering primary site performance. With the combination of these two features, you meet HA with the FCI’s automatic failover locally and DR with manual failover of the Availability Group between sites.

Derik’s conclusion is important:  this isn’t the type of thing a brand new accidental DBA should try to build; there are a lot of moving parts here.  But if you want a robust solution and have the support people available to handle the relative complexity, this is a good option.

The Value Of Sparse Columns

Erin Stellato discusses sparse columns:

In conclusion, we see a significant reduction in disk space and IO when sparse columns are used, and they perform slightly better than non-sparse columns in our simple data modification tests (note that retrieval performance should also be considered; perhaps the subject of another post).

Sparse columns have a lot of potential value, but in my experience, they fall short in one huge way:  you cannot compress tables with sparse columns.  Given that both sparse columns and data compression are things which benefit from scale, it’s important to make the right choice upfront.

Setting Up A Scrub Server

Tom Norman talks about setting up a data scrubbing server:

With that said, when I setup a Scrub server, it is in a very secure area where the data access is very, very limited. For example, in my current company, the server is in a separate domain from Production and QA/Dev. Only DBAs are allowed to access this server. If you have multiple DBAs at your location, you may want to even limit which DBAs have access to this server. Our goal is to automate the entire scrubbing process so no one has to access the data including copying backup files from Production and to a shared scrub location for QA/Dev to retrieve.

Scrub servers are a way of stripping personally identifiable or sensitive information from production data so developers can safely use the data in lower, less secure environments.

Options To Capture Changed Data

Koen Verbeeck looks at various ways of capturing changed data:

  • In some very rare cases, you can actually use change data capture or change tracking on the source system. If you get one of those features implemented, you’re golden. But most of the time you’re not, as a lot of administrators don’t like them because of potential performance impact.

Koen lists several options.  One additional option is to use triggers to capture changes in a queue table.  If you are dealing with SCD-1 changes (in which you do not need a full reckoning of history) or periodic SCD-2 (in which you keep history but are okay with smashing some changes together if they’re within a time period between ETL loads), loading IDs of changed records into a queue table is reasonably efficient and gets around trying to make sure everybody updates the modified date.  It has its own drawbacks, though, starting with it using triggers…

Dropping Unique Key Constraints

Kenneth Fisher doesn’t like the way SQL Server implements unique key constraints:

So now that we all know the basics what could possibly have gone wrong? Well I was handed an error.

Msg 3723, Level 16, State 5, Line 21
An explicit DROP INDEX is not allowed on index ‘TblUniqueConstraint.uni_TblUniqueConstraint’. It is being used for UNIQUE KEY constraint enforcement.

Someone had created a process several years ago that dropped and re-created indexes (I’m not going to go into why right now). Well this particular index is used to enforce a unique constraint and so it can’t be dropped. If you want to follow along here is some quick code to duplicate the problem.

The appropriate way to drop a unique key constraint is ALTER TABLE [TableName] DROP CONSTRAINT [ConstraintName].

I disagree with Kenneth that there’s no value in unique key constraints (I’m guessing implicit in here is “in comparison to using CREATE UNIQUE NONCLUSTERED INDEX” syntax).  There’s a semantic difference between an index which happens to be unique versus a unique key constraint.  They’re implemented very similarly, but the point of the latter is to tell anybody using the data model that this set of attributes must be unique.

Categories

June 2017
MTWTFSS
« May  
 1234
567891011
12131415161718
19202122232425
2627282930