Junk Dimensions

Jesse Seymour talks about junk dimensions in warehousing:

I think one of the single biggest challenges I face as I attempt to warehouse data that originates as a SharePoint list is the handful of miscellaneous descriptive fields, such as approval status, request status, or something similar.  Typically, this fields are setup as Choice fields in the SharePoint list so they have a known range of values, but its still a pain to have to build a dimension for each one.

Enter the junk dimension.  Ever since I learned about this concept it has made my life so much easier.  What the junk dimension does is perform a cross join against the different fields and creates a row for every possible combination of fields.

Junk dimensions are nice for those low-cardinality attributes which are important but don’t really fit anywhere else.  The important thing to remember about a junk dimension is that you don’t want it to be too large:  if you have 5 attributes, each of which has 8 possible values, you have 8^5 (32,768) rows.  That’s not so bad, but make it 10 attributes and now your table has 1,073,741,824 rows, and that’s a lot of rows for a single dimension.  If you find yourself in that scenario, you might want to create two junk dimensions (bringing you back to 2 dimensions with 32K rows), review your design to see if all those attributes are necessary, or review your design to see if your “junk” dimension is hiding a real dimension.

Related Posts

Machine Learning and Delta Lake

Brenner Heintz and Denny Lee walk us through solving data engineering problems with Delta Lake: As a result, companies tend to have a lot of raw, unstructured data that they’ve collected from various sources sitting stagnant in data lakes. Without a way to reliably combine historical data with real-time streaming data, and add structure to […]

Read More

Don’t Truncate Facts and Dimensions when Loading Data

Meagan Longoria explains why a truncate-and-reload strategy for data warehouses isn’t a good look: Every once in a while, I come across a data warehouse where the data load uses a full truncate and reload pattern to populate a fact or dimension. While it may not be the end of the world for a small […]

Read More

Categories

February 2016
MTWTFSS
« Jan Mar »
1234567
891011121314
15161718192021
22232425262728
29