Press "Enter" to skip to content

Category: Warehousing

Lambda Architecture

Koos van Strien looks at lambda architecture and asks if it works for data warehouses:

The Lambda Architecture is pretty well documented – online as well as in the book I just mentioned. For a quick overview, Lambda Architecture is basically a system where the raw data is always stored, and never thrown away. All information that’s derived from this raw data is always recomputed – often stated as query = function(all data). This provides for a fool-proof architecture that’s rigorously simple (compared to classic RDBMS solutions), made up of three layers.

Admittedly, about half of this went over my head, but there are some good book and webpage recommendations to learn more about lambda architecture and Data Vault.
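Still, the query = function(all data) idea can be made a bit more concrete with a minimal T-SQL sketch. The table and view names here are my own invention, not from the post: derived results live only in a view that recomputes over an immutable, append-only raw table.

-- A loose illustration of query = function(all data): nothing ever
-- updates or deletes rows in the raw table, and the derived result
-- is recomputed from all raw events every time it's queried.
CREATE TABLE dbo.RawPageViews (
    PageUrl  VARCHAR(400) NOT NULL,
    ViewedAt DATETIME2    NOT NULL
);
GO

CREATE VIEW dbo.PageViewCounts AS
SELECT PageUrl, COUNT_BIG(*) AS ViewCount
FROM dbo.RawPageViews
GROUP BY PageUrl;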


Azure SQL Data Warehouse

Robert Sheldon has a getting started guide for Azure SQL Data Warehouse:

In my Simple-Talk article Azure SQL Data Warehouse, I introduced you to SQL Data Warehouse and gave you an overview of the architecture and technologies that drive the service and make it all work. In this article, I go a step further and provide details about getting started with SQL Data Warehouse, demonstrating how to add a sample database and then access the server and database settings.

If you want to follow along with my examples and try out SQL Data Warehouse for yourself, you must have an active Azure subscription, even if it’s just the free trial. For those who have already used up their free trial, be aware that SQL Data Warehouse is a pay-as-you-go service, even though it’s still in preview, so unless you’re on an unlimited company budget or happen to have accrued MSDN credits, you’ll want to be judicious in how you try out the service. Fortunately, as you’ll see in this article, you can pause the compute resources when not in use, helping to minimize the costs associated with learning about the service.

This article is all about initial installation and configuration.
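Since cost comes up in the quote, here's a hedged sketch of the one lever you can pull from T-SQL: scaling the compute tier (DWU) down when you're not using it. The database name is hypothetical, and pausing compute entirely is done through the Azure portal or PowerShell rather than T-SQL.

-- Hypothetical example: scale an Azure SQL Data Warehouse down to the
-- smallest compute tier. Run against the master database on the server.
ALTER DATABASE MySampleDW
MODIFY (SERVICE_OBJECTIVE = 'DW100');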


Junk Dimensions

Jesse Seymour talks about junk dimensions in warehousing:

I think one of the single biggest challenges I face as I attempt to warehouse data that originates as a SharePoint list is the handful of miscellaneous descriptive fields, such as approval status, request status, or something similar.  Typically, these fields are set up as Choice fields in the SharePoint list so they have a known range of values, but it’s still a pain to have to build a dimension for each one.

Enter the junk dimension.  Ever since I learned about this concept, it has made my life so much easier.  What the junk dimension does is cross join the different fields, creating a row for every possible combination of values.

Junk dimensions are nice for those low-cardinality attributes which are important but don’t really fit anywhere else.  The important thing to remember about a junk dimension is that you don’t want it to be too large:  if you have 5 attributes, each of which has 8 possible values, you have 8^5 (32,768) rows.  That’s not so bad, but make it 10 attributes and now your table has 1,073,741,824 rows, and that’s a lot of rows for a single dimension.  If you find yourself in that scenario, you might want to create two junk dimensions (bringing you back to 2 dimensions with 32K rows), review your design to see if all those attributes are necessary, or review your design to see if your “junk” dimension is hiding a real dimension.
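For illustration, here's a minimal T-SQL sketch of the cross-join pattern Jesse describes. The attribute names and value lists are made up for the example, not taken from his post.

-- Hypothetical junk dimension: cross join the known value lists of
-- three low-cardinality attributes, producing one row per combination.
CREATE TABLE dbo.DimRequestJunk (
    RequestJunkKey INT IDENTITY(1,1) PRIMARY KEY,
    ApprovalStatus VARCHAR(20) NOT NULL,
    RequestStatus  VARCHAR(20) NOT NULL,
    Priority       VARCHAR(10) NOT NULL
);

INSERT INTO dbo.DimRequestJunk (ApprovalStatus, RequestStatus, Priority)
SELECT a.Val, r.Val, p.Val
FROM (VALUES ('Pending'), ('Approved'), ('Rejected')) AS a(Val)
CROSS JOIN (VALUES ('Open'), ('In Progress'), ('Closed')) AS r(Val)
CROSS JOIN (VALUES ('Low'), ('Medium'), ('High')) AS p(Val);
-- 3 x 3 x 3 = 27 rows; the fact table then carries a single
-- RequestJunkKey instead of three separate foreign keys.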


Type 2 SCDs With Biml

Meagan Longoria has a great post on Type 2 Slowly Changing Dimensions:

The most common mistake I see in SCD 2 packages, whether using the built-in transformation or creating your own data flow, is that people use OLEDB commands to perform updates one row at a time rather than writing updates to a staging table and performing a set-based update on all rows.  If your dimension is small, the performance from row-by-row updates may be acceptable, but the overhead associated with using a staging table and performing a set-based update will probably be negligible. So why not keep a consistent pattern for all type 2 dimensions and require no changes if the dimension grows?

Spot on.
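For anyone who hasn't seen the pattern, here's a rough T-SQL sketch of the set-based approach she describes; the dimension and staging table names are hypothetical.

-- Hypothetical SCD 2 update: expire every changed row in one set-based
-- UPDATE from a staging table, instead of an OLEDB Command per row.
UPDATE d
SET    d.RowEndDate = s.RowStartDate,
       d.IsCurrent  = 0
FROM   dbo.DimCustomer AS d
JOIN   staging.CustomerChanges AS s
       ON s.CustomerBusinessKey = d.CustomerBusinessKey
WHERE  d.IsCurrent = 1;

-- The new versions of those rows then arrive in one set-based insert.
INSERT INTO dbo.DimCustomer
    (CustomerBusinessKey, CustomerName, RowStartDate, RowEndDate, IsCurrent)
SELECT CustomerBusinessKey, CustomerName, RowStartDate, '9999-12-31', 1
FROM   staging.CustomerChanges;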


The Logical Data Warehouse

Robert Sheldon is looking beyond the Enterprise Data Warehouse:

Organizations looking to take control of this onslaught of information are turning to other solutions to meet their data needs, either in addition to or instead of the traditional EDW. Quite often this means turning to a logical architecture that abstracts the inherent complexities of the big data universe. Such an approach embraces mixed environments through the use of distributed processing, data virtualization, metadata management, and other technologies that help ease the pain of accessing and federating data.

Dubbed the logical data warehouse (LDW), this virtual approach to a BI analytics infrastructure originated with Mark Beyer while he was participating in Gartner’s Big Data, Extreme Information and Information Capabilities Framework research in 2011. According to his blog post “Mark Beyer, Father of the Logical Data Warehouse, Guest Post,” Beyer believes that the way to approach analytical data is to focus on the logic of the information, rather than the mechanics.

This feels like something that first-movers are starting to adopt, but won’t be mainstream for another 6-8 years.  That should give the idea some time to mature as we see the first round of successes and (more importantly) failures.
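As one concrete (if simplified) taste of the data virtualization piece, PolyBase-style external tables let T-SQL query data that lives outside the warehouse and join it to local tables. Everything below (the storage account, paths, and table names) is hypothetical.

-- Loose illustration of data virtualization: external tables expose
-- files in Azure blob storage as if they were tables in the database.
CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://logs@mystorageaccount.blob.core.windows.net'
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

CREATE EXTERNAL TABLE dbo.ExternalWebLogs (
    LogDate  DATETIME2,
    PageUrl  VARCHAR(400),
    Visitors INT
)
WITH (
    LOCATION = '/weblogs/',
    DATA_SOURCE = AzureBlobStore,
    FILE_FORMAT = CsvFormat
);
-- ExternalWebLogs can now be joined to local warehouse tables in
-- ordinary SELECT statements, with no ETL to land the data first.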
