Modern Data Warehouse Dictionary

Melissa Coates has put together a glossary of terms for modern data warehousing:

Logical Data Warehouse

A logical data warehouse (LDW) builds upon the traditional DW by providing unified data access to multiple platforms. Conceptually, the logical data warehouse is a view layer that abstractly accesses distributed systems such as relational DBs, NoSQL DBs, data lakes, in-memory data structures, and so forth, consolidating and relating the data in a virtual layer. This availability of data on various platforms adds flexibility to a traditional DW and speeds up data availability. The tradeoff for this flexibility can be slower performance for user queries, though full-fledged LDW vendors employ an array of optimization techniques to mitigate performance issues. A logical data warehouse is broader than just data virtualization and distributed processing, which can be thought of as enabling technologies. According to Gartner, a full-fledged LDW system also involves metadata management, repository management, taxonomy/ontology resolution, auditing & performance services, as well as service level agreement management.

If you’re just getting started with the topic, check this out, as it will probably clear up several concepts.
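
To make the data virtualization piece a bit more concrete, one enabling technology in the SQL Server world is PolyBase external tables in SQL Server 2016. Here is a rough sketch; the server address, file paths, and table names (including the DimDate join) are made up for illustration, and a full LDW layers much more on top of this:

CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.4:8020');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- The "table" lives in the data lake; only metadata lives in SQL Server.
CREATE EXTERNAL TABLE dbo.WebClicks
(
    ClickDate  DATE,
    PageUrl    NVARCHAR(400),
    Clicks     INT
)
WITH (LOCATION = '/data/webclicks/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = CsvFormat);

-- Query lake data and warehouse tables together in one statement.
SELECT d.CalendarMonth, SUM(c.Clicks) AS Clicks
FROM dbo.WebClicks c
JOIN dbo.DimDate d ON c.ClickDate = d.[Date]
GROUP BY d.CalendarMonth;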

Where Azure Analysis Services Fits

Melissa Coates explains where Azure Analysis Services fits in common BI architectures:

(2) Data Sources

  • From a single source such as a data warehouse. This is the most traditional path for BI development, and still has a very valid place in many BI/analytics deployments. This scenario puts the work of data integration on the ETL process that loads the data warehouse, which is the most appropriate place for it.

  • Directly from various systems.  This can be done, but works well only in specific cases – it definitely won’t work well if there are a lot of highly normalized tables, or if there’s not a straightforward way to relate the disparate data together. Trying to go directly to the source systems & skip an intermediary data warehouse puts the “integration” burden on the data source view in Analysis Services, so plan for plenty of time testing if you’re going to try this route (i.e., it can be much harder, not easier). Note that this option only makes sense if the data is stored in Analysis Services because it needs to be related together somehow (i.e., DirectQuery mode, discussed next in #3, with > 1 data source won’t work if a user tries to combine data sources because the data is not inherently related).

If you’re thinking about Azure Analysis Services, this post is a good one.

A T-SQL Date Dimension

Vladimir Oselsky builds a date dimension in T-SQL:

Before we get into discussing how to create a date dimension and how to use it, first let’s talk about what it is and why we need it. Depending on who you talk to, people can refer to this concept as a “Calendar table” or “Date Dimension,” which is usually found in a data warehouse. No matter what it is called, at the end of the day, it is a table in SQL Server which is populated with different date/calendar-related information to help speed up SQL queries which require specific parts of dates.

In my case, I have created it to be able to aggregate data by quarter, year, and month. Depending on how extensive your requirements are, they will add complexity to building it. Since I don’t care about holidays (for now at least), I will not be creating a holiday schedule, which can be complicated to populate.

I love date dimensions, even on non-warehouse databases, because they’re an easy way of providing additional context to time series data.  Think about graphing orders per day in an industry with weekday-versus-weekend trends; a date dimension lets you strip out weekends (maybe plotting them separately), build day-of-week analysis, or look at week of the month, etc.  You might also be interested in computing holidays.
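
For reference, here is a minimal sketch of the kind of table Vladimir describes. The table and column names are my own, and a real-world version would carry many more attributes (fiscal periods, holidays, and so on):

CREATE TABLE dbo.DimDate
(
    DateKey         INT          NOT NULL PRIMARY KEY,  -- e.g. 20170109
    [Date]          DATE         NOT NULL,
    CalendarYear    INT          NOT NULL,
    CalendarQuarter TINYINT      NOT NULL,
    CalendarMonth   TINYINT      NOT NULL,
    DayOfWeekName   NVARCHAR(10) NOT NULL,
    IsWeekend       BIT          NOT NULL
);

-- Generate one row per day for 2010-2030 using a row-number trick.
;WITH n AS
(
    SELECT TOP (DATEDIFF(DAY, '20100101', '20301231') + 1)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS rn
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
    ORDER BY rn
),
d AS
(
    SELECT DATEADD(DAY, rn, CAST('20100101' AS DATE)) AS [Date]
    FROM n
)
INSERT dbo.DimDate
SELECT CONVERT(INT, CONVERT(CHAR(8), [Date], 112)),
       [Date],
       YEAR([Date]),
       DATEPART(QUARTER, [Date]),
       MONTH([Date]),
       DATENAME(WEEKDAY, [Date]),
       CASE WHEN DATENAME(WEEKDAY, [Date]) IN (N'Saturday', N'Sunday')
            THEN 1 ELSE 0 END
FROM d;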

Range-Based Dimensions

Jana Sattainathan has a couple of blog posts on range-based dimensions.  The first covers durations:

The data is in increments of 300 seconds going from 0 to 31536000 seconds (1 year). So, this table can be used to analyze activities that take less than 1 year. The last row’s Dimension value should be used for everything that takes over one year (or you can generate more rows based on your need).
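
A rough sketch of generating that kind of duration dimension follows; the table name, column names, and label format are my own rather than Jana's:

;WITH n AS
(
    SELECT TOP (31536000 / 300 + 1)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS bucket
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
    ORDER BY bucket
)
SELECT bucket * 300       AS RangeStartSeconds,
       bucket * 300 + 299 AS RangeEndSeconds,
       CONCAT(bucket * 300, ' - ', bucket * 300 + 299, ' sec') AS DurationRange
INTO dbo.DimDuration
FROM n;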

The second is size ranges:

In the middle there, one of the bar charts is “Backup Count & Duration by Size”. As the title says, this chart helps me determine which backups are small/large and how many backups are in each of those buckets. The bucket ranges I used in this case could easily have been changed from GB ranges to TB ranges. For example, I filtered the chart to check counts of backups that are over 1 TB.  As one can see, I have a couple of databases that are in the 2.5 to 3 TB backup size range.

Oftentimes, ranges are enough for analysis, and the greater detail of a backup being 12.8 GB versus 12.81 GB obscures more useful information.
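
As a quick illustration of the idea (hypothetical table names, and not Jana's exact ranges), bucketing backups by size is just a range join plus a GROUP BY:

SELECT r.SizeRangeLabel,             -- e.g. '2.5 - 3 TB'
       COUNT(*)           AS BackupCount,
       AVG(b.DurationSec) AS AvgDurationSec
FROM dbo.BackupHistory b
JOIN dbo.DimSizeRange r
  ON b.BackupSizeGB >= r.RangeStartGB
 AND b.BackupSizeGB <  r.RangeEndGB
GROUP BY r.SizeRangeLabel
ORDER BY r.SizeRangeLabel;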

Semantic Layers

Melissa Coates explains the relevance of Analysis Services as a semantic layer:

Part 1: Why a Semantic Layer Like Azure Analysis Services is Relevant {you are here}

Part 2: Where Azure Analysis Services Fits Into BI & Analytics Architecture {coming soon}

Fundamentally, Analysis Services serves as a semantic layer (see below for further discussion of a semantic layer). Because the business intelligence industry now embraces an array of technology choices, sometimes it seems like a semantic layer is no longer valued like it once was. Well, my opinion is that for many businesses, a semantic layer is tremendously important to support the majority of business users who do *not* want to do their own data wrangling, data prep, and data modeling activities.

We (I) spend so much time thinking about the Brave New World of massive blobs of semi-structured data that it’s a good idea to step back every once in a while and remember that yes, there is a need for sanitized, easy-to-consume data which answers known business questions.  The percentage of people at a company willing to create an R or Python notebook or run a MapReduce job is typically well under 5%.

The Case For Self-Service BI

Matt Allington makes the case for self-service BI:

Success or failure of Enterprise BI can be shown as a continuum.

The 5 sample points I call out (from best to worst) are:

  1. It adds lots of value to lots of people.
  2. It’s OK; lots of “export to Excel.”
  3. Some use, but not worth the cost.
  4. It is a failure and it is written off.
  5. It is a failure, but you keep it.

Note what I list as the worst possible outcome.  The solution is no good, and no one does anything about it.  This is much worse than writing it off as a failure, as you can’t move on if you don’t accept that you have a problem.

This is a provocative article with some good comments.  I have mixed emotions about this, as I see Matt’s point and agree with him in the hypothetical scenario, but it’s really easy for business users to get the wrong answers from self-service tools (e.g., by introducing hidden cartesian products or not applying all business rules to calculations) and give up on the product.  That might be a function of me doing it wrong, and I’ll cop to that if so, but I think that self-service BI needs a “You must be this tall to ride” sign.
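
For what I mean by a hidden cartesian product, here's a tiny sketch with made-up tables: one order with two lines and two shipments double-counts the order total as soon as both child tables get joined in.

-- Orders (1 row), OrderLines (2 rows), Shipments (2 rows) are hypothetical.
SELECT o.OrderID,
       SUM(ol.LineAmount) AS OrderTotal   -- inflated 2x by the shipment join
FROM dbo.Orders o
JOIN dbo.OrderLines ol ON ol.OrderID = o.OrderID
JOIN dbo.Shipments  s  ON s.OrderID  = o.OrderID
GROUP BY o.OrderID;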

Row-Level Security With Reporting Services

Paul Turley discusses combining row-level security, SQL Server Reporting Services, and SQL Server Analysis Services:

In every data source connection string, you can add a simple expression that maps the current Windows username to the CUSTOMDATA property of the data source provider.  This works in SSRS embedded data sources, shared data sources, in a SharePoint Office Data Connection (ODC) file, and in a SharePoint BISM connection file.  In each case, the syntax should be similar.  Here is my shared data source on the SSRS 2016 report server.

This is pretty snazzy.  Paul goes into good detail on the topic, so read the whole thing.
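
To sketch the two moving parts (the server, database, and table names here are made up, so treat this as an approximation of the pattern rather than Paul's exact setup): an expression-based connection string in an SSRS data source passes the report user into CUSTOMDATA, and a row filter in the Analysis Services role reads it back out.

-- SSRS data source: expression-based connection string.
="Data Source=ssas01;Initial Catalog=SalesTabular;CustomData=" & User!UserID

-- SSAS tabular role row filter (DAX) limiting DimEmployee to the current user:
=DimEmployee[LoginID] = CUSTOMDATA()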

Why Have A Date Dimension

Thomas LeBlanc discusses reasons for having a date dimension in a data warehouse:

The date dimension can also contain columns for Weekend versus Weekday, Holiday, and month markers like 2014-10 or quarter markers like 2014-Q1. All of these can be computed once in the dimension table and used at will by query writers. They no longer have to know how to use T-SQL functions or concatenate substrings of “CASTed” date columns.

Then, when the DimDate is related to various fact tables and processed into an OLAP cube, the measures and aggregations are displayable side by side through the DimDate dimension, which is now considered a conformed dimension. The slicing and dicing of data has just been made a whole lot easier.
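
A quick sketch of the payoff, with hypothetical fact and column names: the query writer groups by precomputed columns instead of wrangling T-SQL date functions.

SELECT d.MonthMarker,            -- e.g. '2014-10'
       d.WeekendWeekday,         -- 'Weekend' or 'Weekday'
       SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactSales f
JOIN dbo.DimDate   d ON f.OrderDateKey = d.DateKey
GROUP BY d.MonthMarker, d.WeekendWeekday
ORDER BY d.MonthMarker, d.WeekendWeekday;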

I’d go a step further and say that every instance should have access to a tally table and a date table.

Warehouses Will Live On

Jesse Seymour argues that in-memory analysis solutions will not entirely replace data warehouses:

The big reason that dimensional modeling increases clarity is that the dimensional model seeks to flatten data as much as possible.  Let’s compare two examples.  Both of these examples are for a fictional health clinic.

The first example is that we want a report on how many male patients were treated with electric shock therapy by provider, grouped monthly and spanning a year-to-date range.
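
Something like the following is what that report looks like against a flattened, Kimball-style model; all table and column names here are invented for illustration.

SELECT d.CalendarYear,
       d.CalendarMonth,
       p.ProviderName,
       COUNT(*) AS TreatmentCount
FROM dbo.FactTreatment f
JOIN dbo.DimDate      d  ON f.TreatmentDateKey = d.DateKey
JOIN dbo.DimProvider  p  ON f.ProviderKey      = p.ProviderKey
JOIN dbo.DimPatient   pt ON f.PatientKey       = pt.PatientKey
JOIN dbo.DimTreatment t  ON f.TreatmentKey     = t.TreatmentKey
WHERE pt.Gender = 'M'
  AND t.TreatmentName = 'Electroconvulsive therapy'
  AND d.[Date] >= DATEFROMPARTS(YEAR(GETDATE()), 1, 1)
  AND d.[Date] <= CAST(GETDATE() AS DATE)
GROUP BY d.CalendarYear, d.CalendarMonth, p.ProviderName
ORDER BY d.CalendarYear, d.CalendarMonth, p.ProviderName;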

Those big Kimball-style warehouses do a great job of making it easier for people who are not database specialists to query data and get meaningful, consistent results to known business questions.  The trick to understanding data platforms is that they tend to be complements rather than substitutes:  introducing Spark-R in your environment does not replace your Kimball-style warehouse; it complements it by letting analysts find trends more easily.  Similarly, a Hadoop cluster potentially lets you complement an existing data warehouse in a few ways:  acting as a data aggregator (which allows you to push some ETL work off onto the cluster), a data collector (especially for information which is useful but doesn’t really fit in your conformed warehouse), and a data processor (particularly for those gigantic queries which are not time-sensitive).

Early Metrics On Warehouse Performance

Sunil Agarwal shows some results from a sample workload indicating that SQL Server 2016 has improved two customers’ performance:

As part of the SQL Server 2016 technology adoption program, during development we work with many customers, validating their production-like workloads in a test environment and opportunistically taking some of these workloads to production on a production-ready preview build.

In one such engagement, we worked with a customer in the health industry who was running an analytics workload on a Sybase IQ 15.4 system. Challenged by exponential data growth and the requirement to run analytics queries even faster for insights, the customer wanted to compare solutions from multiple vendors to see which analytical database could deliver the performance and features they need over the next 3-5 years. After extensive proof-of-concept projects, they concluded that SQL Server 2016’s clustered columnstore delivered the best performance. The performance proof of concept tested the current database against Sybase IQ 16, MS SQL 2016, Oracle 12c, and SAP HANA, using the central tables from the real-life data model filled with synthetic data in a cloud environment. MS SQL Server 2016 came out the clear winner. SAP HANA was second in performance but also required much more memory and displayed significant query performance outliers. The other contenders were outperformed by a factor of 2 or more.

Standard disclaimers apply:  your mileage may vary; we don’t get raw data; “all other things” are not necessarily equal.
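
For context, "clustered columnstore" here means the fact tables themselves are stored in columnar format; creating one on a (hypothetical) fact table is a one-liner, and the analytic gains come from compression, segment elimination, and batch mode processing.

CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactClaims
    ON dbo.FactClaims;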

