Data Warehouse Automation

Koos van Strien provides some thoughts on data warehouse automation tools:

Currently, I think there are two main approaches to Data Warehouse Automation

  1. Data Warehouse Generation: You provide sources, mappings, datatype mappings etc.. The tool generates code (or artifacts).
  2. Data Warehouse Automation (DWA): The tool not only generates code / artifacts, but also manages the existing Data Warehouse, by offering continuous insight in data flows, actual lineage, row numbers, etc..

The difference might seem small, but IMHO is visible most clearly whenever changes occur in the Data Warehouse – the second class of tools can handle those changes (while preserving history). With the first class of tools provide you with the new structures, but you need to handle the preservation of history yourself (as you would’ve without DWA).

Read on for a contrast of these two approaches.

Why Hadoop BI Projects Fail

Remy Rosenbaum lays out several reasons why he’s seen business intelligence projects on Hadoop fail:

In order to set up and run an effective Big Data Hadoop project that provides reliable BI, your organization will need to adopt a new mindset that addresses not only the technology, but also the organizational EIM. You will need to conduct a comprehensive analysis of your business with the help of analysts, internal domain experts, and strategists to come up with robust and relevant business use cases. You will also need buy-in from management, and take company politics into consideration.

Your Big Data project needs to work with your existing BI tools, along with your security and monitoring systems. Data security needs to be addressed because standard Hadoop implementations have relatively poor security, and many organizations are wary of keeping all their data in one location.

I do agree with these reasons, though I’m a bit surprised that I didn’t see much about “classic” BI problems like the inability of the company to standardize on terminology or definitions (e.g., what the Kimball method describes as conformed dimensions), the desire to tackle too much of the problem at once, rapidly-changing source systems (and how BI team members tend to be the last to know that something has changed), etc.

Rolling Out An Analytics Project

Christina Prevalsky shares some thoughts on considerations when implementing an analytics project:

The earlier you address data quality the better; the less time your end users spend on data wrangling, and the more they can focus on high value analytics. As your organization’s data infrastructure matures, migrating from spreadsheets to databases and data warehouses, data quality checks should be formally defined, documented, and automated. Exceptions should either be handled automatically during data intake using predefined business rules logic or require immediate user intervention to correct any errors.

Providing clean, centralized, and analytics-ready data to end users should not be a one-way process. By allowing end users to focus on high-value analytics, like data mining, network graphs, clustering, etc., they can uncover certain outliers and anomalies in the data. Effective data management should include a feedback loop to communicate these findings and, if necessary, incorporate any changes in the ETL processes, making centralized data management more dynamic and flexible.

The big question to ask is, “what problem are we trying to solve?”  That will help determine the answer to many of the questions, including how you store the data, how you expose the data, and even which data you collect and keep.

Modern Data Warehouse Dictionary

Melissa Coates has put together a glossary of terms for modern data warehousing:

Logical Data Warehouse

A logical data warehouse (LDW) builds upon the traditional DW by providing unified data access to multiple platforms. Conceptually, the logical data warehouse is a view layer that abstractly accesses distributed systems such as relational DBs, NoSQL DBs, data lakes, in-memory data structures, and so forth, consolidating and relating the data in a virtual layer. This availability of data on various platforms adds flexibility to a traditional DW, and speeds up data availability. The tradeoff for this flexibility can be slower performance for user queries, though the full-fledged LDW vendors employ an array of optimization techniques to mitigate performance issues. A logical data warehouse is broader than just data virtualization and distributed processing which can be thought of as enabling technologies. According to Gartner a full-fledged LDW system also involves metadata management, repository management, taxonomy/ontology resolution, auditing & performance services, as well as service level agreement management.

If you’re just getting started with the topic, check this out, as it will probably clear up several concepts.

Where Azure Analysis Services Fits

Melissa Coates explains where Azure Analysis Services fits in common BI architectures:

(2) Data Sources

  • From a single source such as a data warehouse. This is the most traditional path for BI development, and still has a very valid place in many BI/analytics deployments. This scenario puts the work of data integration on the ETL process into the data warehouse, which is the most appropriate place.

  • Directly from various systems.  This can be done, but works well only in specific cases – it definitely won’t work well if there are a lot of highly normalized tables, or if there’s not a straightforward way to relate the disparate data together. Trying to go directly to the source systems & skip an intermediary data warehouse puts the “integration” burden on the data source view in Analysis Services, so plan for plenty of time testing if you’re going to try this route (i.e., it can be much harder, not easier). Note that this option only makes sense if the data is stored in Analysis Services because it needs to be related together somehow (i.e., DirectQuery mode, discussed next in #3, with > 1 data source won’t work if a user tries to combine data sources because the data is not inherently related).

If you’re thinking about Azure Analysis Services, this post is a good one.

A T-SQL Date Dimension

Vladimir Oselsky builds a date dimension in T-SQL:

Before we get into discussing how to create it date dimension and how to use it, first let’s talk about what it is and why do we need it. Depending on who you talk to, people can refer to this concept as “Calendar table” or “Date Dimension,” which is usually found in Data Warehouse. No matter how it is called, at the end of the day, it is a table in SQL Server which is populated with different date/calendar related information to help speed up SQL queries which require specific parts of dates.

In my case, I have created it to be able to aggregate data by quarters, years and month. Depending on how large your requirements are it will add additional complexity to building it. Since I don’t care about holidays (for now at least), I will not be creating holiday schedule which can be complicated to populate.

I love date dimensions, even on non-warehouse databases, because it’s an easy way of providing additional context to time series data.  Think about graphing orders per day in an industry with weekday-versus-weekend trends; a date dimension lets you strip out weekends (maybe plotting them separately) or even lets you build day-of-week analysis for each day, or looking at week of the month, etc.  You might also be interested in computing holidays.

Range-Based Dimensions

Jana Sattainathan has a couple blog posts on range dimensions.  First is durations:

The data is in increments of 300 seconds going from 0 to 31536000 seconds (1 year). So, this table can be used to analyze activities that take less than 1 year. The last row’s Dimension value should be used for everything that takes over one year (or you can generate more rows based on your need).

The second is size ranges:

In the middle there, one of the bar charts is “Backup Count & Duration by Size”. As the title says, this chart helps me determine which backups are small/large and determine how many backups are in each of those “Duration” buckets. The duration bucket that I used in this case could have been easily changed from GB ranges to TB ranges. For example, I filtered the chart to check counts of backups that are over 1 TB.  As one can see, I have a couple of databases that are in the 2.5 to 3 TB backup size range.

Often times, ranges are enough for analysis and that greater detail of a backup being 12.8 GB versus 12.81 GB obscures more useful information.

Semantic Layers

Melissa Coates explains the relevance of Analysis Services as a semantic layer:

Part 1: Why a Semantic Layer Like Azure Analysis Services is Relevant {you are here}

Part 2: Where Azure Analysis Services Fits Into BI & Analytics Architecture {coming soon}

Fundamentally, Analysis Services serves as a semantic layer (see below for further discussion of a semantic layer). Because the business intelligence industry now embraces an array of technology choices, sometimes it seems like a semantic layer is no longer valued like it once was. Well, my opinion is that for many businesses, a semantic layer is tremendously important to support the majority of business users who do *not* want to do their own data wrangling, data prep, and data modeling activities.

We (I) spend so much time thinking about the Brave New World of massive blobs of semi-structured data that it’s a good idea to step back every once in a while and remember that yes, there is a need for sanitized, easy-to-consume data which answers known business questions.  The percentage of people at a company willing to create an R or Python notebook or run a MapReduce job is typically well under 5%.

The Case For Self-Service BI

Matt Allington makes the case for self-service BI:

Success or failure of Enterprise BI can be shown as a continuum.

The 5 sample points I call out (from best to worst) are:

  1. It adds lots of value to lots of people.
  2. It’s OK, lots of “export to Excel”
  3. Some use, but not worth the cost
  4. It is a failure and it is written off
  5. It is a failure but you keep it.

Note what I list as the worst possible outcome.  The solution is no good, and no one does anything about it.  This is much worse than writing it off as a failure as you can’t move on if you don’t accept you have a problem.

This is a provocative article with some good comments.  I’ve mixed emotions about this, as I see Matt’s point and agree with him in the hypothetical scenario, but it’s really easy for business users to get the wrong answers from self-service tools (e.g., introducing hidden cartesian products or not applying all business rules to calculations) and give up on the product.  That might be a function of me doing it wrong and I’ll cop to that if so, but I think that self-service BI needs a “You must be this tall to ride” sign.

Row-Level Security With Reporting Services

Paul Turley discusses combining row-level security, SQL Server Reporting Services, and SQL Server Analysis Services:

In every data source connection string, you can add a simple expression that maps the current Windows username to the CUSTOMDATA property of the data source provider.  This works in SSRS embedded data sources, shared data sources, in a SharePoint Office Data Connecter (ODC) file and in a SharePoint BISM connection file.  In each case, the syntax should be the similar.  Here is my shared data source on the SSRS 2016 report server

This is pretty snazzy.  Paul goes into good detail on the topic, so read the whole thing.

Categories

August 2017
MTWTFSS
« Jul  
 123456
78910111213
14151617181920
21222324252627
28293031