Star-Schema Benchmark With Hive + Druid

Carter Shanklin and Slim Bouguerra run a Hadoop OLAP system running Hive and Druid against the Star-Schema Benchmark battery of queries:

How did we arrive at the query used to build the OLAP index? There is a systematic procedure:

  1. The union of all dimensions used by the SSB queries is included in the index.
  2. The union of all measures is included in the index. Notice that we pre-compute some products in the index.
  3. Druid requires a timestamp, so the date of the transaction is used as the timestamp.

You can see that building the index requires knowledge of the query patterns. Either an expert in the query patterns architects the index, or a tool is needed to analyze queries or to dynamically build indexes on the fly. A lot of time can be spent in this architecture phase, gathering requirements, designing measures and so on, because changing your mind after-the-fact can be very difficult.

One thing I don’t like so much is that they removed the ORDER BY clauses from some of the queries, as making this change makes it more difficult to use these results for “it’s totally not a comparison so don’t sue us Oracle” purposes.

Azure SQL DW Reference Architectures

James Serra shows how Azure SQL Data Warehouse can fit into various warehousing architectures:

Do staging, data refinement and reporting all from SQL DW.  You can scale compute power up when needed (i.e. during staging, data refinement, or large number of users doing reporting) or down to save costs (i.e. nights and weekends when user reporting is low).  The pros of this option are by reducing the number of technologies you are building a simpler solution and reducing the number of copies of the data.  The cons are since everything is done on SQL DW you can have performance issues (i.e. doing data refinement while users are reporting), can hit the SQL DW concurrent query limit, and can have a higher cost since SQL DW is the highest-cost product, especially if you are not able to pause it.  Pausing it reduces your cost to zero for compute, only having to pay for storage (see Azure SQL Data Warehouse pricing), but no one can use SQL DW when paused

Click through for three other architecture ideas.

Loading Azure SQL DW: Embrace The CTAS

John Hoang, Joe Sack, and Martin Lee explain different loading patterns for loading Azure SQL Data Warehouse:

SQL Data Warehouse supports many loading methods, including SSIS, BCP, the SQLBulkCopy API, and Azure Data Factory (ADF). These methods all share a common pattern for data ingestion. By comparison, the PolyBase technology uses a different approach that provides better performance.

PolyBase is by far the fastest and most scalable SQL Data Warehouse loading method to date, so we recommend it as your default loading mechanism. PolyBase is a scalable, query processing framework compatible with Transact-SQL that can be used to combine and bridge data across relational database management systems, Azure Blob Storage, Azure Data Lake Store and Hadoop database platform ecosystems (APS only).

If you’re curious about Azure SQL Data Warehouse, this is a must-read.

Designing A Data Warehouse Test Plan

Koos van Strien walks through some of the high-level concepts when automating data warehouse tests:

In my current project, I’ve got a database containing everything to perform these tests:

  • Tables with identical structure to the ones in the staging area (plus two columns “TestSuiteName” and “TestName”)
  • A table containing the mapping from test-input table to target database, schema and table
  • A stored procedure to purge the DWH (all layers) in the test environment
  • A stored procedure to insert the data for a specific testsuite / name

When preparing a specific test case (the “insert rows for test case” step from the diagram above), the rows needed for that case are copied into the DWH:

Testing warehouses is certainly not a trivial exercise but given how complex warehouse ETL tends to be, having good tests reduces the number of 3 AM pages.

Azure SQL Data Warehouse Max Size Property

Joey D’Antoni says not t freak out if you look at the Max Size property on an Azure SQL Data Warehouse database:

One of the promises of Azure SQL Data Warehouse is the ability to have petabyte scale. The ability to quickly scale data, and have that data scale independently of compute resources. So when one I my clients emailed me yesterday with this screenshot, needless to say I was concerned.

As you can see, when  the properties screen shows a max size of 5 GB.

Click through for the reason why.

Integrating Data Lake Storage With SQL Data Warehouse

Sachin Sheth alerts us to a new integration point between Azure Data Lake Storage and Azure SQL Data Warehouse via Polybase:

Most common patterns using Azure Data Lake Store (ADLS) involve customers ingesting and storing raw data into ADLS. This data is then cooked and prepared by analytic workloads like Azure Data Lake Analytics and HDInsight. Once cooked this data is then explored using engines like Azure SQL Data Warehouse. One key pain point for customers is having to wait for a substantial time after the data was cooked to be able to explore it and gather insights. This was because the data stored in ADLS would have to be loaded into SQL Data Warehouse using tools row-by-row insertion. But now, you don’t have to wait that long anymore. With the new SQL Data Warehouse PolyBase support for ADLS, you will now be able to load and access the cooked data rapidly and lessen your time to start performing interactive analytics. PolyBase support will allow to you access unstructured/semi-structured files in ADLS faster because of a highly scalable loading design. You can load the files stored in ADLS into SQL Data Warehouse to perform analytics with fast response times or you use can the files in ADLS as external tables. So get ready to unlock the value stored in your petabytes of data stored in ADLS.

I’ve been waiting for this support, and I’m happy that they were able to integrate the two products.

Automating Stats Maintenance With Azure SQL DW

Grant Fritchey shows how to create automated statistics maintenance for an Azure SQL Data Warehouse database:

NOTE: The most important habit you can start with in Azure is putting everything into discrete, planned, Resource Groups. These make management so much easier.

Once the account is set, the first thing you need is to create a Runbook. There is a collection of them for your use within Azure. None of them are immediately applicable for what I need. I’m just writing a really simple Powershell script to do what I want:

Runbooks are an important part of Azure maintenance, and this is a gentle introduction to them.

Modern Data Warehouse Dictionary

Melissa Coates has put together a glossary of terms for modern data warehousing:

Logical Data Warehouse

A logical data warehouse (LDW) builds upon the traditional DW by providing unified data access to multiple platforms. Conceptually, the logical data warehouse is a view layer that abstractly accesses distributed systems such as relational DBs, NoSQL DBs, data lakes, in-memory data structures, and so forth, consolidating and relating the data in a virtual layer. This availability of data on various platforms adds flexibility to a traditional DW, and speeds up data availability. The tradeoff for this flexibility can be slower performance for user queries, though the full-fledged LDW vendors employ an array of optimization techniques to mitigate performance issues. A logical data warehouse is broader than just data virtualization and distributed processing which can be thought of as enabling technologies. According to Gartner a full-fledged LDW system also involves metadata management, repository management, taxonomy/ontology resolution, auditing & performance services, as well as service level agreement management.

If you’re just getting started with the topic, check this out, as it will probably clear up several concepts.

Free Trial Of Azure SQL Data Warehouse

James Serra notes that there is a free one-month trial of Azure SQL Data Warehouse:

You can use this one month free trial to do POCs and try out SQL DW up to 200 DWU and 2TB of data.  You must sign up by December 31st 2016.  Please note that once the one month free trial is over, you will start getting billed at general availability pricing rates.  For more information on the free trial, and to sign up, go here.

This is great because you can quickly run out of credits otherwise.

OLAP On Hadoop

Tim Spann discusses OLAP options on the Hadoop stack:

Apache Kylin

For an introduction to this interesting Hadoop project, check out this article.   Apache Kylin originally from eBay, is a Distributed Analytics Engine that provides SQL and OLAP access to Hadoop datasets utilizing Hive and HBase.   It can use called through SparkSQL as well making for a very useful project.   This project let’s you work with PowerBI, Tableau and Excel with more tool support coming soon.    You can doMOLAP cubes and support many users with fast queries over billions of rows.   Apache Kylin provides JDBC and ODBC drivers.

There are a few interesting options here.


June 2017
« May