
Category: Warehousing

OLAP On Hadoop

Tim Spann discusses OLAP options on the Hadoop stack:

Apache Kylin

For an introduction to this interesting Hadoop project, check out this article. Apache Kylin, originally from eBay, is a Distributed Analytics Engine that provides SQL and OLAP access to Hadoop datasets using Hive and HBase. It can be called through SparkSQL as well, making for a very useful project. The project lets you work with Power BI, Tableau, and Excel, with more tool support coming soon. You can build MOLAP cubes and support many users with fast queries over billions of rows. Apache Kylin provides JDBC and ODBC drivers.

There are a few interesting options here.
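
As a minimal sketch of what SQL access to a Kylin cube looks like from client code, the following assumes pyodbc plus an ODBC DSN named "Kylin" configured with the Kylin ODBC driver mentioned above; the DSN, credentials, and the sample table (kylin_sales) are placeholders, not anything from the article.

```python
# Minimal sketch: query an Apache Kylin cube over ODBC.
# Assumes an ODBC DSN named "Kylin" configured with the Kylin ODBC driver;
# the DSN, credentials, and table/column names are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=Kylin;UID=ADMIN;PWD=KYLIN", autocommit=True)
cursor = conn.cursor()

# Aggregations like this are answered from the pre-built MOLAP cube
# rather than by scanning the underlying Hive data.
cursor.execute("""
    SELECT part_dt, SUM(price) AS total_sales
    FROM kylin_sales
    GROUP BY part_dt
    ORDER BY part_dt
""")

for part_dt, total_sales in cursor.fetchall():
    print(part_dt, total_sales)

conn.close()
```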


Comparing Impala To Redshift

Mostafa Mokhtar, et al, have a comparison of Apache Impala to Amazon Redshift:

For this analysis, we used TPC-DS on a 3TB dataset and selected 70 out of the 99 queries that run without any modifications or use variants on both Redshift and Impala. We wanted to use a larger dataset (similar to what we’ve used in previous benchmarks), but due to Redshift’s data load times, we had to reduce the data size. (Note: This benchmark is derived from the TPC-DS benchmark and, as such, is not directly comparable to published TPC-DS results.)

This is coming from one of the two vendors, so take it with however many grains of salt you’d like.
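
For a sense of how this kind of comparison is typically run, here is a rough sketch of a query-timing harness (not the benchmark's actual tooling, just an illustration of the general approach); the DSN name and the directory of TPC-DS query files are placeholders.

```python
# Rough sketch of a query-timing harness (not the actual benchmark code).
# Assumes an ODBC DSN for the target engine; DSN and paths are placeholders.
import glob
import time
import pyodbc

conn = pyodbc.connect("DSN=Impala", autocommit=True)
cursor = conn.cursor()

timings = {}
for path in sorted(glob.glob("tpcds_queries/*.sql")):
    with open(path) as f:
        sql = f.read()
    start = time.monotonic()
    cursor.execute(sql)
    cursor.fetchall()  # drain the result set so the full query cost is measured
    timings[path] = time.monotonic() - start

# Slowest queries first.
for path, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{seconds:8.2f}s  {path}")

conn.close()
```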


Automating Data Warehouse Testing

Koos van Strien discusses warehouse testing:

Case: we’ve integrated two sources of customers. We want to add a third source.

Q: How do we know that our current integration and solutions will continue to work while at the same time integrating the new sources?

A: Test it.

Q: How do we get faster deployments and more stability?

A: Automate the tests, so they can run continuously.

This is an interesting concept; do read the whole thing.
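
As one concrete (and entirely hypothetical) flavor of such an automated test, the sketch below uses pytest and pyodbc to reconcile customer counts between a new source system and the integrated warehouse; the connection strings, table names, and the 'Source3' system code are assumptions for illustration, not anything from Koos's post.

```python
# Hypothetical automated reconciliation tests (pytest + pyodbc).
# Connection strings, table names, and the source system code are placeholders.
import pyodbc

SOURCE_CONN = "DSN=Source3"
DWH_CONN = "DSN=Warehouse"


def scalar(conn_str, sql):
    """Run a query that returns a single value and return it."""
    conn = pyodbc.connect(conn_str)
    try:
        return conn.cursor().execute(sql).fetchval()
    finally:
        conn.close()


def test_new_source_customers_are_loaded():
    # Every active customer in the new source should appear in the integrated dimension.
    source_count = scalar(SOURCE_CONN,
                          "SELECT COUNT(*) FROM dbo.Customers WHERE IsActive = 1")
    dwh_count = scalar(DWH_CONN,
                       "SELECT COUNT(*) FROM dbo.DimCustomer WHERE SourceSystem = 'Source3'")
    assert dwh_count == source_count


def test_existing_integration_still_works():
    # Guard the existing sources: no customer keys should be duplicated by the new load.
    duplicates = scalar(DWH_CONN, """
        SELECT COUNT(*) FROM (
            SELECT CustomerKey FROM dbo.DimCustomer
            GROUP BY CustomerKey HAVING COUNT(*) > 1
        ) d""")
    assert duplicates == 0
```

Hooking tests like these into a build server is what turns them from a one-off check into the continuous safety net the post describes.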


Azure SQL Data Warehouse Setup

Arun Sirpal configures a new instance of Azure SQL Data Warehouse:

The information shown here is the DSQL (Distributed SQL) plan – when you send a SQL query to SQL Data Warehouse, the Control node processes the query, converts the code to DSQL, and then sends the command to run on each of the compute nodes.

The returned query plan depicts sequential SQL statements; when the query runs it may involve parallelized operations, so some of the sequential statements shown may run at the same time. More information can be found at the following URL https://msdn.microsoft.com/en-us/library/mt631615.aspx.
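
You can retrieve that DSQL plan yourself with the EXPLAIN statement, which returns the distributed plan as XML instead of executing the query. A minimal sketch, where the server, credentials, and sample query are placeholders:

```python
# Minimal sketch: retrieve the DSQL (distributed) plan for a query without running it.
# EXPLAIN in Azure SQL Data Warehouse returns the plan as a single XML value.
# Server, credentials, and the sample query are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydw;UID=admin_user;PWD=..."
)
cursor = conn.cursor()

cursor.execute("EXPLAIN SELECT COUNT(*) FROM dbo.FactSales")
plan_xml = cursor.fetchone()[0]  # the plan comes back as one XML document
print(plan_xml)

conn.close()
```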

Arun also looks at running a simple Power BI report off of Azure SQL Data Warehouse; click through for that.


Azure SQL Data Warehouse Architecture

Warner Chaves looks at system views in Azure SQL Data Warehouse:

Unlike the sys.dm_exec_requests view in SQL Server, the sys.dm_pdw_exec_requests view actually keeps up to 10000 records with the information of a request even after it has executed. This capability is very useful as you can track specific query executions as long as their records are still among the 10000 kept by the view. As time passes the oldest records are phased out in favor of more recent ones.
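
A quick way to see that history is to query the view directly. The sketch below pulls the most recent requests and their elapsed times; the server, credentials, and database name are placeholders.

```python
# Minimal sketch: pull recent request history from sys.dm_pdw_exec_requests.
# Unlike sys.dm_exec_requests, this view retains up to 10,000 completed requests.
# Server, credentials, and database name are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydw;UID=admin_user;PWD=..."
)
cursor = conn.cursor()

cursor.execute("""
    SELECT TOP 20 request_id, status, submit_time, total_elapsed_time, command
    FROM sys.dm_pdw_exec_requests
    ORDER BY submit_time DESC
""")

for row in cursor.fetchall():
    print(row.request_id, row.status, row.total_elapsed_time, (row.command or "")[:60])

conn.close()
```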

This is an interesting look at some of the differences between Azure SQL Data Warehouse and a “normal” SQL Server installation.  Good reading.


Date Dimension With DAX

Meagan Longoria shows how to create a date dimension using only DAX:

The fiscal calendar assumes calendar months with the fiscal calendar shifted by some number of months. It also assumes that the fiscal year is ahead of the calendar year. That is, fiscal year 2017 starts in the first day of some month in 2016. In my example script, the fiscal year starts in October. If you have some 4-4-5, 4-5-4, or other calendar, the fiscal calendar calculations in this script won’t help you.

To add this date dimension to your SSAS Tabular project:

  1. Right click on the Model.bim file in the Solution Explorer and choose View Code.

  2. If you have at least one table in the model already, locate the end of the definition of the previous table, inside the table array. Add a comma after the end curly brace for the previous table.

  3. Paste the JSON/TMSL from my Gist.

  4. Save and close the file.

  5. Right click on Model.Bim and choose View Designer.

Click through for the script and additional explanation.
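
The fiscal-shift arithmetic the script relies on is easy to get wrong, so here is the same rule as a small Python sketch (not Meagan's DAX, just an illustration of the month-shift logic with an October fiscal year start, as in her example).

```python
# Illustration of the fiscal-shift rule described above (not the DAX from the post).
# With an October start, fiscal year 2017 runs 2016-10-01 through 2017-09-30.
from datetime import date

FISCAL_START_MONTH = 10  # October; change to match your fiscal calendar


def fiscal_year(d: date) -> int:
    # Months on or after the fiscal start month belong to the *next* fiscal year.
    return d.year + 1 if d.month >= FISCAL_START_MONTH else d.year


def fiscal_month(d: date) -> int:
    # October -> 1, November -> 2, ..., September -> 12.
    return (d.month - FISCAL_START_MONTH) % 12 + 1


print(fiscal_year(date(2016, 10, 1)), fiscal_month(date(2016, 10, 1)))   # 2017 1
print(fiscal_year(date(2017, 9, 30)), fiscal_month(date(2017, 9, 30)))   # 2017 12
print(fiscal_year(date(2017, 10, 1)), fiscal_month(date(2017, 10, 1)))   # 2018 1
```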


Azure SQL Data Warehouse Plans

Grant Fritchey shows how to build an execution plan for an Azure SQL Data Warehouse query:

So now we just save this as a .sqlplan file and open it in SSMS, right?

Nope!

See, that’s not a regular execution plan, at all. Instead, it’s a D-SQL plan. It’s not the same as our old execution plans. You can’t open it as a graphical plan (and no, not even in that very popular 3rd party tool, I tried). You will have to learn how to read these plans differently because, well, they are different.

That’s an unfortunate outcome.  Reading is hard…
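
If you do end up staring at one of these, the plan is just XML, so a small script can at least enumerate its steps while you learn the format. The sketch below walks whatever elements are present rather than assuming a fixed schema; the file name is a placeholder.

```python
# Rough sketch: list the steps in a saved D-SQL plan file.
# The plan is XML, so we simply walk the tree and print each element with its
# attributes, indented by nesting depth. "query.dsql" is a placeholder file name.
import xml.etree.ElementTree as ET


def walk(elem, depth=0):
    # Print the element name plus its attributes at each level of the plan.
    print("  " * depth + elem.tag, elem.attrib if elem.attrib else "")
    for child in elem:
        walk(child, depth + 1)


root = ET.parse("query.dsql").getroot()
walk(root)
```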


Documenting A Data Warehouse

Jesse Seymour discusses a few forms of documentation for a data warehouse:

Extended properties are a great way to internally document the data warehouse.  The key advantage here is that the values of these extended properties can be retrieved with a T-SQL query.  This allows us to access this information with a view as needed.  My favorite method of using this is to create an SSRS report that end users can run to look up the attributes and comments I store in the extended property.  Data warehouse tools take some of the pain out of the process.  Unfortunately, not all tools support use of extended properties.  Make sure your tool does or consider changing tools.  Be sure to document the names and use cases for each property you create.  Consistency is the key to the value here.

I’ve never been a big fan of extended properties, mostly because I typically don’t work with tools which expose that information easily.  Regardless, there are other important forms of documentation, so read on.
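
If your tooling does not surface them, the properties are still only a catalog view away. The sketch below reads table-level extended properties from sys.extended_properties; the property name 'Description' and the connection details are assumptions for illustration.

```python
# Minimal sketch: read table-level extended properties from the catalog.
# The property name 'Description' and connection details are assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=MyWarehouse;Trusted_Connection=yes"
)
cursor = conn.cursor()

cursor.execute("""
    SELECT s.name AS schema_name, t.name AS table_name,
           CAST(ep.value AS NVARCHAR(MAX)) AS property_value
    FROM sys.extended_properties AS ep
    JOIN sys.tables  AS t ON ep.major_id = t.object_id AND ep.minor_id = 0
    JOIN sys.schemas AS s ON t.schema_id = s.schema_id
    WHERE ep.class = 1              -- object-level properties
      AND ep.name = 'Description'
    ORDER BY s.name, t.name
""")

for row in cursor.fetchall():
    print(f"{row.schema_name}.{row.table_name}: {row.property_value}")

conn.close()
```

A query like this is also the natural backing data set for the SSRS lookup report Jesse describes.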


Migrating To Azure SQL Data Warehouse

Rangarajan Srirangam has a detailed article on steps you should take when migrating a database to Azure SQL Data Warehouse:

This article focuses on migrating data to Azure SQL Data Warehouse with tips and techniques to help you achieve an efficient migration. Once you understand the steps involved in migration, you can practice them by following a running example of migrating a sample database to Azure SQL Data Warehouse.

Migrating your data to Azure SQL Data Warehouse involves a series of steps. These steps are executed in three logical stages: Preparation, Metadata migration and Data migration.

It’s a lengthy read, but well worth it.


Getting Started With Azure SQL Data Warehouse

Warner Chaves looks at Azure SQL Data Warehouse:

The first thing to keep in mind is that ASDW was designed to be a cloud based system. As such, it aims to be very flexible for resource allocation and very efficient to scale up or down. To meet those goals, the system allows you to:

  • Increase or decrease compute power, represented by Data Warehousing Units.

  • Grow the amount of storage, which is charged independently from the compute power.

  • Pause compute entirely, so that only storage is paid for at that point.

Warner also has a brand new Pluralsight course on the topic.
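
Scaling the compute side is itself just a T-SQL command against the logical server's master database; a minimal sketch, where the server name, credentials, database name, and target DWU level are all placeholders:

```python
# Minimal sketch: change the DWU level of an Azure SQL Data Warehouse.
# Run against the master database of the logical server; server, credentials,
# database name, and the target service objective are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=master;UID=admin_user;PWD=...",
    autocommit=True,  # ALTER DATABASE cannot run inside a user transaction
)

conn.cursor().execute(
    "ALTER DATABASE mydw MODIFY (SERVICE_OBJECTIVE = 'DW400')"
)

conn.close()
```

Pausing compute, by contrast, is done through the portal, PowerShell, or the REST API rather than T-SQL.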
