Category: Warehousing

Data Lineage and SSIS

Published 2020-09-04 by Kevin Feasel

Aveek Das has a two-parter. First up is a discussion of data lineage:

In this article, I am going to explain what Data Lineage in ETL is and how to implement the same. In this modern world, where companies are dealing with a humongous amount of data every day, there also lies a challenge to efficiently manage and monitor this data. There are systems that generate data every second and are being processed to a final reporting or monitoring tool for analysis. In order to process this data, we use a variety of ETL tools, which in turn makes the data transformation possible in a managed way.
While transforming the data in the ETL pipeline, it has to go through multiple steps of transformations in order to achieve the final result. For example, when the ETL receives the raw data from the source, there may be operations applied to it like filtering, sorting, merging, or splitting two columns, etc. There can also be aggregations or other calculations made on this raw data before finally moving into a data warehouse or preparing it for reporting. In order to be able to detect what the source of a particular record is, we need to implement something known as Data Lineage. It is a piece of simple metadata information that helps us detect gaps in the data processing pipeline and enables us to fix issues later.

Part two covers data lineage with SQL Server Integration Services:

In this article, I am going to discuss SSIS data lineage concepts, which are often used while designing ETL workloads on a data warehouse. Although this article is focused on implementing data lineage using SSIS, it does not only confine to SSIS but to any ETL tools in the market using which data is moved from one source to a destination. In my previous article, Understanding Data Lineage in ETL, I have already discussed the generic importance of data lineage concepts for any ETL tool. I would definitely suggest you have a look at it if you want to understand in general how data lineage helps to track the source of a single record in the warehouse.

If you’re fairly new to this world, it’s a good introduction to an important topic.

Comments closed

A Review of Azure Synapse Analytics

Published 2020-08-24 by Kevin Feasel

Teo Lachev looks at Azure Synapse Analytics:

There is plenty to like in Azure Synapse which is the evaluation of Azure SQL DW. If you’re tasked to implement a cloud-based data warehouse, you have a choice among three Azure SQL Server-based PaaS offerings, including Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse. In a nutshell, Azure SQL Database and Azure SQL MI are optimized for OLTP workloads. For example, they have full logging enabled and replicate each transaction across replicas. Full logging is usually a no-no for decent size DW workloads because of the massive ETL changes involved.
In addition, to achieve good performance, you’ll find yourself moving up the performance tiers and toward the price point of the lower Azure Synapse SKUs. Not to mention that unlike Azure SQL Database, Azure Synapse can be paused, such as when reports hit a semantic layer instead of DW, and this may offer additional cost cutting options.

Teo focuses primarily on SQL Pools and the more SQL-friendly side of things (ELT and Power BI) rather than Spark pools.

Comments closed

SQL Serverless in Azure Synapse Analytics

Published 2020-08-11 by Kevin Feasel

James Serra talks to us about SQL serverless (presently known as SQL on-demand but I’m getting ahead of the marketing curve this time):

Querying data in ADLS Gen2 storage using T-SQL is made easy because of the OPENROWSET function with additional capabilities (check out the T-SQL that is supported). The currently supported file types in ADLS Gen2 that SQL-on-demand can use are Parquet, CSV, and JSON. ParquetDirect and CSV 2.0 add performance improvements (see Benchmarking Azure Synapse Analytics – SQL Serverless, using .NET Interactive). You can also query folders and multiple files and use file metadata in queries.

Read on to learn a lot more about its use cases.

Comments closed

Historical Dimensions in a Kimball-Style Model

Published 2020-08-07 by Kevin Feasel

Vince Iacoboni takes a stab at improving the Kimball model:

We owe a lot to Ralph Kimball and friends. His practical warehouse design and conformed-dimension bus architecture are the industry standard. Business users can understand and query these warehouses directly and gain valuable insights into the business. Kimball’s practical approach focuses squarely on clarity and ease of use for the business users of the warehouse. Kudos to you and yours, Mr. Kimball.
That said, can the mainstay Type 2 slowly changing dimension be improved? I here present the concept of historical dimensions as a way to solve some issues with the basic Type 2 slowly changing dimension promoted by Kimball. As we will see, clearly distinguishing between current and past dimension values pays off in clarity of design, flexibility of presentation, and ease of ETL maintenance.

As I was reading this, I was thinking “This sounds like a type 4 SCD” and Vince walks us through the differences between the two ideas. I’m not absolutely sold on the idea, but it is certainly interesting.

Comments closed

Managing Lakehouse Data

Published 2020-08-04 by Kevin Feasel

Harsha Gummadavelli gives us an introduction to the Data Lakehouse concept:

“Data Lakehouse” is a new architecture paradigm in the data management space that combines the best characteristics of Data Warehouse and Data Lakes. Once you load the data into a data lake, there is no need to load the data into a warehouse for additional analysis or business intelligence. You can directly query the data residing in cheaper but highly reliable storage, often termed as “Object Stores”, thus reducing the operational overhead on data pipelines.

I will say that I’m not particularly sold on the data lakehouse concept at this point. It’s interesting, in that it reduces the number of systems to maintain by one, but I do wonder about performance issues when trying to replace an existing warehouse. The post turns into a marketing pitch for Informatica, but the first half does give a fair introduction to the concept.

Comments closed

Trust and Warehouse Data

Published 2020-07-14 by Kevin Feasel

Rob Farley explains one way that people might lose trust in your warehouse data:

The scenario is that there’s a source system, and there’s a table in a warehouse that is being used to report on it. Maybe it’s being populated by Integration Services or Data Factory. Maybe it’s being populated by T-SQL. I don’t really care. What I care about is whether the data in the warehouse is a true representation of what’s in the source system.
If it’s not a true representation, then we have all kinds of problems.
Mostly, that our warehouse is rubbish.

Read on for an example of how this might occur and what you can do to prevent it.

Comments closed

A Brief Overview of Azure Synapse Analytics

Published 2020-07-13 by Kevin Feasel

Ginger Grant gives us the nickel tour of why Azure Synapse Analytics is interesting:

In the past few months, I have been examining Azure Synapse and what it can do. When it was first released in November of 2019, the first functionality that was rolled out was an update of Azure SQL DW. For this reason, many people think that Synapse is just an improved version of a cloud data warehouse. Microsoft did improve SQL DW when it moved it to Synapse. The biggest architectural design change is the separation of the code from the compute, a theme with many web projects, which allows the compute power to be increased when need dictates and scaled down when computing needs change. Within Synapse, resources are allocated as Pools and you can define a sql pools to run data warehouse and later change the compute to a different resource. You will still need to partition your DW as large datasets require partitioning to perform well. Subsequently Microsoft Released the Azure Synapse Studio to be a container for a larger environment of tools and notebooks to interact with them.

But it’s more than that. Read on to see what else is available.

Comments closed

Query Scheduling with Apache Hive

Published 2020-07-09 by Kevin Feasel

Zoltan Haindrich and Jesus Camacho Rodriguez walk us through scheduled queries in Apache Hive:

To fulfill that purpose, recently Apache Hive introduced a new feature called scheduled queries. Using SQL statements, users can schedule Hive queries to run on a recurring basis, monitor their progress, and optionally disable a query schedule.
In a nutshell, every scheduled query in Hive consists of (i) a unique name to identify the schedule, (ii) the actual SQL statement to be executed, and (iii) the schedule at which the query should be executed defined by a Quartz cron expression. In addition, a scheduled query belongs to a namespace, i.e., a collection of HiveServer2 instances that are responsible to execute the query.

Read on for examples of how you might create, use, and learn about scheduled queries running on a system.

Comments closed

Role-Playing Dimensions in Power BI

Published 2020-06-29 by Kevin Feasel

Martin Schoombee explains the concept of role-playing dimensions and then explains how that works in the Power Bi world:

In technical terms a role-playing dimension is when a dimension table has multiple (foreign key) relationships to the same fact table. In more non-technical terms, it is when you have the same attribute (“Date” for instance) that can relate to the same metrics in different ways.
If you look at the data model below as an example, you can see the Date entity could be used to reference either the Invoice Date or Delivery Date from the Sales entity, changing the perspective of the metrics we’re looking at.

It’s a little surprising to me that there isn’t an easier way to handle this concept. Role-playing dimensions are a core part of the Kimball model, and they’re common enough that you’d expect support to be a bit simpler.

Comments closed

Thoughts on Snowflake Database Provisioning

Published 2020-06-25 by Kevin Feasel

David Stelfox takes us through some thoughts on provisioning instances of Snowflake:

For this example, I’ve chosen an open dataset of 2017 taxi rides in New York City. There are a few options for interacting with Snowflake: a dialog box approach in the web-based GUI, using SQL statements in the Worksheets tab in the GUI or a CLI called SnowSQL. For this example, I used SQL statements as I find them easier to follow what’s happening. Once you have set up your account (or trial) and logged in, you need to create your first database.

Click through for some how-to as well as thoughts about cost and performance.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31