Press "Enter" to skip to content

Category: Warehousing

Real-Time Data Warehousing in Cloudera

Justin Hayes gives us an overview of using Cloudera Data Platform for real-time data warehousing:

The simplest way to describe a RTDW is that it looks and feels like a normal data warehouse, but everything is faster even while massive scale is maintained. It is a type of data warehouse modernization that lets you have “small data” semantics and performance at “big data” scale.

– the data arrives into the warehouse faster – think streams of many millions of events per second constantly arriving

– the time it takes for the data to be optimally queryable is faster – query immediately upon arrival with no need for processing or aggregation or compaction

– the speed at which queries run is faster – small, selective queries are measured in 10s or 100s of milliseconds; large, scan- or compute-heavy queries are processed at very high bandwidth

– mutations of the data, when needed, are fast – if data needs to be corrected or updated for whatever reason, this can be done in place without large rewrites

Read on for more.

Comments closed

Tips for Optimizing Dedicated SQL Pools in Synpase Analytics

Tsuyoshi Matsuzaki shares some tips for improving query performance when using Dedicated SQL Pools in Azure Synapse Analytics:

By above BROADCAST_MOVE operation, the rows in dimension_City table are all copied in a temporary table (called TEMP_ID_3) on all distributed database. (See below.)
Since the size of dimension_City is small, then all rows in this table is duplicated in all database before joining. This time, we join only 2 tables, however, if a lot of tables are needed to join, this data movement will become large overhead for query execution.

The short version is, replicate smaller dimensions and align distribution keys for large tables which get joined together. Both of these minimize the changes of the engine needing to shuffle data between nodes. These sorts of things can make a huge difference when working with Dedicated SQL Pools, cutting query time down by an order of magnitude in some extreme cases.

Comments closed

Integrating Power BI into Azure Synapse Analytics

Ginger Grant walks us through two methods of integrating Power BI and Azure Synapse Analytics:

From within Synapse you have the ability to access a Power BI workspace so that you can use Power BI from within Synapse.  Your Power BI tenant can be in a different data center than the Azure Synapse Workspace, but they both must be in the same Power BI Tenant.  You can use Power BI to look at any data you wish, as the data you use can be from any location. When this blog was written, it was only possible to connect to one Power BI workspace from within Azure Synapse. In order to run Power BI as shown here, first I needed to create a Linked Service from within Synapse.

Read on for more.

Comments closed

Azure Synapse Analytics Sample Datasets and Scripts

James Serra shows us where to find samples for Azure Synapse Analytics:

Datasets: A bunch of datasets that when added will show up under Data -> Linked -> Azure Blob Storage.  You can then choose an action (via “…” next to any of the containers in the dataset) and choose New SQL script -> Select TOP 100 rows to examine the data as well as choose “New notebook” to load the data into a Spark dataframe.  Any dataset you add is a linked service to files in a blob storage container using SAS authentication.  You can also create an external table in a SQL on-demand pool or SQL provisioned pool to each dataset via an action (via “…” next to “External tables” under the database, then New SQL script -> New external table) and then query it or insert the data into a SQL provisioned database

Click through to learn more, as well as a few other things you can do with Synapse Analytics.

Comments closed

An Introduction to Data Vault

Tino Zishiri walks us through the basics of the Data Vault modeling technique:

The Data Vault methodology also addresses a common limitation that relates to the dimensional model approach. There are many good things to say about dimensional modelling, it’s a perfect fit for doing analytics, it’s easy for business analysts to understand, it’s performant over large sets of data, the list goes on.

That said, the data vault methodology addresses the limitations of having a “fixed” model. Dimensional modelling’s resilience to change or “graceful extensibility”, as some would say, is well documented. It’s capable of handling changing data relationships which can be implemented without affecting existing BI apps or query results. For example, facts consistent with the grain of an existing fact table can be added by creating new columns. Moreover, dimensions can be added to an existing fact table by creating new foreign key columns, presuming they don’t alter the fact table’s grain.

The most interesting thing to me about Data Vault is that it’s very popular in Europe and almost unheard-of in North America. That’s the impression I get, at least.

Comments closed

Data Lineage and SSIS

Aveek Das has a two-parter. First up is a discussion of data lineage:

In this article, I am going to explain what Data Lineage in ETL is and how to implement the same. In this modern world, where companies are dealing with a humongous amount of data every day, there also lies a challenge to efficiently manage and monitor this data. There are systems that generate data every second and are being processed to a final reporting or monitoring tool for analysis. In order to process this data, we use a variety of ETL tools, which in turn makes the data transformation possible in a managed way.

While transforming the data in the ETL pipeline, it has to go through multiple steps of transformations in order to achieve the final result. For example, when the ETL receives the raw data from the source, there may be operations applied to it like filtering, sorting, merging, or splitting two columns, etc. There can also be aggregations or other calculations made on this raw data before finally moving into a data warehouse or preparing it for reporting. In order to be able to detect what the source of a particular record is, we need to implement something known as Data Lineage. It is a piece of simple metadata information that helps us detect gaps in the data processing pipeline and enables us to fix issues later.

Part two covers data lineage with SQL Server Integration Services:

In this article, I am going to discuss SSIS data lineage concepts, which are often used while designing ETL workloads on a data warehouse. Although this article is focused on implementing data lineage using SSIS, it does not only confine to SSIS but to any ETL tools in the market using which data is moved from one source to a destination. In my previous article, Understanding Data Lineage in ETL, I have already discussed the generic importance of data lineage concepts for any ETL tool. I would definitely suggest you have a look at it if you want to understand in general how data lineage helps to track the source of a single record in the warehouse.

If you’re fairly new to this world, it’s a good introduction to an important topic.

Comments closed

A Review of Azure Synapse Analytics

Teo Lachev looks at Azure Synapse Analytics:

There is plenty to like in Azure Synapse which is the evaluation of Azure SQL DW. If you’re tasked to implement a cloud-based data warehouse, you have a choice among three Azure SQL Server-based PaaS offerings, including Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse. In a nutshell, Azure SQL Database and Azure SQL MI are optimized for OLTP workloads. For example, they have full logging enabled and replicate each transaction across replicas. Full logging is usually a no-no for decent size DW workloads because of the massive ETL changes involved.

In addition, to achieve good performance, you’ll find yourself moving up the performance tiers and toward the price point of the lower Azure Synapse SKUs. Not to mention that unlike Azure SQL Database, Azure Synapse can be paused, such as when reports hit a semantic layer instead of DW, and this may offer additional cost cutting options.

Teo focuses primarily on SQL Pools and the more SQL-friendly side of things (ELT and Power BI) rather than Spark pools.

Comments closed

SQL Serverless in Azure Synapse Analytics

James Serra talks to us about SQL serverless (presently known as SQL on-demand but I’m getting ahead of the marketing curve this time):

Querying data in ADLS Gen2 storage using T-SQL is made easy because of the OPENROWSET function with additional capabilities (check out the T-SQL that is supported). The currently supported file types in ADLS Gen2 that SQL-on-demand can use are Parquet, CSV, and JSON. ParquetDirect and CSV 2.0 add performance improvements (see Benchmarking Azure Synapse Analytics – SQL Serverless, using .NET Interactive). You can also query folders and multiple files and use file metadata in queries.

Read on to learn a lot more about its use cases.

Comments closed

Historical Dimensions in a Kimball-Style Model

Vince Iacoboni takes a stab at improving the Kimball model:

We owe a lot to Ralph Kimball and friends. His practical warehouse design and conformed-dimension bus architecture are the industry standard. Business users can understand and query these warehouses directly and gain valuable insights into the business. Kimball’s practical approach focuses squarely on clarity and ease of use for the business users of the warehouse. Kudos to you and yours, Mr. Kimball.

That said, can the mainstay Type 2 slowly changing dimension be improved? I here present the concept of historical dimensions as a way to solve some issues with the basic Type 2 slowly changing dimension promoted by Kimball. As we will see, clearly distinguishing between current and past dimension values pays off in clarity of design, flexibility of presentation, and ease of ETL maintenance.

As I was reading this, I was thinking “This sounds like a type 4 SCD” and Vince walks us through the differences between the two ideas. I’m not absolutely sold on the idea, but it is certainly interesting.

Comments closed

Managing Lakehouse Data

Harsha Gummadavelli gives us an introduction to the Data Lakehouse concept:

“Data Lakehouse” is a new architecture paradigm in the data management space that combines the best characteristics of Data Warehouse and Data Lakes. Once you load the data into a data lake, there is no need to load the data into a warehouse for additional analysis or business intelligence. You can directly query the data residing in cheaper but highly reliable storage, often termed as “Object Stores”, thus reducing the operational overhead on data pipelines.

I will say that I’m not particularly sold on the data lakehouse concept at this point. It’s interesting, in that it reduces the number of systems to maintain by one, but I do wonder about performance issues when trying to replace an existing warehouse. The post turns into a marketing pitch for Informatica, but the first half does give a fair introduction to the concept.

Comments closed