
Category: Warehousing

Exposing Kafka Data in Iceberg using Tableflow

Marc Selwan announces a new product:

We’re excited to talk about our vision for Tableflow, which makes it push-button simple to take Apache Kafka® data and feed it directly into your data lake, warehouse, or analytics engine as Apache Iceberg® tables. Making operational data accessible to the analytical world is traditionally a complex, expensive, and brittle process and we believe we can do better to unify the operational and analytical estates.

Tableflow removes all this erroneous, duplicative work and helps convert Kafka topics and associated schemas to Iceberg tables in one click. This is central to Confluent’s vision to build the world’s leading data streaming platform that fuels any operational and analytical workload with real-time data products.

It looks like this is currently in early access, but you can see where Confluent intends to take the product.
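
Once a topic lands as an Iceberg table, any Iceberg-aware engine should be able to query it with plain SQL. A minimal sketch, assuming a Spark SQL session with an Iceberg catalog named tableflow and a table prod.orders (all names here are hypothetical, not from the announcement):

    -- Hypothetical: query a Kafka topic that Tableflow has exposed as an Iceberg table.
    -- Catalog, namespace, table, and column names are placeholders.
    SELECT order_id,
           customer_id,
           order_total
    FROM   tableflow.prod.orders          -- Iceberg table backed by the topic's schema
    WHERE  order_ts >= TIMESTAMP '2024-01-01 00:00:00'
    ORDER  BY order_ts DESC
    LIMIT  100;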


Dropping Objects in SQL Server and Snowflake

Kevin Wilkie gets the drop on us:

When you’re working between SQL Server and Snowflake, there can be a lot of crossover that may make you forget what system you’re working in. Sometimes it’s close, but not close enough.

Today, let’s go over something that should be rather simple – removing old objects that we shouldn’t need any longer.

Read on to see how the two data platform technologies differ in this regard.
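
As a rough illustration of the kind of divergence involved: both platforms accept IF EXISTS on DROP, but Snowflake additionally keeps dropped tables recoverable via UNDROP within the Time Travel retention window. Object names below are placeholders:

    -- SQL Server (2016+): DROP ... IF EXISTS avoids an error when the object is already gone.
    DROP TABLE IF EXISTS dbo.StagingOrders;
    DROP PROCEDURE IF EXISTS dbo.LoadStagingOrders;

    -- Snowflake: same IF EXISTS syntax, but the table stays recoverable during retention.
    DROP TABLE IF EXISTS STAGING.ORDERS;
    UNDROP TABLE STAGING.ORDERS;   -- restores it if it was dropped by mistake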


Visualizing Genomics Data with Streamlit in Snowflake

Rebecca O’Connor builds an app:

The blog describes how this data is complemented with the following two additional data sets:

  • An Annotation dataset
  • A Panel Dataset

Simple SQL queries are then used to gain answers to a multitude of questions held within the vast amount of data.

I utilised the same datasets in order to create a Streamlit app.

Click through for the code. This is the reason why I like Streamlit so much: you can build an interactive data-centric application very easily. Granted, you can abuse Streamlit pretty hard, but it is powerful.
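
As a flavor of the “simple SQL queries” the post mentions, something like the following sketch would count variants per gene by joining genotype calls to the annotation dataset. Every table and column name here is a hypothetical stand-in, not taken from the article:

    -- Hypothetical schema: variants per gene on one chromosome,
    -- joining genotype calls to the annotation dataset.
    SELECT a.gene_name,
           COUNT(*) AS variant_count
    FROM   genotypes g
    JOIN   annotations a
           ON  g.chromosome = a.chromosome
           AND g.position BETWEEN a.start_position AND a.end_position
    WHERE  g.chromosome = '22'
    GROUP  BY a.gene_name
    ORDER  BY variant_count DESC;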


Data Vault 2.0 Models in Microsoft Fabric

Michael Olschimke and Dmytro Polishchuk continue a series:

The last article in this blog series discussed the basic entity types in Data Vault 2.0: hubs, links and satellites. While it would be theoretically possible to limit a model to just these three basic entity types, the resulting Data Vault model would be inefficient: it would most likely consume too much storage, be less efficient due to the many joins, and require a number of grain shifts during information delivery. This is due to certain characteristics in the data that require special treatment.

For these characteristics, Data Vault 2.0 provides special entity types that deal with the specialities. This article focuses on two of them: the non-historized link, which is used to capture transactions and events, and the multi-active satellite, which is used to model multiple active descriptions for the same parent hub or link in the same load.

Read on for an example of how to implement this in a Microsoft Fabric warehouse.
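
As a rough T-SQL sketch of the two entity types (table names, hash keys, and attributes are illustrative, not from the article): the non-historized link carries its descriptive attributes directly and is never end-dated, while the multi-active satellite allows several current rows per parent key, distinguished by a subsequence attribute.

    -- Non-historized link: one row per transaction/event, so no change tracking or end-dating.
    CREATE TABLE dv.link_payment_t (
        link_payment_hk   CHAR(32)      NOT NULL,   -- hash key of the transaction
        hub_customer_hk   CHAR(32)      NOT NULL,
        hub_account_hk    CHAR(32)      NOT NULL,
        load_date         DATETIME2(3)  NOT NULL,
        record_source     VARCHAR(100)  NOT NULL,
        payment_amount    DECIMAL(18,2) NOT NULL,   -- descriptive attributes live on the link
        payment_ts        DATETIME2(3)  NOT NULL
    );

    -- Multi-active satellite: several active rows per parent hub key in the same load,
    -- distinguished by a subsequence attribute (here, phone_type).
    CREATE TABLE dv.sat_customer_phone_ma (
        hub_customer_hk   CHAR(32)      NOT NULL,
        load_date         DATETIME2(3)  NOT NULL,
        phone_type        VARCHAR(20)   NOT NULL,   -- multi-active key
        phone_number      VARCHAR(50)   NOT NULL,
        hashdiff          CHAR(32)      NOT NULL,
        record_source     VARCHAR(100)  NOT NULL
    );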


Create and Connect to a Fabric Data Warehouse

Olivier Van Steenlandt builds a warehouse:

In this data recipe series, Microsoft Fabric – Data Warehouse will be explored. As a starting point, a blank Fabric workspace is used. You can sign up for a free Fabric trial by using the following URL: Data Analytics | Microsoft Fabric

In this data recipe, we will create a brand-new Data Warehouse in Fabric. Once created, we will connect to our Data Warehouse using Azure Data Studio.

Click through for the step-by-step process.
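
Once Azure Data Studio is connected to the warehouse’s SQL connection string, a quick sanity check is ordinary T-SQL; the table below is purely illustrative:

    -- Run against the new Fabric warehouse after connecting from Azure Data Studio.
    CREATE TABLE dbo.ConnectionTest
    (
        TestId      INT          NOT NULL,
        TestMessage VARCHAR(100) NOT NULL
    );

    INSERT INTO dbo.ConnectionTest (TestId, TestMessage)
    VALUES (1, 'Connected to the Fabric Data Warehouse');

    SELECT TestId, TestMessage
    FROM   dbo.ConnectionTest;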


Common Warehouse Load Patterns

Ben Johnston continues a series on warehouse load patterns:

This continues and finishes my two-part series on warehouse load patterns. There are many methods to transfer rows between systems from a basic design perspective. This isn’t specific to any ETL tool but rather covers the basic patterns for moving data. The most difficult part in designing a pattern is efficiency. It has to be accurate and not adversely impact the source system, but this is all intertwined and dependent on efficiency. You only want to move the rows that have changed or been added since the previous ETL execution (the deltas). This reduces the network load, the source system load (I/O, CPU, locking, etc.), and the destination system load. Being efficient also improves speed and, as a direct result, increases the potential frequency of each ETL run, which has a direct impact on business value.

The pattern you select depends on many things. The previous part of the series covers generic design patterns and considerations for warehouse loads that can be applied to most of the ETL designs presented below. This section covers patterns I have used in various projects. I’m sure there are some patterns I have missed, but these cover the most used types that I have seen. These are not specific to any data engine or ETL tool, but the examples use SQL Server as a base for functionality considerations. Design considerations, columns available, administrative support, DevOps practices, reliability of systems, and cleanliness of data all come into consideration when choosing your actual ETL pattern.

Click through for a compendium of common patterns you can use to indicate that a row should go into a warehouse.
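
One of the most common of those patterns is a watermark-driven delta load: pull only the rows whose change-tracking column is newer than the last successful run, then merge them in. A simplified T-SQL sketch, with control, staging, and fact table names invented for illustration:

    -- 1. Get the high-water mark from the previous successful load.
    DECLARE @LastLoad DATETIME2(3) =
        (SELECT MAX(LastLoadedModifiedDate)
         FROM etl.LoadControl
         WHERE TableName = 'Sales.Orders');

    -- 2. Pull only the delta from the source into staging.
    INSERT INTO stg.Orders (OrderId, CustomerId, OrderTotal, ModifiedDate)
    SELECT OrderId, CustomerId, OrderTotal, ModifiedDate
    FROM   Sales.Orders
    WHERE  ModifiedDate > @LastLoad;

    -- 3. Merge the delta into the warehouse table.
    MERGE dw.FactOrders AS tgt
    USING stg.Orders    AS src
        ON tgt.OrderId = src.OrderId
    WHEN MATCHED THEN
        UPDATE SET tgt.OrderTotal   = src.OrderTotal,
                   tgt.ModifiedDate = src.ModifiedDate
    WHEN NOT MATCHED THEN
        INSERT (OrderId, CustomerId, OrderTotal, ModifiedDate)
        VALUES (src.OrderId, src.CustomerId, src.OrderTotal, src.ModifiedDate);

    -- 4. Record the new high-water mark for the next run.
    UPDATE etl.LoadControl
    SET    LastLoadedModifiedDate = (SELECT MAX(ModifiedDate) FROM stg.Orders)
    WHERE  TableName = 'Sales.Orders';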


Warehousing and Power BI in Microsoft Fabric

Tomaz Kastrun continues a series on Microsoft Fabric. Day 15 covers building a warehouse:

I have named mine “Advent2023_DWH”.

You can create a warehouse using T-SQL scripts, from Dataflow Gen2, from data pipelines, and from the sample data. Let’s select the sample data and grab a coffee.

Day 16 looks at data pipelines:

With the Fabric warehouse created and explored, let’s see how we can use pipelines to get data into the Fabric warehouse.

In the existing data warehouse, we will introduce new data. By clicking “new data”, two options will be available: pipelines and dataflows. Select pipelines and give it a name.

And Day 17 provides a primer on how Power BI can read Fabric assets:

Within Power BI in Fabric, you will find many of the components that can be used to create a final report. Here are the components:


ML Models and Data Warehouses in Microsoft Fabric

Tomaz Kastrun continues a series on Microsoft Fabric. First up is creating ML models:

Protip: Both experiments and the ML model version look similar, and you can intuitively switch between both of them. But do not get confused, as the ML Model version applies the best-selected model from the experiment and can be used for inference.

Then we switch context to data warehousing:

Today we will start exploring the Fabric Data Warehouse.

With its data lake-centric logic, the data warehouse in Fabric is built on a distributed processing engine that enables automated scaling. The SaaS experience creates a segue to easier analysis and reporting, and at the same time gives the ability to run heavy workloads against open data formats simply by using Transact-SQL (T-SQL). Microsoft OneLake holds a single copy of the data for all of the services, and that copy can be consumed in a data warehouse, data lake, or SQL analytics endpoint.
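
That single copy in OneLake is what makes cross-item querying possible; from a Fabric warehouse you can, for example, reference a lakehouse table in the same workspace with three-part naming. A small sketch, where the lakehouse name and all tables are hypothetical (only the Advent2023_DWH name comes from the series):

    -- Hypothetical names: join a warehouse dimension to a table that physically lives in a
    -- lakehouse, without copying data, because both are backed by the same OneLake storage.
    SELECT d.CustomerName,
           SUM(f.SalesAmount) AS TotalSales
    FROM   Advent2023_DWH.dbo.DimCustomer      AS d
    JOIN   Advent2023_Lakehouse.dbo.FactSales  AS f
           ON f.CustomerKey = d.CustomerKey
    GROUP  BY d.CustomerName;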


Constraints in Microsoft Fabric Data Warehouses

Brian Bønk slips out of the constraints:

When working with data and building data models, I personally seldom use the constraints feature on a database. Call me lazy – but I think constraints add unnecessary complexity when building data models for reporting. Especially if you are working with some of the new platforms – like Microsoft Fabric, where you are using stateless compute, i.e. data storage is separated from the compute layer.

I understand the need for constraints on other database systems, like OLTP systems.

In reporting models it can be somewhat useful to have constraints between tables, as they help/force you into some level of governance in your data model.

But how can we use them in Microsoft Fabric, and are they easy to work with?

Read on for those answers. I will note that I’m a stickler about constraints in transactional systems, though I agree that constraints in warehouses are not critical—assuming, at least, that you’re following the Kimball approach and have one and only one mechanism to write data, and that you have other mechanisms for vetting data quality.
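
For reference, the constraints a Fabric warehouse does accept are informational only: primary key and unique constraints must be declared NONCLUSTERED and NOT ENFORCED, and foreign keys NOT ENFORCED, so they document the model without the engine validating rows. A small sketch with illustrative table names:

    -- Fabric Warehouse: constraints are metadata only, so they must be created as NOT ENFORCED.
    ALTER TABLE dbo.DimCustomer
        ADD CONSTRAINT PK_DimCustomer PRIMARY KEY NONCLUSTERED (CustomerKey) NOT ENFORCED;

    ALTER TABLE dbo.FactSales
        ADD CONSTRAINT FK_FactSales_DimCustomer
        FOREIGN KEY (CustomerKey) REFERENCES dbo.DimCustomer (CustomerKey) NOT ENFORCED;

    -- The engine will not reject duplicates or orphaned keys; data quality checks stay in the load process.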


Data Warehouse ETL Patterns

Ben Johnston starts a new series:

No matter the ETL tool used, there are some basic patterns to follow when transferring data between systems. There are many data tools and platforms, but the basic patterns remain the same. This focuses on SQL Server, but most of these methods work in any data platform. Even if you are using a virtualization layer, you likely need to prepare the data before exposing it to that engine, which means ETL and data transfers.

Warehouse is defined very loosely here as a data warehouse, but the same process applies to other systems. This includes virtualization layers and, to a smaller degree, bulk transfers between transactional systems.

Read on for a few things Ben recommends you have in place before beginning the project, as well as several warehouse loading patterns.
