Press "Enter" to skip to content

Category: Warehousing

Modern Data Warehousing with Data Lake Storage and Azure Data Factory

Josephine Bush continues a series on modern data warehousing:

In today’s data-driven world, having the right tools to manage and process large datasets is crucial. That’s where Azure Data Lake Storage (ADLS) and Azure Data Factory (ADF) come in handy, making it easier than ever to store and transform your data. In this post, I’ll show you how to set up ADLS to store your Parquet files and configure ADF to manage your data flows efficiently.

Read on for an overview of both technologies.
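
If you want a feel for the storage half before diving into the post, here is a minimal Python sketch of landing a Parquet file in ADLS Gen2. This is not Josephine's walkthrough (the ADF data flow configuration happens in the portal), and the storage account, container, and file path below are placeholders.

```python
# Minimal sketch: write a small DataFrame to Parquet and upload it to ADLS Gen2.
# Requires: pandas, pyarrow, azure-identity, azure-storage-file-datalake.
import io

import pandas as pd
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Serialize sample data to Parquet in memory.
df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)

# Authenticate with Entra ID and point at the storage account (placeholder name).
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Upload into a hypothetical "raw" container at a hypothetical path.
file_system = service.get_file_system_client("raw")
file_client = file_system.get_file_client("sales/2024/sales.parquet")
file_client.upload_data(buffer.getvalue(), overwrite=True)
```

From there, an ADF pipeline or data flow can pick the file up as a source, which is where the post's walkthrough takes over.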

Getting Row Counts of All Tables in a Microsoft Fabric Warehouse

Koen Verbeeck busts out the tally counter:

It says the data is 352MB in size, but after loading the data I was curious about how many rows were actually in that sample data set. Unfortunately, it’s not as straightforward as with a “normal” SQL Server database to get the row counts. First of all, when you connect with SSMS to the database there’s sadly no option to get the row counts report:

The post is a little depressing, really. But still worth the read.
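
If you just need the numbers and don't mind brute force, one hedged workaround (not Koen's approach from the post) is to loop over INFORMATION_SCHEMA.TABLES on the warehouse's SQL endpoint and issue a COUNT(*) per table. The server and database names below are placeholders.

```python
# Minimal sketch: count rows per table in a Fabric warehouse via its SQL endpoint.
# Requires pyodbc and ODBC Driver 18 for SQL Server.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace-endpoint>.datawarehouse.fabric.microsoft.com;"
    "Database=<warehouse-name>;"
    "Authentication=ActiveDirectoryInteractive;"
)
cursor = conn.cursor()

# Enumerate user tables, then count the rows in each one.
cursor.execute(
    "SELECT TABLE_SCHEMA, TABLE_NAME "
    "FROM INFORMATION_SCHEMA.TABLES "
    "WHERE TABLE_TYPE = 'BASE TABLE'"
)
for schema, name in cursor.fetchall():
    count = cursor.execute(f"SELECT COUNT(*) FROM [{schema}].[{name}]").fetchval()
    print(f"{schema}.{name}: {count} rows")
```

It scans every table, so it is slow on a large warehouse, but it works when the usual reports and DMV shortcuts are not available.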

Implementing a Star Schema for a Power BI Semantic Model

Nikola Ilic reminds us to keep Ralph Kimball’s Data Warehouse Toolkit book at hand:

But, what is a star schema in the first place? I have good and bad news for you:)…The bad news is: I’m not covering it in this article, because this one focuses on explaining how to implement a star schema in Power BI (assuming that you already know what star schema is). The good news is: I’ve already written about it, so go and read this article first, if you’re not sure what star schema represents in the world of data modeling…

Now, let’s get our hands dirty and build a star schema!

Read on for the demo.

Mirroring Snowflake to Microsoft Fabric

Reza Rad hogs the photocopier:

Microsoft Fabric offers an end-to-end SaaS analytics solution; however, the world is using all kinds of data sources in its implementation. Mirroring is a new functionality in Fabric that allows customers to keep their data wherever they are, but then they can use Fabric analytics solutions with the same speed and performance as if their data were in Fabric. Best of all, this won’t cost extra. If you wonder what it is and how it works, read this article.

Click through for the video and article.

Mirroring Azure SQL DB into Microsoft Fabric

Dennes Torres holds up a mirror:

You need to read data from production to build a single source of truth. If you create pipelines reading directly from production, you will create additional load on the production environment. The mirror allows you to do much of the production reporting from the mirror, leaving the production environment to serve other users. Keep in mind: production reporting, but not analytics reporting.

Mirroring a production database to Fabric is one method to ensure the load on production stays as low as possible; the data is transferred to Fabric, and the transformations are completed from that point.

Is that all? What about avoiding pipeline creation? Not really: you still need to create pipelines, as I will explain ahead.

Click through for the demo and explanation. This is an important thing for people to note: mirroring doesn’t eliminate ELT. You still have the data lake process to work through, as your transactional system does not and should not look like your reporting system.

Exposing Kafka Data in Iceberg using Tableflow

Marc Selwan announces a new product:

We’re excited to talk about our vision for Tableflow, which makes it push-button simple to take Apache Kafka® data and feed it directly into your data lake, warehouse, or analytics engine as Apache Iceberg® tables. Making operational data accessible to the analytical world is traditionally a complex, expensive, and brittle process and we believe we can do better to unify the operational and analytical estates.

Tableflow removes all this erroneous, duplicative work and helps convert Kafka topics and associated schemas to Iceberg tables in one click. This is central to Confluent’s vision to build the world’s leading data streaming platform that fuels any operational and analytical workload with real-time data products.

It looks like this is currently in early access, but you can see where Confluent intends to take the product.

Dropping Objects in SQL Server and Snowflake

Kevin Wilkie gets the drop on us:

When you’re working between SQL Server and Snowflake, there can be a lot of crossover that may make you forget what system you’re working in. Sometimes it’s close, but not close enough.

Today, let’s go over something that should be rather simple – removing old objects that we shouldn’t need any longer.

Read on to see how the two data platform technologies differ in this regard.
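
As a small taste of the differences (these are not Kevin's examples, and the object names are made up): both dialects support IF EXISTS on DROP, but Snowflake wants the argument signature when dropping a procedure and offers UNDROP through Time Travel, which SQL Server has no counterpart for.

```python
# Illustrative drop statements for each platform, held as strings.
SQL_SERVER_DROPS = """
DROP TABLE IF EXISTS dbo.StagingOrders;        -- IF EXISTS works in SQL Server 2016+
DROP PROCEDURE IF EXISTS dbo.LoadOrders;
"""

SNOWFLAKE_DROPS = """
DROP TABLE IF EXISTS STAGING.ORDERS;
DROP PROCEDURE IF EXISTS STAGING.LOAD_ORDERS(); -- Snowflake needs the argument signature
UNDROP TABLE STAGING.ORDERS;                    -- recoverable within the Time Travel window
"""

if __name__ == "__main__":
    print(SQL_SERVER_DROPS, SNOWFLAKE_DROPS)
```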

Visualizing Genomics Data with Streamlit in Snowflake

Rebecca O’Connor builds an app:

The blog describes how this data is complemented with the following two additional data sets:

  • An Annotation dataset
  • A Panel Dataset

Simple SQL queries are then used to gain answers to a multitude of questions held within the vast amount of data.

I utilised the same datasets in order to create a Streamlit app.

Click through for the code. This is the reason why I like Streamlit so much: you can build an interactive data-centric application very easily. Granted, you can abuse Streamlit pretty hard, but it is powerful.
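
To give a sense of how little code that takes, here is a minimal Streamlit sketch in the same spirit: query Snowflake, filter interactively, chart the result. The table and column names are hypothetical (not the genomics schema from the post), and st.connection("snowflake") assumes credentials in .streamlit/secrets.toml.

```python
# Minimal sketch of an interactive data app: one query, one filter, one chart.
import streamlit as st

st.title("Variant browser")

# Reads Snowflake credentials from .streamlit/secrets.toml.
conn = st.connection("snowflake")
df = conn.query("SELECT CHROMOSOME, GENE, VARIANT_COUNT FROM ANNOTATIONS LIMIT 10000")

# Let the user pick a gene and show only its rows.
gene = st.selectbox("Gene", sorted(df["GENE"].unique()))
subset = df[df["GENE"] == gene]

st.bar_chart(subset, x="CHROMOSOME", y="VARIANT_COUNT")
st.dataframe(subset)
```

Save it as app.py, launch it with streamlit run app.py, and that is the whole application.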

Data Vault 2.0 Models in Microsoft Fabric

Michael Olschimke and Dmytro Polishchuk continue a series:

The last article in this blog series discussed the basic entity types in Data Vault 2.0: hubs, links and satellites. While it would be theoretically possible to limit a model to just these three basic entity types, the resulting Data Vault model would be inefficient: it would most likely consume too much storage, be less efficient due to the many joins, and require a number of grain shifts during information delivery. This is due to certain characteristics in the data that require special treatment.

For these characteristics, Data Vault 2.0 provides special entity types that deal with the specialities. This article focuses on two of them: the non-historized link, which is used to capture transactions and events, and the multi-active satellite, which is used to model multiple active descriptions for the same parent hub or link in the same load.

Read on for an example of how to implement this in a Microsoft Fabric warehouse.
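
For a rough idea of what these entities look like as physical tables, here is an illustrative DDL sketch following common Data Vault 2.0 conventions (hash keys, load date, record source). The column names are hypothetical, not the authors' model, and hashing strategies vary.

```python
# Illustrative Data Vault 2.0 DDL held as strings (hypothetical names).

# A non-historized link records immutable transactions or events directly on
# the link: no hash diff, no updates, descriptive attributes live on the row.
NON_HISTORIZED_LINK = """
CREATE TABLE dv.lnk_payment (
    payment_hk      CHAR(64)       NOT NULL,  -- link hash key
    customer_hk     CHAR(64)       NOT NULL,  -- references hub_customer
    account_hk      CHAR(64)       NOT NULL,  -- references hub_account
    transaction_id  VARCHAR(50)    NOT NULL,  -- dependent child key
    load_date       DATETIME2(6)   NOT NULL,
    record_source   VARCHAR(100)   NOT NULL,
    amount          DECIMAL(18, 2) NOT NULL,
    currency_code   CHAR(3)        NOT NULL
);
"""

# A multi-active satellite keeps several active rows per parent key in the
# same load, distinguished by a multi-active (subsequence) key.
MULTI_ACTIVE_SATELLITE = """
CREATE TABLE dv.sat_customer_phone (
    customer_hk    CHAR(64)     NOT NULL,  -- parent hub hash key
    load_date      DATETIME2(6) NOT NULL,
    phone_type     VARCHAR(20)  NOT NULL,  -- multi-active key
    hash_diff      CHAR(64)     NOT NULL,
    record_source  VARCHAR(100) NOT NULL,
    phone_number   VARCHAR(30)  NOT NULL
);
"""
```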

Create and Connect to a Fabric Data Warehouse

Olivier Van Steenlandt builds a warehouse:

In this data recipe series, Microsoft Fabric – Data Warehouse will be explored. As a starting point, a blank Fabric workspace is used. You can sign up for a free Fabric trial by using the following URL: Data Analytics | Microsoft Fabric

In this data recipe, we will create a brand-new Data Warehouse in Fabric. Once created, we will connect to our Data Warehouse using Azure Data Studio.

Click through for the step-by-step process.
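
If you would rather script the creation step than click through the portal, the sketch below assumes the Fabric REST API's create-warehouse endpoint; this is not part of Olivier's recipe (which uses the browser and then Azure Data Studio), and the workspace ID and warehouse name are placeholders.

```python
# Hedged sketch: create a Fabric warehouse via the REST API (assumed endpoint).
# Requires: requests, azure-identity.
import requests
from azure.identity import InteractiveBrowserCredential

# Acquire an Entra ID token scoped to the Fabric API.
token = InteractiveBrowserCredential().get_token(
    "https://api.fabric.microsoft.com/.default"
)

workspace_id = "<workspace-guid>"  # placeholder
response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/warehouses",
    headers={"Authorization": f"Bearer {token.token}"},
    json={"displayName": "dwh_demo", "description": "Created from a script"},
)
response.raise_for_status()
print(response.status_code)  # 201, or 202 if provisioning continues asynchronously
```

Once the warehouse exists, the SQL connection string it exposes is the same one Azure Data Studio (or any other client) uses.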
