Press "Enter" to skip to content

Category: Warehousing

Choosing between Data Warehouses, Lakes, and Lakehouses

Den Smyrnov talks architecture:

Historically, the two most popular approaches to storing and managing data have been the Data Warehouse and the Data Lake. The choice between them usually depends on business objectives and needs. While Data Lakes are ideal for preserving large volumes of diverse data, warehouses are more favorable for business intelligence and reporting. Sometimes, organizations try to have the best of both worlds and mix Data Lake & Data Warehouse architectures. This, however, can be a time-consuming and costly process.

Against this backdrop, a new hybrid approach—Data Lakehouse—has emerged. It combines features of a Data Lake and a Data Warehouse, allowing companies to store and analyze data in the same repository and eliminating the Data Warehouse vs. Data Lake dilemma. Data Lakehouse mixes the scalability and flexibility of a Data Lake with the ability to extract insights from data easily. Ever so compelling, this approach still has certain limitations. It should not be treated as a “one-size-fits-all” solution.

Read on for an explanation of each of these three styles, including their pros and cons.


Time Travel in the Microsoft Fabric Warehouse

Reza Rad hops in the Delorean:

Data changes over time, especially in the world of BI and data warehousing systems, where the data gets updated frequently through ETL processes. This means that the data you see in the warehouse today might differ from yesterday, the day before, and so on. Some parts of this data can be retrieved by date. You can, for example, query the sales amount from the sales table where the date is the 2nd of April. That would give you the sales amount for the 2nd of April, even if you are querying it on the 23rd of May.

However, what if some of the sales transactions on the 2nd of April got updated? The sales amount you see would likely be the updated amount, not the original amount. It is sometimes useful to be able to see what that original amount was; in other words, to travel in time and see what that value was.

Click through for a combination video and article. The syntax isn’t quite the same as with temporal tables in SQL Server, though it’s close enough to follow along if that’s your relevant experience.
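As a rough illustration of the difference, here is a minimal sketch against a hypothetical dbo.Sales table (the table and column names are mine, not from the post): Fabric Warehouse time travel hangs off an OPTION clause with a UTC timestamp, rather than the FOR SYSTEM_TIME clause you may know from SQL Server temporal tables.

```sql
-- A minimal sketch, assuming a hypothetical dbo.Sales table with OrderDate
-- and SalesAmount columns. The OPTION clause asks the Fabric Warehouse to
-- return the data as it looked at the given point in time (UTC).
SELECT SUM(SalesAmount) AS SalesAmountOn2ndApril
FROM dbo.Sales
WHERE OrderDate = '2024-04-02'
OPTION (FOR TIMESTAMP AS OF '2024-04-03T00:00:00.000');
```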


Modern Data Warehousing with Data Lake Storage and Azure Data Factory

Josephine Bush continues a series on modern data warehousing:

In today’s data-driven world, having the right tools to manage and process large datasets is crucial. That’s where Azure Data Lake Storage (ADLS) and Azure Data Factory (ADF) come in handy, making it easier than ever to store and transform your data. In this post, I’ll show you how to set up ADLS to store your Parquet files and configure ADF to manage your data flows efficiently.

Read on for an overview of both technologies.


Getting Row Counts of All Tables in a Microsoft Fabric Warehouse

Koen Verbeeck busts out the tally counter:

It says the data is 352MB in size, but after loading the data I was curious about how many rows were actually in that sample data set. Unfortunately, it’s not as straightforward as with a “normal” SQL Server database to get the row counts. First of all, when you connect to the database with SSMS, there’s sadly no option to get the row counts report:

The post is a little depressing, really. But still worth the read.
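If you need something in the meantime, one workaround (a hedged sketch, not necessarily the approach from the post) is to generate a COUNT query per table from INFORMATION_SCHEMA and run the generated script yourself.

```sql
-- A hedged sketch: build one COUNT_BIG(*) statement per user table in the
-- Fabric warehouse, then copy the generated script, remove the trailing
-- UNION ALL, and execute it to get the row counts.
SELECT
    'SELECT ''' + TABLE_SCHEMA + '.' + TABLE_NAME + ''' AS table_name, ' +
    'COUNT_BIG(*) AS row_count FROM [' + TABLE_SCHEMA + '].[' + TABLE_NAME + ']' +
    ' UNION ALL' AS count_query
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE';
```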


Implementing a Star Schema for a Power BI Semantic Model

Nikola Ilic reminds us to keep Ralph Kimball’s Data Warehouse Toolkit book at hand:

But, what is a star schema in the first place? I have good and bad news for you:)…The bad news is: I’m not covering it in this article, because this one focuses on explaining how to implement a star schema in Power BI (assuming that you already know what star schema is). The good news is: I’ve already written about it, so go and read this article first, if you’re not sure what star schema represents in the world of data modeling…

Now, let’s get our hands dirty and build a star schema!

Read on for the demo.
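If you want the shape of the thing in your head before the demo, here is a generic sketch of a star schema expressed as plain T-SQL DDL. This is an illustration of the pattern only, not the Power BI implementation from the article, and the table and column names are hypothetical.

```sql
-- Hypothetical star schema: dimensions carry descriptive attributes, while
-- the fact table carries measures plus a key pointing at each dimension.
CREATE TABLE dbo.DimDate
(
    DateKey      int  NOT NULL,
    CalendarDate date NOT NULL
);

CREATE TABLE dbo.DimProduct
(
    ProductKey  int          NOT NULL,
    ProductName varchar(100) NOT NULL
);

CREATE TABLE dbo.FactSales
(
    DateKey     int           NOT NULL,  -- points at DimDate.DateKey
    ProductKey  int           NOT NULL,  -- points at DimProduct.ProductKey
    Quantity    int           NOT NULL,
    SalesAmount decimal(18,2) NOT NULL
);
```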


Mirroring Snowflake to Microsoft Fabric

Reza Rad hogs the photocopier:

Microsoft Fabric offers an end-to-end SaaS analytics solution; however, the world is using all kinds of data sources in its implementation. Mirroring is a new functionality in Fabric that allows customers to keep their data wherever they are, but then they can use Fabric analytics solutions with the same speed and performance as if their data were in Fabric. Best of all, this won’t cost extra. If you wonder what it is and how it works, read this article.

Click through for the video and article.


Mirroring Azure SQL DB into Microsoft Fabric

Dennes Torres holds up a mirror:

You need to read data from production to build a single source of truth. If you create pipelines reading directly from production, you will create additional load over the production environment. The mirror allows you to do much of the production reporting from the mirror, leaving the production environment to serve other users. Keep in mind: production reporting, but not analytics reporting.

Mirroring a production database to Fabric is one method to ensure the load over production will be as low as possible, with the data transferred to Fabric so the transformations can continue from that point.

Is that all? Does it eliminate pipeline creation? Not really; you still need to create pipelines, as I will explain ahead.

Click through for the demo and explanation. This is an important thing for people to note: mirroring doesn’t eliminate ELT. You still have the data lake process to work through, as your transactional system does not and should not look like your reporting system.
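To make that concrete, here is a hedged sketch of the sort of transformation that remains your job: a Fabric warehouse reading from the mirrored database in the same workspace via three-part naming. The database, table, and column names are hypothetical, and this assumes cross-database querying against the mirror's SQL endpoint is available in your workspace.

```sql
-- A hedged sketch: stage a reporting table in a Fabric warehouse by reading
-- from a mirrored Azure SQL database in the same workspace. Mirroring keeps
-- the replica current, but shaping it for reporting is still up to you.
CREATE TABLE dbo.DimCustomer
AS
SELECT
    c.CustomerID AS CustomerKey,
    c.CustomerName,
    c.Country
FROM MirroredSalesDb.dbo.Customers AS c;  -- cross-database query against the mirror
```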


Exposing Kafka Data in Iceberg using Tableflow

Marc Selwan announces a new product:

We’re excited to talk about our vision for Tableflow, which makes it push-button simple to take Apache Kafka® data and feed it directly into your data lake, warehouse, or analytics engine as Apache Iceberg® tables. Making operational data accessible to the analytical world is traditionally a complex, expensive, and brittle process and we believe we can do better to unify the operational and analytical estates.

Tableflow removes all this error-prone, duplicative work and helps convert Kafka topics and associated schemas to Iceberg tables in one click. This is central to Confluent’s vision to build the world’s leading data streaming platform that fuels any operational and analytical workload with real-time data products.

It looks like this is currently in early access, but you can see where Confluent intends to take the product.


Dropping Objects in SQL Server and Snowflake

Kevin Wilkie gets the drop on us:

When you’re working between SQL Server and Snowflake, there can be a lot of crossover that may make you forget what system you’re working in. Sometimes it’s close, but not close enough.

Today, let’s go over something that should be rather simple – removing old objects that we shouldn’t need any longer.

Read on to see how the two data platform technologies differ in this regard.
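As a quick taste of the kind of difference involved (object names here are illustrative, not taken from the post): both platforms accept IF EXISTS on a drop, but Snowflake wants the argument signature when dropping a procedure, and a dropped Snowflake table can be recovered with UNDROP while it is still within the Time Travel retention period.

```sql
-- SQL Server: IF EXISTS guards the drop; procedures are dropped by name only.
DROP TABLE IF EXISTS dbo.StaleReport;
DROP PROCEDURE IF EXISTS dbo.LoadStaleReport;

-- Snowflake: IF EXISTS works too, but dropping a procedure requires its
-- argument signature, and a dropped table can still be brought back.
DROP TABLE IF EXISTS STALE_REPORT;
DROP PROCEDURE IF EXISTS LOAD_STALE_REPORT(VARCHAR);
-- UNDROP TABLE STALE_REPORT;
```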


Visualizing Genomics Data with Streamlit in Snowflake

Rebecca O’Connor builds an app:

The blog describes how this data is complemented with the following two additional data sets:

  • An annotation dataset
  • A panel dataset

Simple SQL queries are then used to gain answers to a multitude of questions held within the vast amount of data.

I utilised the same datasets in order to create a Streamlit app.

Click through for the code. This is the reason why I like Streamlit so much: you can build an interactive data-centric application very easily. Granted, you can abuse Streamlit pretty hard, but it is powerful.
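For a sense of what “simple SQL queries” over those datasets might look like, here is a hedged sketch with entirely hypothetical table and column names (the schema in the post will differ): counting variants per gene and panel by joining the three datasets.

```sql
-- A hedged sketch with hypothetical names: join variants to annotations by
-- genomic position, then to panels by gene, and count variants per gene/panel.
SELECT
    a.gene_name,
    p.panel_name,
    COUNT(*) AS variant_count
FROM variants    AS v
JOIN annotations AS a
  ON v.chrom = a.chrom
 AND v.pos BETWEEN a.start_pos AND a.end_pos
JOIN panels      AS p
  ON a.gene_name = p.gene_name
GROUP BY a.gene_name, p.panel_name
ORDER BY variant_count DESC;
```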
