Data Lake – Page 2 – Curated SQL

Spring Cleaning for Lakehouse Tables with VACUUM

Published 2025-04-21 by Kevin Feasel

Chen Hirsh says it’s time to do a bit of cleanup:

Delta tables create new files for every change made to the table (insert, update, delete). You can use the old files to “time travel” – to query or restore older versions of your table. This is a wonderful feature, but over time, these files accumulate in your storage and will increase your storage costs.

Read on for a primer on the VACUUM command, how frequently you might want to run it, and how much history you want to retain. The example is specific to Databricks, but the mechanisms work the same way in other lakehouses, such as Microsoft Fabric.
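
As a rough illustration of the knobs involved, here is a minimal sketch that builds a Delta Lake VACUUM statement. The table name `sales.orders` and the helper `build_vacuum_statement` are hypothetical and do not come from Chen's post; the `RETAIN ... HOURS` and `DRY RUN` clauses are standard Delta Lake SQL, and 168 hours reflects Delta Lake's default seven-day retention threshold.

```python
# Delta Lake's default retention threshold is 7 days, expressed here in hours.
DEFAULT_RETENTION_HOURS = 7 * 24


def build_vacuum_statement(
    table_name: str,
    retain_hours: int = DEFAULT_RETENTION_HOURS,
    dry_run: bool = False,
) -> str:
    """Build a VACUUM statement (hypothetical helper, for illustration only)."""
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    statement = f"VACUUM {table_name} RETAIN {retain_hours} HOURS"
    if dry_run:
        # DRY RUN lists the files that would be deleted without deleting them.
        statement += " DRY RUN"
    return statement


# Preview first, then keep 30 days of history for time travel.
preview = build_vacuum_statement("sales.orders", dry_run=True)
cleanup = build_vacuum_statement("sales.orders", retain_hours=24 * 30)
print(preview)
print(cleanup)
```

In a notebook you would typically pass the resulting string to `spark.sql(...)`, assuming a `spark` session is available. Running the dry run first is a cheap way to see how much would be removed before committing to a retention window.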


Comparing Apache Iceberg to Delta Lake

Published 2025-04-18 by Kevin Feasel

Maria Zakourdaev compares technologies:

Public cloud blob storage has been a standard for data lakes for the last 10 years. Blob storage, at first, came to solve data warehouse storage limitations. It is very cheap and has unlimited capacity. You can store any data format (structured, semi-structured, or unstructured) in the data lake located on a blob storage, and keep any amount of raw data for an unlimited time. When considering Apache Iceberg vs Delta Lake, both can manage data efficiently. Depending on the access frequency, data can be stored on cold or warm types of cloud storage, saving even more costs.

Read on to see how the two table formats compare along several dimensions, as well as some general guidance at the end on which to choose.


Securing Parquet Files

Published 2025-04-17 by Kevin Feasel

Vamshidhar Morusu writes on vulnerabilities:

Although open-source Java libraries are essential for contemporary software development, they frequently introduce serious security flaws that put systems at risk. The risks are highlighted by recent examples:

Deep Java Library (CVE-2025-0851): Attackers can write files outside of designated directories due to a path traversal vulnerability in DJL’s archive extraction tools. Versions 0.1.0 through 0.31.0 are affected by this vulnerability, which may result in data corruption or illegal system access. Version 0.31.1 has a patch for it.

Jackson Library (CVE-2022-42003): Unsafe serialization/deserialization configurations in the well-known JSON parser cause a high-severity problem (CVSS 7.5) that could result in denial-of-service attacks.

These illustrations highlight how crucial it is for open-source libraries to have careful dependency management, frequent updates, and security audits. Companies should enforce stringent validation and use automated vulnerability scanning tools.

Click through for a more detailed view of a third CVE, as well as tips to protect your data.
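
To make the path traversal idea concrete, here is a minimal sketch of the kind of check an archive extractor can apply before writing a file. It is written in Python for brevity and is not DJL's actual fix; the function name `is_safe_member` and the example paths are hypothetical.

```python
from pathlib import Path


def is_safe_member(dest_dir: str, member_name: str) -> bool:
    """Return True only if the archive entry would land inside dest_dir."""
    dest = Path(dest_dir).resolve()
    # Joining and resolving collapses any ".." segments; an absolute
    # member name replaces dest entirely, so it is rejected as well.
    target = (dest / member_name).resolve()
    return target == dest or dest in target.parents


# A normal entry is accepted; entries that escape the directory are not.
print(is_safe_member("/tmp/extract", "data/part-0.parquet"))
print(is_safe_member("/tmp/extract", "../../etc/passwd"))
print(is_safe_member("/tmp/extract", "/etc/passwd"))
```

The point of the sketch is that the comparison happens on the resolved path rather than on the raw entry name, which is where naive extractors tend to go wrong.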


Iceberg Data Support in OneLake

Published 2025-03-18 by Kevin Feasel

Matthew Hicks isn’t replicating data anymore:

Microsoft OneLake is the single, unified, logical data lake that allows your entire organization to store, manage, and analyze data in one place. It provides seamless integration with various data sources and engines, making it easier to derive insights and drive innovation.

At the most recent Microsoft Build conference, we announced the integration effort between Snowflake and OneLake, which aims to allow users of both Snowflake and Microsoft Fabric to work on the same Iceberg data in OneLake, with no data duplication/movement needed. More recently, we released the preview of OneLake’s Iceberg table format support, which included the ability for Snowflake to write Iceberg tables directly to OneLake.

Click through for more information about the current status of this feature, as well as what’s coming soon.


Writing Data into a Microsoft Fabric Lakehouse via Notebook

Published 2025-03-12 by Kevin Feasel

Stepan Resl writes some code:

Since Lakehouse is one of the key items within Microsoft Fabric, it is important to know how to write data into it in various formats and using different tools. One of the most common tools is notebooks, as they provide great flexibility and speed for development and testing with graphical outputs. In this article, I want to focus primarily on the following types of notebooks:

PySpark

Python

Click through to see how it works in both notebook types.


Table Compaction in Apache Spark

Published 2025-02-27 by Kevin Feasel

Miles Cole groups things together:

If there is anything that data engineers agree about, it’s that table compaction is important. Often one of the first big lessons that folks will learn early on is that not compacting tables can present serious performance issues: you’ve gotten your lakehouse pilot approved and it’s been running for a couple months in production and you find that both reads and writes are increasingly getting slower and slower while your data volumes have not increased drastically. Guess what, you almost surely have a “small file problem”.

What engineers won’t always sing the same tune on is how and when to perform table compaction.

Read on for a dive into the power of compaction (converting a large number of small files into a small number of large files) and plenty of tips along the way.
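
For a feel of what compaction does, here is a toy sketch that groups small files into batches no larger than a target size, so each batch could be rewritten as one larger file. This is not how Spark or Delta Lake implement it (engines handle this internally, for example through Delta's OPTIMIZE command); the 128 MB target, the file sizes, and the `plan_compaction` helper are all assumptions made for illustration.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group file sizes so each group's total stays within target_mb."""
    groups = []
    current, current_size = [], 0
    for size in sorted(file_sizes_mb):
        # Close the current group once adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical file sizes in MB.
small_files = [8, 16, 4, 64, 32, 100, 12]
plan = plan_compaction(small_files)
print(f"{len(small_files)} files would be rewritten as {len(plan)}")
for group in plan:
    print(group, "->", sum(group), "MB")
```

A file larger than the target simply ends up in a group of its own, which matches the intuition that compaction mostly benefits the long tail of small files.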


Data Retention for Data in the Microsoft Fabric Lakehouse

Published 2025-01-29 by Kevin Feasel

Kenneth Omorodion clears out some data:

More than before, organizations now aim for a well-defined approach to manage their data storage effectively. Some reasons for this include operational efficiency, cost management, regulatory compliance, and strategic decision-making. In this article, I will describe an approach on data retention management for Lakehouse files to manage data storage when the data exists as files in the Fabric Lakehouse.

There’s nothing built in, but Kenneth makes it easy.
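
As a generic illustration of file-based retention (not Kenneth's approach, which you should read in full), here is a minimal sketch that finds files older than a retention window and optionally deletes them. The helper names and the 30-day window are hypothetical, and the sketch assumes the files are reachable through an ordinary file system path, such as a lakehouse mounted in a notebook.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_expired_files(root, retention_days, now=None):
    """Return files under root whose last-modified time is past the window."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * SECONDS_PER_DAY
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def purge_expired_files(root, retention_days, dry_run=True):
    """Delete expired files; with dry_run=True, only report them."""
    expired = find_expired_files(root, retention_days)
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired
```

Calling `purge_expired_files(some_folder, 30)` returns the candidates without touching anything, and passing `dry_run=False` performs the deletion. Defaulting to a dry run is a deliberate choice here, since retention jobs are hard to undo.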


Data Lakes, Warehouses, and Lakehouses

Published 2025-01-06 by Kevin Feasel

Noa Shavit disambiguates three terms:

A data warehouse is a repository and platform for storing, querying, and manipulating data. Warehouses are particularly suited for structured data used for decision support and business intelligence. Modern data warehouses have become more efficient, flexible, and scalable (particularly in the context of massively parallel processing and distributed computation), but they still bear the mark of their early development in the previous century.

The data warehouse concept dates back to data marts in the 1970s. After a long incubation period, the idea began to bear fruit commercially at IBM in the late 1980s and early 1990s. Data warehousing improved on the inefficiency of data marts, siloed data stores maintained by individual departments.

Click through to learn more about each of the three concepts and how they relate.


Implementing a Star Schema in a Microsoft Fabric Lakehouse

Published 2024-09-30 by Kevin Feasel

Nikola Ilic builds a lakehouse:

But, what is a star schema in the first place? I have good and bad news for you :)… The bad news is that I’m not covering it in this article because this one focuses on explaining how to implement a star schema in Fabric Lakehouse (assuming that you already know what star schema is). The good news is: I’ve already written about it, so go and read this article first, if you’re not sure what star schema represents in the world of data modeling…

In one of the previous articles, I also showed how to implement a star schema in Power BI, by leveraging Power Query Editor.

Now, let’s get our hands dirty and build a star schema by using PySpark in the Fabric notebook!

Click through to see how.


Microsoft Fabric Direct Lake and Reframing Operations

Published 2024-09-10 by Kevin Feasel

Reza Rad changes the frame:

Power BI offers a new type of connection to Microsoft Fabric Lakehouse or Warehouse, called Direct Lake. The Direct Lake connection acts like DirectQuery and won’t need the data to be refreshed. However, the Power BI semantic model has refresh settings that can be turned on or off. In this article and video, you will learn about the Refresh settings for the Power BI semantic model that is connected using a Direct Lake connection, what that is, and why it is called Reframe.

Read on to learn more, or to check out the video.


Category: Data Lake