Data Lake – Curated SQL

Getting Started with CF.Cumulus Community Edition

Published 2025-06-23 by Kevin Feasel

For those who have been following along with our product CF.Cumulus, we have been gearing up for some exciting developments and want to give more power and independence to users. As such, we’re putting together some comprehensive “How-to” guides to simplify the deployment process for Community Edition users.

This deployment guide walks you through setting up CF.Cumulus with the Azure Resources depicted below.

Click through for the guide.

OneLake Security and Shortcuts

Published 2025-05-29 by Kevin Feasel

Aaron Merrill explains how OneLake security works when you introduce shortcuts:

OneLake allows for security to be defined once and enforced consistently across Microsoft Fabric. One of its standout features is its ability to work seamlessly with shortcuts, offering users the flexibility to access and organize data from different locations while maintaining robust security controls. In this blog post, we will look at how OneLake security is integrated with shortcuts, explain the distinction between passthrough and delegated auth modes for shortcuts, and look at an example use case.

Read on for an overview of OneLake shortcuts, as well as different security models around them.

Building an ML-Friendly Data Lake with Apache Iceberg

Published 2025-05-23 by Kevin Feasel

Anant Kumar designs a data lake:

As companies collect massive amounts of data to fuel their artificial intelligence and machine learning initiatives, finding the right data architecture for storing, managing, and accessing such data is crucial. Traditional data storage practices are likely to fall short of the scale, variety, and velocity required by modern AI/ML workflows. Apache Iceberg steps in as a strong open-source table format to build solid and efficient data lakes for AI and ML.

Click through for a primer on Iceberg, how to set up a fairly simple data lake, and some functionality that can help in model training.

Writing DAX Query Outputs to Lakehouse Tables

Published 2025-05-15 by Kevin Feasel

Gilbert Quevauvilliers does a bit of writing:

In this blog post I am going to explain how to use a Python notebook with the Semantic Link module to run a DAX query and write the output to a Lakehouse table.

I will show you how to install a Python library and then use it within my Python notebook.

Read on for a quick primer on Semantic Link Labs, followed by the meat of the article.

Reading Delta Tables via SQL Code in a Microsoft Fabric Python Notebook

Published 2025-04-30 by Kevin Feasel

Gilbert Quevauvilliers writes a SQL statement:

I come from a T-SQL background, so using SQL makes it easy for me to work with data.

There are multiple ways to use SQL in a PySpark notebook, and when I started using a Python notebook it was not so straightforward.

In this blog post I will show you how I use SQL code.

As mentioned previously, I am by no means an expert. I typically find a way that works, is fast, and doesn’t consume a lot of capacity. If that works consistently for me, then that is how I go about it.

Click through for the solution, which uses DuckDB. As such, the SQL syntax isn’t T-SQL; it’s closer to the PostgreSQL dialect. But it does do a great job of interacting with Parquet files and Delta tables.

Two Direct Lakes in Microsoft Fabric

Published 2025-04-25 by Kevin Feasel

Nikola Ilic does a bit of digging:

Before you proceed, in case you don’t know what Direct Lake is, I’ve got you covered in this article, where you can learn and understand various Direct Lake concepts, as well as in which scenarios you might consider implementing Direct Lake semantic models. Now that you know what Direct Lake is, let’s digest the latest news…

A couple of days ago, I was reading the official blog post about the latest enhancement to the Direct Lake storage mode for semantic models in Microsoft Fabric. The official blog post can be found here.

Click through for that announcement and what it means.

Spring Cleaning for Lakehouse Tables with VACUUM

Published 2025-04-21 by Kevin Feasel

Chen Hirsh says it’s time to do a bit of cleanup:

Delta tables create new files for every change made to the table (insert, update, delete). You can use the old files to “time travel” – to query or restore older versions of your table. This is a wonderful feature, but over time, these files accumulate in your storage and will increase your storage costs.

Read on for a primer on the VACUUM command, how frequently you might want to run it, and how much history you want to keep. This example is specifically about Databricks, but the mechanisms work the same way in other Delta-based lakehouses such as Microsoft Fabric.
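
To make the retention idea concrete, here is a toy sketch in plain Python of the rule VACUUM applies: a data file is a candidate for deletion only if the current table version no longer references it and it was dropped from the table longer ago than the retention window. The `DataFile` class and `files_to_vacuum` function are hypothetical names made up for this illustration, not part of any Delta Lake API, and the 7-day default mirrors Delta Lake's documented default rather than anything specific to Chen's post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one Parquet file behind a Delta table."""
    path: str
    # None means the current table version still references this file.
    removed_at: Optional[datetime]


def files_to_vacuum(
    files: List[DataFile],
    now: datetime,
    retention: timedelta = timedelta(days=7),  # assumed default window
) -> List[str]:
    """Return paths that are unreferenced and older than the retention window."""
    cutoff = now - retention
    return [
        f.path
        for f in files
        if f.removed_at is not None and f.removed_at < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2025, 4, 21)
    files = [
        DataFile("part-0001.parquet", None),                      # still live
        DataFile("part-0002.parquet", now - timedelta(days=30)),  # stale
        DataFile("part-0003.parquet", now - timedelta(days=2)),   # recent
    ]
    print(files_to_vacuum(files, now))
```

Shortening the retention window frees more storage but also shortens how far back you can time travel, which is the trade-off the article discusses.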

Comparing Apache Iceberg to Delta Lake

Published 2025-04-18 by Kevin Feasel

Maria Zakourdaev compares technologies:

Public cloud blob storage has been a standard for data lakes for the last 10 years. Blob storage, at first, came to solve data warehouse storage limitations. It is very cheap and has unlimited capacity. You can store any data format (structured, semi-structured, or unstructured) in the data lake located on a blob storage, and keep any amount of raw data for an unlimited time. When considering Apache Iceberg vs Delta Lake, both can manage data efficiently. Depending on the access frequency, data can be stored on cold or warm types of cloud storage, saving even more costs.

Read on to see how the two table formats compare along several dimensions, as well as some general guidance at the end on which to choose.

Securing Parquet Files

Published 2025-04-17 by Kevin Feasel

Vamshidhar Morusu writes on vulnerabilities:

Although open-source Java libraries are essential for contemporary software development, they frequently introduce serious security flaws that put systems at risk. The risks are highlighted by recent examples:

Deep Java Library (CVE-2025-0851): Attackers can write files outside of designated directories due to a path traversal vulnerability in DJL’s archive extraction tools. Versions 0.1.0 through 0.31.0 are affected by this vulnerability, which may result in data corruption or illegal system access. Version 0.31.1 has a patch for it.

Jackson Library (CVE-2022-42003): Unsafe serialization/deserialization configurations in the well-known JSON parser cause a high-severity problem (CVSS 7.5) that could result in denial-of-service attacks.

These illustrations highlight how crucial it is for open-source libraries to have careful dependency management, frequent updates, and security audits. Companies should enforce stringent validation and use automated vulnerability scanning tools.

Click through for a more detailed view of a third CVE, as well as tips to protect your data.
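
For a sense of what a path traversal flaw in archive extraction looks like, here is a minimal Python sketch of the usual defensive check: resolve where an archive entry would land and refuse it if that location falls outside the destination directory. The function name `is_safe_member` is hypothetical, and this is an illustration of the general vulnerability class rather than DJL's actual code or its patch.

```python
import os


def is_safe_member(dest_dir: str, member_name: str) -> bool:
    """Return True if extracting member_name under dest_dir stays inside dest_dir."""
    base = os.path.realpath(dest_dir)
    # join() discards base when member_name is absolute, which is exactly
    # the kind of entry the check needs to catch.
    target = os.path.realpath(os.path.join(base, member_name))
    try:
        # commonpath compares whole path components, so a sibling such as
        # "/tmp/extract-evil" does not count as being inside "/tmp/extract".
        return os.path.commonpath([base, target]) == base
    except ValueError:
        # Raised on Windows when the paths are on different drives.
        return False


if __name__ == "__main__":
    for name in ["models/weights.bin", "../evil.txt", "/etc/passwd"]:
        print(name, is_safe_member("/tmp/extract", name))
```

An extraction routine would call a check like this for every entry before writing anything to disk and skip or reject entries that fail it.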

Iceberg Data Support in OneLake

Published 2025-03-18 by Kevin Feasel

Matthew Hicks isn’t replicating data anymore:

Microsoft OneLake is the single, unified, logical data lake that allows your entire organization to store, manage, and analyze data in one place. It provides seamless integration with various data sources and engines, making it easier to derive insights and drive innovation.

At the most recent Microsoft Build conference, we announced the integration effort between Snowflake and OneLake, which aims to allow users of both Snowflake and Microsoft Fabric to work on the same Iceberg data in OneLake, with no data duplication/movement needed. More recently, we released the preview of OneLake’s Iceberg table format support, which included the ability for Snowflake to write Iceberg tables directly to OneLake.

Click through for more information about the current status of this feature, as well as what’s coming soon.
