Press "Enter" to skip to content

Category: Data Lake

Incremental Data Load into Parquet Files from Python

Lee Asher loads some data:

Parquet is a column-oriented open-source storage format increasingly used for “big data” analytics. Yet despite its growing popularity as a native format for data lakes and data warehouses, tools for maintaining these environments remain scarce. Getting data from a SQL environment into Parquet isn’t difficult – but how do we maintain that data over time, keeping it current? In other words, if we already have an existing Parquet file, how can we efficiently append new data to it?

In this article, we’ll introduce the Parquet format, explain some strategies for incrementally updating a Parquet repository, and, with a simple Python script, implement a nightly-feed update process.

Not listed here is one word that I expected: Delta. That's how we normally handle incremental data modification in Parquet-backed data, either that or Apache Iceberg. Lee shows us a different route that can work.
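For a sense of the general shape of the problem before you click through, here is a rough sketch of the two patterns you'll most often see for appending to Parquet with pyarrow. The file names, directory layout, and columns below are invented for illustration; Lee's article covers the details and trade-offs of his approach.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Pattern 1: read the existing file, append the new rows, and rewrite.
# Simple, but rewrites the entire file on every load, and the schemas
# of the old and new data must match.
existing = pq.read_table("sales.parquet")
new_rows = pa.table({"id": [101, 102], "amount": [19.99, 5.49]})
pq.write_table(pa.concat_tables([existing, new_rows]), "sales.parquet")

# Pattern 2: treat a directory as the table and write one new part file
# per load (assumes the directory already exists). Readers of the
# directory see the union of all part files.
pq.write_table(new_rows, "sales_dataset/part-2024-01-15.parquet")
everything = pq.read_table("sales_dataset/")
```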


LakeBench Now Available

Miles Cole makes an announcement:

I’m excited to formally announce LakeBench, now at v0.3: the first Python-based, multi-modal benchmarking library that supports multiple data processing engines on multiple benchmarks. You can find it on GitHub and PyPI.

Traditional benchmarks like TPC-DS and TPC-H focus heavily on analytical queries, but they miss the reality of modern data engineering: building complex ELT pipelines. LakeBench bridges this gap by introducing novel benchmarks that measure not just query performance, but also data loading, transformation, incremental processing, and maintenance operations. The first such benchmark is called ELTBench and is initially available in light mode.

Click through to see how it works and grab a copy if you’re interested.


Accessing Delta Lake Tables as Iceberg Data

Matthew Hicks makes an announcement:

We’re thrilled to announce an exciting new Preview capability in OneLake: you can now automatically read Delta Lake tables using Apache Iceberg compatible readers, with no need for migration, copying, or manual conversion. This enhancement gives data engineers and analytics teams unprecedented flexibility in how they access and interact with their data.

This is pretty neat, given that Iceberg is the other popular format for data lakes.


Getting Started with CF.Cumulus Community Edition

Matt Collins shares a deployment guide:

For those who have been following along with our product CF.Cumulus, we have been gearing up for some exciting developments and want to give more power and independence to users. As such, we’re putting together some comprehensive “How-to” guides to simplify the deployment process for Community Edition users.

This deployment guide walks you through setting up CF.Cumulus with the Azure resources shown in the post.

Click through for the guide.


OneLake Security and Shortcuts

Aaron Merrill explains how OneLake security works when you introduce shortcuts:

OneLake allows for security to be defined once and enforced consistently across Microsoft Fabric. One of its standout features is its ability to work seamlessly with shortcuts, offering users the flexibility to access and organize data from different locations while maintaining robust security controls. In this blog post, we will look at how OneLake security is integrated with shortcuts, explain the distinction between passthrough and delegated auth modes for shortcuts, and walk through an example use case.

Read on for an overview of OneLake shortcuts, as well as different security models around them.


Building an ML-Friendly Data Lake with Apache Iceberg

Anant Kumar designs a data lake:

As companies collect massive amounts of data to fuel their artificial intelligence and machine learning initiatives, finding the right data architecture for storing, managing, and accessing such data is crucial. Traditional data storage practices are likely to fall short of the scale, variety, and velocity required by modern AI/ML workflows. Apache Iceberg steps in as a strong open-source table format for building solid and efficient data lakes for AI and ML.

Click through for a primer on Iceberg, how to set up a fairly simple data lake, and some functionality that can help in model training.
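If you want to experiment before committing to a design, a minimal PySpark sketch of standing up an Iceberg table looks something like the following. The catalog name, warehouse path, runtime package version, and table schema are all assumptions for illustration; the article walks through real-world configuration.

```python
from pyspark.sql import SparkSession

# Minimal local Iceberg setup using a path-based Hadoop catalog.
# The runtime package version and warehouse path are assumptions.
spark = (
    SparkSession.builder
    .appName("iceberg-ml-lake")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.ml")

# A hypothetical feature table, partitioned by day so that training
# jobs can prune reads down to the date ranges they need.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.ml.features (
        user_id   BIGINT,
        features  ARRAY<DOUBLE>,
        label     DOUBLE,
        event_ts  TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")
```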


Reading Delta Tables via SQL Code in a Microsoft Fabric Python Notebook

Gilbert Quevauvilliers writes a SQL statement:

I come from a T-SQL background, so using SQL makes it easy for me to work with data.

There are multiple ways to use SQL in a PySpark notebook, but when I started using a Python notebook, it was not so straightforward.

In this blog post I will show you how I use SQL code.

As mentioned previously, I am by no means an expert; I typically find a way that works, is fast, and doesn’t consume a lot of capacity. If that works consistently for me, then that is how I go about it.

Click through for the solution, which uses DuckDB. As such, the SQL syntax isn’t T-SQL; it’s much closer to PostgreSQL’s dialect. But it does do a great job of interacting with Parquet files and Delta tables.
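Independent of the Fabric specifics, the general shape of the approach looks something like this sketch; the table path and column names here are placeholders, and Gilbert's post covers the Fabric-specific paths and setup.

```python
import duckdb

con = duckdb.connect()

# DuckDB's delta extension provides delta_scan() for reading Delta tables.
con.sql("INSTALL delta; LOAD delta;")

# Query the Delta table directly with SQL and pull the result back into
# a pandas DataFrame. Path and columns are hypothetical.
df = con.sql("""
    SELECT SalesRegion, SUM(Amount) AS TotalAmount
    FROM delta_scan('/lakehouse/default/Tables/sales')
    GROUP BY SalesRegion
""").df()
print(df)
```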


Two Direct Lakes in Microsoft Fabric

Nikola Ilic does a bit of digging:

Before you proceed, in case you don’t know what Direct Lake is, I’ve got you covered in this article, where you can learn and understand various Direct Lake concepts, as well as in which scenarios you might consider implementing Direct Lake semantic models. Now that you know what Direct Lake is, let’s digest the latest news…

A couple of days ago, I was reading the official blog post about the latest enhancement to the Direct Lake storage mode for semantic models in Microsoft Fabric. The official blog post can be found here.

Click through for that announcement and what it means.


Spring Cleaning for Lakehouse Tables with VACUUM

Chen Hirsh says it’s time to do a bit of cleanup:

Delta tables create new files for every change made to the table (insert, update, delete). You can use the old files to “time travel” – to query or restore older versions of your table. This is a wonderful feature, but over time, these files accumulate in your storage and will increase your storage costs.

Read on for a primer on the VACUUM command, how frequently you might want to run it, and how much history you want to retain. The example is specifically about Databricks, but the mechanics work the same way in other lakehouse platforms like Microsoft Fabric.
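As a quick reference, running VACUUM from a notebook looks something like the sketch below. The table name is hypothetical, and the retention window should match how far back you need time travel to reach.

```python
from delta.tables import DeltaTable

# Assumes an existing Spark session (as in a Databricks or Fabric
# notebook) with Delta Lake configured, and a hypothetical table
# named sales.orders. Retaining 168 hours keeps 7 days of history
# for time travel; older files become eligible for deletion.
dt = DeltaTable.forName(spark, "sales.orders")
dt.vacuum(168)

# Equivalent SQL form:
# spark.sql("VACUUM sales.orders RETAIN 168 HOURS")
```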
