Press "Enter" to skip to content

Category: Data Lake

Exposing Materialized Views in Microsoft Fabric Lakehouses

Ed Lima makes some data available to other tools:

In today’s data-driven world, the ability to quickly expose data through modern APIs is crucial. Microsoft Fabric’s API for GraphQL combined with Materialized Lake Views offers a powerful solution that bridges the gap between your Fabric Lakehouse data and application developers who need fast, flexible access to your data.

In this guide, we’ll walk you through how to create a materialized view in a Lakehouse and expose it through a GraphQL API—all within the Microsoft Fabric ecosystem. This approach gives you the best of both worlds: the performance optimization of materialized views and the developer-friendly querying capabilities of GraphQL.

I’d say one interesting reason why you might want to do this is to feed data to products like Teams, Power Automate, or Copilot Studio. In those cases, having the data accessible via GraphQL is easier than working with finicky connectors that may or may not exist.
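To give a sense of the developer side, here is a minimal sketch of calling a Fabric GraphQL endpoint from Python. Everything specific is a placeholder: the endpoint URL, the `salesSummaries` field, and the column names are assumptions, since Fabric generates the actual schema from whatever Lakehouse objects you choose to expose.

```python
import requests

# Placeholder endpoint and token; the real endpoint comes from the API for
# GraphQL item in Fabric, and the token from Microsoft Entra ID (e.g. MSAL).
ENDPOINT = "https://<your-fabric-graphql-endpoint>/graphql"
TOKEN = "<entra-id-access-token>"

# Hypothetical query against a materialized lake view exposed as salesSummaries.
query = """
query {
  salesSummaries(first: 10) {
    items {
      region
      totalSales
    }
  }
}
"""

resp = requests.post(
    ENDPOINT,
    json={"query": query},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["salesSummaries"]["items"])
```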

Stream or Batch Ordering with Apache Iceberg

Jack Vanlightly shows some tradeoffs:

Today I want to talk about stream analytics, batch analytics and Apache Iceberg. Stream and batch analytics work differently, but both can be built on top of Iceberg, and due to their differences there can be a tug-of-war over the Iceberg table itself. In this post I am going to use two real-world systems, Apache Fluss (streaming tabular storage) and Confluent Tableflow (Kafka-to-Iceberg), as a case study for these tensions between stream and batch analytics.

Read on for a summary of how two opposite ideas can both be perfectly reasonable.
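One concrete way to see the tension is to inspect a table’s snapshot log, since streaming and batch writers both have to serialize their commits into the same linear history. A minimal sketch using PyIceberg, where the catalog name and the `db.events` table are assumptions:

```python
from pyiceberg.catalog import load_catalog

# Assumes a catalog configured via ~/.pyiceberg.yaml or environment
# variables; the catalog and table names here are placeholders.
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Every writer, streaming or batch, appends to one linear snapshot log,
# so frequent small streaming commits and large batch rewrites or
# compactions end up competing over the same chain.
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)
```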

What’s Old is New Again: Lakebases

Daniel Janik notes the cyclical nature of things:

For years, the narrative pushed was that traditional relational databases were ill-suited for the scale and complexity of modern BI solutions. The marketing was something like: “Databases don’t belong in BI; use Spark!” We embraced distributed computing frameworks, data lakes, and complex ETL pipelines to move data from operational databases into analytical engines. The idea was to separate transactional workloads from analytical ones to ensure performance and scalability. Spark, with its ability to handle massive datasets and flexible processing, became the darling of the data world.

“Remember, Sully, when I said you don’t need databases anymore?”

“Yeah, Matrix, I remember!”

“I lied.”

Incremental Data Load into Parquet Files from Python

Lee Asher loads some data:

Parquet is a column-oriented open-source storage format increasingly used for “big data” analytics. Yet despite its growing popularity as a native format for data lakes and data warehouses, tools for maintaining these environments remain scarce. Getting data from a SQL environment into Parquet isn’t difficult – but how do we maintain that data over time, keeping it current? In other words, if we already have an existing Parquet file, how can we efficiently append new data to it?

In this article, we’ll introduce the Parquet format, explain some strategies for incrementally updating a Parquet repository, and, with a simple Python script, implement a nightly-feed update process.

Not listed here is one word that I expected: Delta, because that’s how we normally do incremental data modification in Parquet data. Either that or Apache Iceberg. Lee shows us a different route that can work.
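Since an individual Parquet file is immutable, the usual move is to write each new batch as an additional file in a dataset directory and read the directory as one logical table. A minimal sketch of that pattern with pyarrow (the schema and paths are illustrative, not Lee’s exact script):

```python
import os
from datetime import date

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Pretend this batch came from tonight's SQL extract; the schema is made up.
new_rows = pa.table({
    "order_id": [1001, 1002],
    "order_date": [date.today(), date.today()],
    "amount": [19.99, 42.50],
})

# Parquet files are immutable, so each nightly feed lands as a new file
# in the dataset directory rather than rewriting an existing file.
os.makedirs("orders", exist_ok=True)
pq.write_table(new_rows, f"orders/feed_{date.today():%Y%m%d}.parquet")

# Readers treat the whole directory as one logical table.
dataset = ds.dataset("orders/", format="parquet")
print(dataset.to_table().num_rows)
```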

LakeBench Now Available

Miles Cole makes an announcement:

I’m excited to formally announce LakeBench, now at v0.3, the first Python-based multi-modal benchmarking library that supports multiple data processing engines on multiple benchmarks. You can find it on GitHub and PyPI.

Traditional benchmarks like TPC-DS and TPC-H focus heavily on analytical queries, but they miss the reality of modern data engineering: building complex ELT pipelines. LakeBench bridges this gap by introducing novel benchmarks that measure not just query performance, but also data loading, transformation, incremental processing, and maintenance operations. The first such benchmark is called ELTBench and is initially available in light mode.

Click through to see how it works and grab a copy if you’re interested.

Accessing Delta Lake Tables as Iceberg Data

Matthew Hicks makes an announcement:

We’re thrilled to announce an exciting new Preview capability in OneLake: you can now automatically read Delta Lake tables using Apache Iceberg-compatible readers, with no need for migration, copying, or manual conversion. This enhancement gives data engineers and analytics teams unprecedented flexibility in how they access and interact with their data.

This is pretty neat, given that Iceberg is the other popular format for data lakes.
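As a rough sketch of what “no migration needed” might look like from the reader’s side, here is PyIceberg loading a table directly from a metadata file. The OneLake path is a placeholder and the exact mechanics (including authentication) are in the announcement, so treat this as an assumption rather than the documented flow:

```python
from pyiceberg.table import StaticTable

# Placeholder OneLake path: the Preview surfaces Iceberg metadata for a
# Delta table so Iceberg readers can load it without copying or converting.
metadata_path = (
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/"
    "<lakehouse>.Lakehouse/Tables/sales/metadata/<latest>.metadata.json"
)

table = StaticTable.from_metadata(metadata_path)
print(table.schema())

# Requires filesystem credentials for OneLake (e.g. via adlfs) to scan data.
df = table.scan(limit=10).to_pandas()
print(df)
```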

Getting Started with CF.Cumulus Community Edition

Matt Collins shares a deployment guide:

For those who have been following along with our product CF.Cumulus, we have been gearing up for some exciting developments and want to give more power and independence to users. As such, we’re putting together some comprehensive “How-to” guides to simplify the deployment process for Community Edition users.

This deployment guide walks you through setting up CF.Cumulus with the Azure Resources depicted below.

Click through for the guide.

OneLake Security and Shortcuts

Aaron Merrill explains how OneLake security works when you introduce shortcuts:

OneLake allows for security to be defined once and enforced consistently across Microsoft Fabric. One of its standout features is its ability to work seamlessly with shortcuts, offering users the flexibility to access and organize data from different locations while maintaining robust security controls. In this blog post, we will look at how OneLake security is integrated with shortcuts, explain the distinction between passthrough and delegated auth modes for shortcuts, and look at an example use case.

Read on for an overview of OneLake shortcuts, as well as different security models around them.

Building an ML-Friendly Data Lake with Apache Iceberg

Anant Kumar designs a data lake:

As companies collect massive amounts of data to fuel their artificial intelligence and machine learning initiatives, finding the right data architecture for storing, managing, and accessing such data is crucial. Traditional data storage practices are likely to fall short of the scale, variety, and velocity required by modern AI/ML workflows. Apache Iceberg steps in as a strong open-source table format for building solid and efficient data lakes for AI and ML.

Click through for a primer on Iceberg, how to set up a fairly simple data lake, and some functionality that can help in model training.
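For a taste of the setup step, here is a minimal sketch of creating and loading an Iceberg table for feature data with PyIceberg; the catalog configuration, names, and schema are illustrative rather than taken from the article:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Assumes a catalog configured elsewhere (e.g. ~/.pyiceberg.yaml);
# every name and the toy schema below are placeholders.
catalog = load_catalog("default")

features = pa.table({
    "user_id": pa.array([1, 2], type=pa.int64()),
    "clicks_7d": pa.array([14, 3], type=pa.int32()),
    "label": pa.array([1, 0], type=pa.int32()),
})

table = catalog.create_table("ml.training_features", schema=features.schema)
table.append(features)

# Each append creates a snapshot, which is what makes training sets
# reproducible: you can time-travel back to the exact data a model saw.
```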
