Data Lake – Page 5 – Curated SQL

Designing for Direct Lake Mode

Published 2024-04-01 by Kevin Feasel

Since the introduction of Power Pivot for Excel, SQL Server Analysis Services Tabular, Azure Analysis Services and Power BI; the native mode for storing data in a semantic data model (previously called a “dataset” in Power BI) has been a proprietary file structure consisting of binary and XML files. These file structures were established in the early days of multidimensional SSAS back in 2000 and 2005. When an Import mode model is published to the Power BI service, deployed to an SSAS server or when Power BI Desktop is running, data for the model is loaded into memory where it remains as long as the service is running. When users interact with a report or when DAX queries are run against the model, results are retrieved very quickly from the data residing in memory. There are some exceptions for very large models or when many models in the service don’t all fit into memory at the same time, the service will page some or all of the model in and out of memory to make sure that the most-often used model pages remain in memory for the next user request. But, for argument’s sake, the entire semantic model sits in memory, waiting for the next report or user request.

Rather than the proprietary SSAS file structure, Direct Lake models use the native Delta-parquet files that store structured data tables for a Fabric lakehouse or warehouse in One Lake. And rather than making a copy of the data in memory, the semantic model is a metadata structure that shares the same Delta-parquet file storage. As soon as a report runs against a model, all of the model data is paged into memory which then behaves a lot like an Import mode model. This means than while the model remains in memory, performance should about the same as Import, with a few exceptions.

Read on to see what the capabilities of Direct Lake mode are today, as well as a few design considerations for your Microsoft Fabric architecture.

Comments closed

Data Management with Open Table Formats

Published 2024-03-07 by Kevin Feasel

Anandaganesh Balakrishnan covers a few open-source products and formats:

Apache Iceberg is an open-source table format designed for large-scale data lakes, aiming to improve data reliability, performance, and scalability. Its architecture introduces several key components and concepts that address the challenges commonly associated with big data processing and analytics, such as managing large datasets, schema evolution, efficient querying, and ensuring transactional integrity. Here’s a deep dive into the core components and architectural design of Apache Iceberg:

Click through for a review of Iceberg, Hudi, and the Delta Lake format.

Comments closed

Time Travel in Delta Tables

Published 2023-12-22 by Kevin Feasel

Manish Mishra shows off some of the query capabilities with delta tables:

Delta Time Travel is a feature that is provided by Delta Lake. Delta time travel allows the user to switch to the previous version of the delta table.

Some of the benefits of Delta Time Travel are:

Historical Data Analysis

Rollback to the previous version in case of new data quality is not valid

Supports Schema Evolution

Click through for examples of each of these.

Comments closed

Looping through Lakehouses in Microsoft Fabric Spark Jobs

Published 2023-12-20 by Kevin Feasel

Dennes Torres builds a loop:

I have published videos and articles before about Lakehouse maintenance. In this article I want to address a missing point for a lot of Fabric administrators: How to do maintenance on multiple lakehouses that are located in different workspaces.

One of the videos I have published explains the maintenance of multiple lakehouses, but only addresses maintenance in a single workspace. Is it a good idea to keep multiple lakehouses in the same workspace? Probably not.

Click through for the process.

Comments closed

Generating Fabric Delta Tables from Power BI Semantic Models

Published 2023-12-19 by Kevin Feasel

Nikola Ilic is excited:

A few days ago, while preparing materials for the customer training on Microsoft Fabric, I stumbled upon a very interesting article at Microsoft Learn. The article describes how to integrate Power BI semantic models (aka datasets) into OneLake.

At first glance, this doesn’t sound like something epic, but when I started thinking more and more about it, I realized that this really might be a huge thing in many different scenarios. First of all, at the moment of writing, this feature is still in preview – this means, it can change to some extent in the coming months, before eventually becoming GA. Nevertheless, I decided to take a shot and explore what can be done with OneLake integration for semantic models.

Read on to learn more about what this is doing and what you can do with it.

Comments closed

Delta Table Incremental Refresh in Power BI

Published 2023-12-18 by Kevin Feasel

Chris Webb shows off a bit of functionality:

One of the coolest features in Fabric is Direct Lake mode, which allows you to build Power BI reports directly on top of Delta tables in your data lake without having to wait for a semantic model to refresh. However not everyone is ready for Fabric yet so there’s also a lot of interest in the new DeltaLake.Table M function which allows Power Query (in semantic models or dataflows) to read data from Delta tables. If you currently have a serving layer – for example Synapse Serverless or Databricks SQL Warehouse – in between your existing lake house and your import mode Power BI semantic models then this new function could allow you to remove it, to reduce complexity and cut costs. This will only be a good idea, though, if refresh performance isn’t impacted and incremental refresh can be made to work well.

Click through to learn more about the performance of this operation and how it all works.

Comments closed

Uploading Data into Microsoft Fabric

Published 2023-12-14 by Kevin Feasel

Eugene Meidinger continues learning about Azure Fabric:

I want to preface that a lot of the issues I run into below are because of my own ignorance around the tooling, and a lot of the detail I include is to show what that ignorance looks like, since many people reading this might be used to Fabric or at least data engineering.

Read on to learn about Eugene’s mistakes so you don’t make them yourself.

Comments closed

An Overview of Lakehouses in Microsoft Fabric

Published 2023-12-11 by Kevin Feasel

Kevin Chant invites you to a swank lakhouse:

By the end of this post, you will have a good overview of Microsoft Fabric Data Lakehouses, including CI/CD options. In addition, where your SQL Server background can prove to be useful and where some Power BI knowledge can come in handy.

Plus, I share plenty of links in this post. For instance, there are a couple of useful resources to help you get started towards the bottom of this post.

Click through for the article.

Comments closed

All about Lakehouses in Microsoft Fabric

Published 2023-12-07 by Kevin Feasel

Tomaz Kastrun gives us the skinny with multiple posts in his Advent of Microsoft Fabric. Day 3 introduces the lakehouse:

Lakehouse is cost-effective and optimised storage, supporting all types of data and file formats, structured and unstructured data, and helps you govern the data, giving you better data governance. With optimised and concurrent reads and writes, it gives outstanding performance by also reducing data movement and minimising redundant copy operations. Furthermore, it gives you a user-friendly multitasking experience in UI with retaining your context, not losing your running operations and working on multiple things, without accidentally stopping others.

Day 4 covers Delta format:

Yesterday we looked into lakehouse and learned that Delta tables are the storing format. So, let’s explore what and how we can go around understanding and working with delta tables. But first we must understand delta lake.

Day 5 covers data ingest:

We have learned about delta lake and delta tables. But since we have uploaded the file directly, let’s explore, how we can also get the data into lakehouse.

Click through for all three posts.

Comments closed

Lakehouse Management in Fabric via mssparkutils

Published 2023-11-21 by Kevin Feasel

Sandeep Pawar scripts out some lakehouse work:

At MS Ignite, Microsoft unveiled a variety of new APIs designed for working with Fabric items, such as workspaces, Spark jobs, lakehouses, warehouses, ML items, and more. You can find detailed information about these APIs here. These APIs will be critical in the automation and CI/CD of Fabric workloads.

With the release of these APIs, a new method has been added to the mssparkutils library to simplify working with lakehouses. In this blog, I will explore the available options and provide examples. Please note that at the time of writing this blog, the information has not been published on the official documentation page, so keep an eye on the documentation for changes.

This looks to be quite useful for CI/CD work.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

Category: Data Lake