Data Lake – Page 2 – Curated SQL

Writing Data into a Microsoft Fabric Lakehouse via Notebook

Published 2025-03-12 by Kevin Feasel

Since Lakehouse is one of the key items within Microsoft Fabric, it is important to know how to write data into it in various formats and using different tools. One of the most common tools is notebooks, as they provide great flexibility and speed for development and testing with graphical outputs. In this article, I want to focus primarily on the following types of notebooks:

PySpark

Python

Click through to see how it works in both notebook types.

Comments closed

Table Compaction in Apache Spark

Published 2025-02-27 by Kevin Feasel

Miles Cole groups things together:

If there is anything that data engineers agree about, it’s that table compaction is important. Often one of the first big lessons that folks will learn early on is that not compacting tables can present serious performance issues: you’ve gotten your lakehouse pilot approved and it’s been running for a couple months in production and you find that both reads and writes are getting slower and slower while your data volumes have not increased drastically. Guess what, you almost surely have a “small file problem”.

What engineers won’t always sing the same tune on is how and when to perform table compaction.

Read on for a dive into the power of compaction (converting a large number of small files into a small number of large files) and plenty of tips along the way.
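To make the "many small files into few large files" idea concrete, here is a minimal sketch of the planning step only: grouping small files into batches that each fit under a target size. The function name `plan_compaction`, the file names, and the sizes are all hypothetical, and this is not how Delta Lake's `OPTIMIZE` command or the linked article does it. It only illustrates the grouping logic.

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Group small files into batches whose combined size stays at or below target_bytes.

    Hypothetical helper for illustration; real engines make this decision internally.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    # Sort by size, then name, so the plan is deterministic.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (kv[1], kv[0])):
        if size >= target_bytes:
            continue  # Already large enough; leave it alone.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size

    if current:
        groups.append(current)

    # Rewriting a single file gains nothing, so keep only multi-file batches.
    return [group for group in groups if len(group) > 1]


# Made-up file names and sizes (in MB for readability).
sizes = {"part-a": 10, "part-b": 20, "part-c": 30, "part-d": 100, "part-e": 40}
plan = plan_compaction(sizes, target_bytes=60)
print(plan)
```

With these sample numbers, the three smallest files land in one batch, the 100 MB file is skipped because it already exceeds the target, and the leftover 40 MB file is not rewritten on its own.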

Data Retention for Data in the Microsoft Fabric Lakehouse

Published 2025-01-29 by Kevin Feasel

Kenneth Omorodion clears out some data:

More than before, organizations now aim for a well-defined approach to manage their data storage effectively. Some reasons for this include operational efficiency, cost management, regulatory compliance, and strategic decision-making. In this article, I will describe an approach to data retention management for Lakehouse files to manage data storage when the data exists as files in the Fabric Lakehouse.

There’s nothing built in but Kenneth makes it easy.
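As a rough illustration of what a file retention rule looks like, here is a minimal sketch that removes files older than a given number of days from a folder tree. It is not Kenneth's approach: the helper names, the 30-day window, and the sample files are assumptions, and the demo runs against a temporary directory rather than a real Lakehouse Files path.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List


def find_expired_files(root: Path, retention_days: int, now: float) -> List[Path]:
    """Return files under root whose modification time is older than the retention window."""
    cutoff = now - retention_days * 86400
    return sorted(p for p in root.rglob("*") if p.is_file() and p.stat().st_mtime < cutoff)


def apply_retention(root: Path, retention_days: int, now: float, dry_run: bool = True) -> List[Path]:
    """List expired files and, unless dry_run is set, delete them."""
    expired = find_expired_files(root, retention_days, now)
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired


# Demo against a throwaway folder with one old file and one fresh file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_file = root / "2024" / "old.csv"
    new_file = root / "2025" / "new.csv"
    for f in (old_file, new_file):
        f.parent.mkdir(parents=True)
        f.write_text("id,value\n1,2\n")

    now = time.time()
    forty_days_ago = now - 40 * 86400
    os.utime(old_file, (forty_days_ago, forty_days_ago))  # Backdate the old file.

    removed = apply_retention(root, retention_days=30, now=now, dry_run=False)
    removed_names = [p.name for p in removed]
    remaining_names = sorted(p.name for p in root.rglob("*") if p.is_file())

print(removed_names, remaining_names)
```

Defaulting `dry_run` to `True` is a deliberate choice here: a retention job that deletes by default is easy to point at the wrong folder.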

Data Lakes, Warehouses, and Lakehouses

Published 2025-01-06 by Kevin Feasel

Noa Shavit disambiguates three terms:

A data warehouse is a repository and platform for storing, querying, and manipulating data. Warehouses are particularly suited for structured data used for decision support and business intelligence. Modern data warehouses have become more efficient, flexible, and scalable (particularly in the context of massively parallel processing and distributed computation), but they still bear the mark of their early development in the previous century.

The data warehouse concept dates back to data marts in the 1970s. After a long incubation period, the idea began to bear fruit commercially at IBM in the late 1980s and early 1990s. Data warehousing improved on the inefficiency of data marts, siloed data stores maintained by individual departments.

Click through to learn more about each of the three concepts and how they relate.

Implementing a Star Schema in a Microsoft Fabric Lakehouse

Published 2024-09-30 by Kevin Feasel

Nikola Ilic builds a lakehouse:

But, what is a star schema in the first place? I have good and bad news for you :)… The bad news is that I’m not covering it in this article because this one focuses on explaining how to implement a star schema in Fabric Lakehouse (assuming that you already know what star schema is). The good news is: I’ve already written about it, so go and read this article first, if you’re not sure what star schema represents in the world of data modeling…

In one of the previous articles, I also showed how to implement a star schema in Power BI, by leveraging Power Query Editor.

Now, let’s get our hands dirty and build a star schema by using PySpark in the Fabric notebook!

Click through to see how.
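For readers who want the shape of the transformation before clicking through, here is a minimal plain-Python sketch (not PySpark, and not Nikola's code) that splits flat rows into one dimension table and one fact table linked by a surrogate key. The function `build_star`, the column names, and the sample rows are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_star(
    rows: List[dict], dim_columns: List[str], key_name: str
) -> Tuple[List[dict], List[dict]]:
    """Split flat rows into a dimension (unique attribute combinations) and a fact table."""
    dim_keys: Dict[tuple, int] = {}
    dimension: List[dict] = []
    fact: List[dict] = []

    for row in rows:
        attrs = tuple(row[c] for c in dim_columns)
        if attrs not in dim_keys:
            # Assign the next surrogate key to a new attribute combination.
            dim_keys[attrs] = len(dim_keys) + 1
            dimension.append({key_name: dim_keys[attrs], **dict(zip(dim_columns, attrs))})
        # The fact row keeps the measures and swaps the attributes for the key.
        fact_row = {k: v for k, v in row.items() if k not in dim_columns}
        fact_row[key_name] = dim_keys[attrs]
        fact.append(fact_row)

    return dimension, fact


# Made-up flat sales data.
sales = [
    {"product": "Bike", "category": "Sports", "amount": 100},
    {"product": "Helmet", "category": "Sports", "amount": 25},
    {"product": "Bike", "category": "Sports", "amount": 120},
]
dim_product, fact_sales = build_star(sales, ["product", "category"], "product_key")
print(dim_product)
print(fact_sales)
```

In a Fabric notebook the same idea would typically be expressed with DataFrame operations and written out as Delta tables, which is what the linked article walks through.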

Microsoft Fabric Direct Lake and Reframing Operations

Published 2024-09-10 by Kevin Feasel

Reza Rad changes the frame:

Power BI offers a new type of connection to Microsoft Fabric Lakehouse or Warehouse, called Direct Lake. The Direct Lake connection acts like DirectQuery and won’t need the data to be refreshed. However, the Power BI semantic model has refresh settings that can be turned on or off. In this article and video, you will learn about the Refresh settings for the Power BI semantic model that is connected using a Direct Lake connection, what that is, and why it is called Reframe.

Read on to learn more, or to check out the video.

Finding Columns in Memory in Power BI Direct Lake Mode

Published 2024-08-20 by Kevin Feasel

Chris Webb goes searching:

As you probably know, in Power BI Direct Lake mode column data is only loaded into memory when it is needed by a query. I gave a few examples of this – and how to monitor it using DMVs – in this blog post from last year. But which columns are loaded into memory in which circumstances? I was thinking about this recently and realised I didn’t know for sure, so I decided to do some tests. Some of the results were obvious, some were a surprise.

Read on for the answer.

Building Real-Time Dashboards from Lakehouse Data in Microsoft Fabric

Published 2024-08-09 by Kevin Feasel

Dennes Torres gets around a limitation:

Real-Time dashboards are a great feature in Real Time Intelligence experience to monitor our data. However, by default it’s made to work only with Kusto Databases. The options to create a real time dashboard or to define its data source only accept Kusto Databases.

What if we would like to see in real time the information we have in a lakehouse as well? Let’s discover a solution for this.

Read on for the solution.

Microsoft Fabric Lakehouse Access Control

Published 2024-08-07 by Kevin Feasel

Koen Verbeeck lets us into the lakehouse:

We’re doing a proof-of-concept with Microsoft Fabric, building our data model in a lakehouse. We’d like to give people access to the data inside it so they can build their own reports with whatever tool they want. Is there an easy way to share access to a lakehouse (preferably not by giving access to the entire workspace)?

Read on to learn how.

Reading a Lakehouse Table from another Microsoft Fabric Workspace

Published 2024-08-07 by Kevin Feasel

Gilbert Quevauvilliers spans the gap:

I was doing some work recently for a customer and they had data stored in different Lakehouses, which were in a different App Workspace.

I was pleasantly surprised that this can be quite easy to do.

In my example below I am going to show you how my notebook can read a Lakehouse table when the notebook is not attached to any Lakehouse.

It’s good that this is so easy to do, considering that current advice leans toward having multiple workspaces and not cramming everything into one.
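A common way to do this kind of cross-workspace read is to address the table by its full OneLake path rather than through an attached Lakehouse. The sketch below only builds that path. The helper `onelake_table_path` and the workspace, lakehouse, and table names are hypothetical, and the name-based URI layout is an assumption that may not match Gilbert's example, so check it against your own tenant.

```python
def onelake_table_path(workspace: str, lakehouse: str, table: str) -> str:
    """Build an abfss URI for a Lakehouse table (assumed name-based OneLake layout)."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/Tables/{table}"
    )


# Hypothetical names for illustration only.
path = onelake_table_path("Sales", "Gold", "dim_customer")
print(path)

# Inside a Fabric Spark notebook, where `spark` is already defined,
# the table could then be loaded with something like:
#   df = spark.read.format("delta").load(path)
```

Workspace or lakehouse names containing spaces or special characters may not work in this form. In that case the ID-based form of the path is usually the safer option.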

Category: Data Lake