Press "Enter" to skip to content

Category: Data Lake

Living in the Lakehouse

James Serra defines the term “data lakehouse”:

As a follow-up to my blog Data Lakehouse & Synapse, I wanted to talk about the various definitions I am seeing about what a data lakehouse is, including a recent paper by Databricks.

Databricks uses the term “Lakehouse” in their paper (see Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics), which argues that the data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse. Instead of the two-tier data lake + relational data warehouse model, you will just need a data lake, which is made possible by implementing data warehousing functionality over open data lake file formats.

While I agree there may be some uses cases where technical designs may allow Lakehouse systems to completely replace relational data warehouses, I believe those use cases are much more limited than this paper suggests.

James is a sharp and perceptive fellow, so read the whole thing.

Comments closed

Q&A about the Lakehouse

Terry McCann posts Q&A from Simon Whiteley’s session on Lakehouse models in Spark 3.0:

“WHILE ALL THE HADOOP PROVIDERS PROMOTED THE DATALAKE PARADIGM BACK THEN, HOW THE INDUSTRY AND THE OTHER DATA LAKE PROVIDERS ARE SHIFTING TO/CONSIDERING THE LAKE HOUSE PARADIGM?“

It’s a direction that most providers are heading in, albeit under the “unified analytics” or “modern warehouse” name rather than the “lakehouse”. But most big relational engines are moving to bring in spark/big data capabilities, other lake providers are looking to expand their SQL coverage. It’s a bit of a race to who gets to the “can do both sides as well as a specialist tool” point first. Will we see other tools championing it as a “lakehouse”, or is that term now tied too closely as a “vendor-specific” term coming from Databricks? We’ll see…

Click through for some good questions and thoughtful answers.

Comments closed

The Evolving Lakehouse

Simon Whiteley looks at the current status of the Lakehouse model:

We have discussed in the past this idea of the lakehouse, the aspirational target of many analytics platforms these days of combining the huge power and potential of data lakes with the rigour, reliability and concurrency of a data warehouse. It’s an interesting concept but has, in the past, been firmly an aspiration.

In the world without lakehouses, we often see the “Modern Data Warehouse”, this two-phased approach to providing a holistic platform – we load our early data into a lake where we shape it and massage it into an understandable state. It is here we perform data science, exploratory data analysis, early sight analytics prototyping and various other functions that don’t quite fit into a data warehouse… but then we load our data into a relational store for serving to the business. This is where we can meet their demands for a rich SQL environment, auditable data models and rigorous change procedures. Essentially, we store data twice so that we can achieve the best of both worlds.

Definitely read Simon’s take on it. My take is that the Lakehouse concept will start to be useful to specific companies in about 2-3 years, as I don’t think the performance is there today.

Comments closed

Querying Data Lake Files in Power BI through Synapse Analytics

Wolfgang Strasser shows us how to integrate Azure Synapse Analytics and Power BI:

Sometimes however, would not it be nice to access the data lake in Direct Query mode – to get the most up to date information for every report view? I would say: yes … but how can you achieve this? The options natively provided by ADLS Gen2 and Power BI are not sufficient to solve this requirement. But: there are options to achieve this and, in this post, I would like to show you the possibilities using Azure Synapse Analytics to build a query layer on top of a ADLS Gen2 storage account.

Click through for a step-by-step walkthrough.

Comments closed

Delta Lake DML Internals

Tathagata Das, et al, take us through how Delta Lake handles update, delete, and merge operations:

`DELETE` works just like `UPDATE` under the hood. Delta Lake makes two scans of the data: the first scan is to identify any data files that contain rows matching the predicate condition. The second scan reads the matching data files into memory, at which point Delta Lake deletes the rows in question before writing out the newly clean data to disk.

After Delta Lake completes a `DELETE` operation successfully, the old data files are not deleted — they’re still retained on disk, but recorded as “tombstoned” (no longer part of the active table) in the Delta Lake transaction log. Remember, those old files aren’t deleted immediately because you might still need them to time travel back to an earlier version of the table. If you want to delete files older than a certain time period, you can use the `VACUUM` command.

Click through for a video as well as a blog post with the details.

Comments closed

Cloning Delta Lakes

Burak Yavuz and Pranav Anand show us how to clone Delta Lakes:

Clones are replicas of a source table at a given point in time. They have the same metadata as the source table: same schema, constraints, column descriptions, statistics, and partitioning. However, they behave as a separate table with a separate lineage or history. Any changes made to clones only affect the clone and not the source. Any changes that happen to the source during or after the cloning process also do not get reflected in the clone due to Snapshot Isolation. In Databricks Delta Lake we have two types of clones: shallow or deep.

Read on to learn the differences, as well as a few useful scenarios.

Comments closed

Spark SQL in Delta Lake

Kundan Kumarr walks us through some of the basic SQL operations you can perform with Delta Lake in Apache Spark:

Nowadays Delta lake is a buzz word in the Big Data world, especially among the spark developers because it relegates lots of issues found in the Big Data domain. Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It is evolving day by day and adds cool features in its every release. On 19th June 2020, Delta lake version 0.7.0 was released and this is the first release on Spark 3.x. This release involves important key features that can make the spark developer’s work easy.

One of the interesting key features in this release is the support for metastore-defined tables and SQL DDLs. So now we can define Delta tables in the Hive metastore and use the table name in all SQL operations. We can perform SQL DDLs to create tables, insert into tables, explicitly alter the schema of the tables, and so on. So in this blog, we will learn how we can perform SQL DDLs/DMLS/DQL in Delta Lake 0.7.0.

Click through for the examples.

Comments closed

Raw Data in the Data Lake

Steve Cardella uses wrestling as a metaphor where I would have used sewage:

Raw. Unfiltered. Data. The raw zone – it’s the dark underbelly of your data lake, where anything can happen. The CRM data just body-slammed the accounting data, while the HR data is taking a chair to the marketing data. It’s all a rumble for the championship belt, right? Oh, wait – we’re talking data lakes. Sorry. If the raw zone isn’t where data goes to duke it out, then what is the raw zone of a data lake? How should it be set up?

First, let’s take a time-out to give some context. A data lake is a central storage pool for enterprise data; we pour information into it from all kinds of sources. Those sources might include anything from databases to raw audio and video footage, in unstructured, semi-structured, and structured formats. A data warehouse, conversely, only houses structured data. The data lake is divided into one or more zones of data, with varying degrees of transformation and cleanliness (see this video for more: Data Lake Zones, Topology, and Security). The raw zone is the foundation upon which all other data lake zones are built.

Read on to understand the importance of raw data in a data lake, and the equal importance of making sure end users don’t see that stuff very often. Also, Steve gets bonus points for using my favorite term for the Aristotelian opposite of a data lake: the data swamp.

Comments closed

Data Lakes for Smaller Projects

Thomas Spicer explains that your data lake doesn’t need to be enormous to be useful:

We recently wrote an article debunking common myths about data lake architectures, data lake definitions, and data lake analytics. It is called What is a Data Lake? Get A Leg Up Avoiding The Biggest Myths.” In that article, we framed the current conversation about data lakes and how they fit within enterprise data strategies. This topic has historically been confusing and opaque for those wanting to get value from a data lake due to conflicting advice from consultants and vendors.  

One area that can be particularly confusing is the perception that lakes are only for “big data.” If you spend any time reading materials on lakes, you would think there is only one type and it would look like the Capsian Sea (it’s a lake despite “sea” in the name). People describe data lakes as massive, all-encompassing entities, designed to hold all knowledge. The good news is that lakes are not just for “big data” and you have more opportunities than ever to have them be part of your data stack.

Click through for Thomas’s argument.

Comments closed

Working with ADLS Gen 2 in Power Query

Marco Russo takes us through some ways to optimize performance when working with Azure Data Lake Storage Gen 2 from Power Query:

With Power Query you can apply filters to the list obtained by the File System View option, thus restricting the access to only those files (or a single file) you are interested in. However, there is no query folding of this filter. What happens is that every time you refresh the data source, the list of all these files is read by Power Query; the filters in M Query to the folder path and the file name are then applied to this list only client-side. This is very expensive because the entire list is also downloaded when the expression is initially evaluated just to get the structure of the result of the transformation.

A better way to manage the process is to specify in the URL the complete folder path to traverse the hierarchy, and get only the files that are interesting for the transformation – or the exact path of the file is you are able to do that. For example, the data lake I used had one file for each day, stored in a folder structure organized by yyyy\mm, so every folder holds up to 31 files (one month).

Read on for more advice in this vein.

Comments closed