
Category: Data Lake

Data Hubs, Warehouses, and Lakes

Trevor Legg compares and contrasts data hubs, data warehouses, and data lakes:

Data hubs, data warehouses, and data lakes are significant investment areas for data and analytics leaders and are vital to support increasingly complex, distributed, and varied data workloads.

Gartner finds that 57% of data and analytics leaders are investing in data warehouses, 46% are using data hubs, and 39% are using data lakes. However, they also found that these same data and analytics leaders don’t necessarily understand the difference between the three…

To best support specific business requirements, it’s vital to understand the difference and purpose of each type of structure, and the role it can play in modern data management infrastructure.

Click through for the definitions and comparisons.

Comments closed

Reading Delta Lake Tables from Power BI

Gerhard Brueckl checks out the Apache Parquet connector in Power BI, reading from a Delta Lake:

“Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”

However, Parquet is just a file format and does not really support you when it comes to data management. Common data manipulation (DML) operations like updates and deletes still need to be handled manually by the data pipeline. This was one of the reasons why Delta Lake (delta.io) was developed, alongside many other features like ACID transactions and proper metadata handling. If you are interested in the details, please follow the link above.

Click through for a demo.


Living in the Lakehouse

James Serra defines the term “data lakehouse”:

As a follow-up to my blog Data Lakehouse & Synapse, I wanted to talk about the various definitions I am seeing about what a data lakehouse is, including a recent paper by Databricks.

Databricks uses the term “Lakehouse” in their paper (see Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics), which argues that the data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse. Instead of the two-tier data lake + relational data warehouse model, you will just need a data lake, which is made possible by implementing data warehousing functionality over open data lake file formats.

While I agree there may be some use cases where technical designs may allow Lakehouse systems to completely replace relational data warehouses, I believe those use cases are much more limited than this paper suggests.

James is a sharp and perceptive fellow, so read the whole thing.


Q&A about the Lakehouse

Terry McCann posts Q&A from Simon Whiteley’s session on Lakehouse models in Spark 3.0:

“While all the Hadoop providers promoted the data lake paradigm back then, how are the industry and the other data lake providers shifting to, or considering, the lakehouse paradigm?”

It’s a direction that most providers are heading in, albeit under the “unified analytics” or “modern warehouse” name rather than the “lakehouse”. Most big relational engines are moving to bring in Spark/big data capabilities, while other lake providers are looking to expand their SQL coverage. It’s a bit of a race to see who gets to the “can do both sides as well as a specialist tool” point first. Will we see other tools championing it as a “lakehouse”, or is that term now tied too closely to Databricks as a “vendor-specific” term? We’ll see…

Click through for some good questions and thoughtful answers.


The Evolving Lakehouse

Simon Whiteley looks at the current status of the Lakehouse model:

We have discussed in the past this idea of the lakehouse, the aspirational target of many analytics platforms these days of combining the huge power and potential of data lakes with the rigour, reliability and concurrency of a data warehouse. It’s an interesting concept but has, in the past, been firmly an aspiration.

In the world without lakehouses, we often see the “Modern Data Warehouse”, this two-phased approach to providing a holistic platform – we load our early data into a lake where we shape it and massage it into an understandable state. It is here we perform data science, exploratory data analysis, early sight analytics prototyping and various other functions that don’t quite fit into a data warehouse… but then we load our data into a relational store for serving to the business. This is where we can meet their demands for a rich SQL environment, auditable data models and rigorous change procedures. Essentially, we store data twice so that we can achieve the best of both worlds.

Definitely read Simon’s take on it. My take is that the Lakehouse concept will start to be useful to specific companies in about 2-3 years, as I don’t think the performance is there today.


Querying Data Lake Files in Power BI through Synapse Analytics

Wolfgang Strasser shows us how to integrate Azure Synapse Analytics and Power BI:

Sometimes, however, wouldn’t it be nice to access the data lake in Direct Query mode, to get the most up-to-date information for every report view? I would say: yes … but how can you achieve this? The options natively provided by ADLS Gen2 and Power BI are not sufficient to meet this requirement. But there are options to achieve this and, in this post, I would like to show you the possibilities using Azure Synapse Analytics to build a query layer on top of an ADLS Gen2 storage account.

Click through for a step-by-step walkthrough.


Delta Lake DML Internals

Tathagata Das, et al., take us through how Delta Lake handles update, delete, and merge operations:

`DELETE` works just like `UPDATE` under the hood. Delta Lake makes two scans of the data: the first scan is to identify any data files that contain rows matching the predicate condition. The second scan reads the matching data files into memory, at which point Delta Lake deletes the rows in question before writing out the newly clean data to disk.

After Delta Lake completes a `DELETE` operation successfully, the old data files are not deleted — they’re still retained on disk, but recorded as “tombstoned” (no longer part of the active table) in the Delta Lake transaction log. Remember, those old files aren’t deleted immediately because you might still need them to time travel back to an earlier version of the table. If you want to delete files older than a certain time period, you can use the `VACUUM` command.
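To make the two-scan, copy-on-write behavior concrete, here is a minimal in-memory sketch in Python. It is a toy model only and not the Delta Lake API: the `ToyDeltaTable` class, its method names, and the `part-0000` file naming are all hypothetical, and the toy `vacuum` ignores the retention period that the real `VACUUM` command applies.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """Toy model of copy-on-write DELETE. Hypothetical; not the Delta Lake API."""
    files: dict = field(default_factory=dict)   # file name -> rows physically "on disk"
    active: set = field(default_factory=set)    # files that make up the current version
    log: list = field(default_factory=list)     # transaction log, one entry per commit
    _counter: int = 0

    def _new_file(self, rows):
        name = f"part-{self._counter:04d}"
        self._counter += 1
        self.files[name] = list(rows)
        return name

    def write(self, rows):
        name = self._new_file(rows)
        self.active.add(name)
        self.log.append({"add": [name], "remove": []})

    def delete(self, predicate):
        # Scan 1: find active files that contain at least one matching row.
        touched = sorted(f for f in self.active
                         if any(predicate(r) for r in self.files[f]))
        # Scan 2: rewrite each touched file without the matching rows.
        added = []
        for f in touched:
            kept = [r for r in self.files[f] if not predicate(r)]
            if kept:
                added.append(self._new_file(kept))
        # Commit: old files are tombstoned in the log but stay on "disk".
        self.active = (self.active - set(touched)) | set(added)
        self.log.append({"add": added, "remove": touched})

    def rows(self, version=None):
        """Replay the log up to `version` (inclusive) to read that snapshot."""
        commits = self.log if version is None else self.log[: version + 1]
        live = set()
        for commit in commits:
            live |= set(commit["add"])
            live -= set(commit["remove"])
        return sorted(r for f in live for r in self.files[f])

    def vacuum(self):
        """Physically drop every file that is not part of the current version."""
        for f in list(self.files):
            if f not in self.active:
                del self.files[f]


table = ToyDeltaTable()
table.write([1, 2, 3])             # version 0
table.write([4, 5, 6])             # version 1
table.delete(lambda r: r == 2)     # version 2

current = table.rows()
before_delete = table.rows(version=1)   # time travel via the tombstoned file
files_before_vacuum = len(table.files)
table.vacuum()
files_after_vacuum = len(table.files)
```

In this sketch, only the file containing the deleted row is rewritten, and reading version 1 still works until `vacuum` removes the tombstoned file. After that, the earlier snapshot can no longer be reconstructed, which is the trade-off the excerpt describes.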

Click through for a video as well as a blog post with the details.


Cloning Delta Lakes

Burak Yavuz and Pranav Anand show us how to clone Delta Lakes:

Clones are replicas of a source table at a given point in time. They have the same metadata as the source table: same schema, constraints, column descriptions, statistics, and partitioning. However, they behave as a separate table with a separate lineage or history. Any changes made to clones only affect the clone and not the source. Any changes that happen to the source during or after the cloning process also do not get reflected in the clone due to Snapshot Isolation. In Databricks Delta Lake we have two types of clones: shallow or deep.
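As a rough illustration of the two clone types, the Python sketch below models a table as a schema plus a list of data file references. The excerpt does not spell out the difference, so the sketch assumes the usual description: a shallow clone copies only metadata and keeps pointing at the source's data files, while a deep clone also copies the data files. `ToyTable`, `shallow_clone`, `deep_clone`, and the file names are hypothetical and are not part of any Delta Lake or Databricks API.

```python
from dataclasses import dataclass


@dataclass
class ToyTable:
    schema: tuple
    file_refs: list   # names of the data files this table version points at


def shallow_clone(source: ToyTable) -> ToyTable:
    # Assumption: copy metadata only; the clone references the source's files.
    return ToyTable(schema=source.schema, file_refs=list(source.file_refs))


def deep_clone(source: ToyTable, storage: dict, prefix: str) -> ToyTable:
    # Assumption: copy metadata and also copy every data file for the clone.
    new_refs = []
    for name in source.file_refs:
        new_name = f"{prefix}/{name}"
        storage[new_name] = list(storage[name])
        new_refs.append(new_name)
    return ToyTable(schema=source.schema, file_refs=new_refs)


# "storage" stands in for the underlying file system or object store.
storage = {"part-0": [1, 2], "part-1": [3]}
source = ToyTable(schema=("id",), file_refs=["part-0", "part-1"])

shallow = shallow_clone(source)
deep = deep_clone(source, storage, prefix="clone")

# A later change to the source adds a new file; neither clone sees it.
storage["part-2"] = [4]
source.file_refs.append("part-2")
```

Both clones keep the snapshot they were created from, matching the excerpt's point that later changes to the source are not reflected. The difference in this model is cost and independence: the shallow clone created no new files but depends on the source's files continuing to exist, while the deep clone doubled the stored data and stands alone.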

Read on to learn the differences, as well as a few useful scenarios.


Spark SQL in Delta Lake

Kundan Kumarr walks us through some of the basic SQL operations you can perform with Delta Lake in Apache Spark:

Nowadays Delta Lake is a buzzword in the Big Data world, especially among Spark developers, because it resolves many issues found in the Big Data domain. Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It is evolving day by day and adds cool features in every release. On 19th June 2020, Delta Lake version 0.7.0 was released, and this is the first release on Spark 3.x. This release includes important key features that can make the Spark developer’s work easier.

One of the interesting key features in this release is the support for metastore-defined tables and SQL DDL. So now we can define Delta tables in the Hive metastore and use the table name in all SQL operations. We can use SQL DDL to create tables, insert into tables, explicitly alter the schema of the tables, and so on. So in this blog, we will learn how we can perform SQL DDL, DML, and DQL in Delta Lake 0.7.0.

Click through for the examples.


Raw Data in the Data Lake

Steve Cardella uses wrestling as a metaphor where I would have used sewage:

Raw. Unfiltered. Data. The raw zone – it’s the dark underbelly of your data lake, where anything can happen. The CRM data just body-slammed the accounting data, while the HR data is taking a chair to the marketing data. It’s all a rumble for the championship belt, right? Oh, wait – we’re talking data lakes. Sorry. If the raw zone isn’t where data goes to duke it out, then what is the raw zone of a data lake? How should it be set up?

First, let’s take a time-out to give some context. A data lake is a central storage pool for enterprise data; we pour information into it from all kinds of sources. Those sources might include anything from databases to raw audio and video footage, in unstructured, semi-structured, and structured formats. A data warehouse, conversely, only houses structured data. The data lake is divided into one or more zones of data, with varying degrees of transformation and cleanliness (see this video for more: Data Lake Zones, Topology, and Security). The raw zone is the foundation upon which all other data lake zones are built.

Read on to understand the importance of raw data in a data lake, and the equal importance of making sure end users don’t see that stuff very often. Also, Steve gets bonus points for using my favorite term for the Aristotelian opposite of a data lake: the data swamp.
