
Category: Data Lake

Delta Lake Support in Azure Stream Analytics

Emma An makes an announcement:

Delta Lake has gained popularity in recent times due to its unique features and advantages over traditional data warehouses and other storage formats. For those already using a traditional data storage format or moving to a lakehouse architecture, Delta Lake can offer several compelling benefits that can further enhance the performance and capabilities of their data pipelines. Many Azure services are integrated with Delta Lake, and now you can use Azure Stream Analytics to write in Delta format.

In this blog, we will explain the native support of Delta Lake in Azure Stream Analytics, which can help users take their workload to the next level, providing a seamless and scalable solution for large-scale data processing and storage. It is easy to start, taking only a few clicks to create an end-to-end pipeline, and write to either a new or existing Delta table stored in Azure Data Lake Storage Gen2.

This is a nice addition to Stream Analytics and Emma shows two ways you can write out results in Delta Lake format.

Comments closed

Automated Delta Lake Maintenance in Synapse

Shalu Ganotra Chadha, et al., explain how to keep your Delta Lake tidy:

The useful features of Delta Lake come at the cost of requiring regular maintenance. Delta Lake requires periodic cleanup as it accumulates files over time with each upsert and retains previous snapshots of the data. They can quickly convert a small dataset (in MBs) to several GBs of storage. This is because deleted data is not really removed but retained as an older snapshot of the Delta Lake dataset.

Click through for two operations you can perform on a Delta Lake, as well as some recommendations on when to do what via the Genie Delta Lake Auto Maintenance scripts they provide.
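
The excerpt above doesn't name the two operations, so treat this as a rough sketch rather than a summary of the linked scripts. It assumes they are the usual Delta Lake pair: OPTIMIZE (compact small files) and VACUUM (purge files that no current snapshot needs). The helper name, the storage path, and the 168-hour retention window (Delta's default of seven days) are my own choices for illustration. You would run the resulting statements from a Spark session with Delta Lake enabled.

```python
def delta_maintenance_sql(table_path: str, retain_hours: int = 168) -> list[str]:
    """Build SQL that compacts a Delta table, then purges stale files.

    Hypothetical helper for illustration; it only builds strings and does
    not connect to Spark or touch any storage.
    """
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    # Delta Lake lets you address a table by path using delta.`<path>`.
    target = f"delta.`{table_path}`"
    return [
        # Compact many small files into fewer, larger ones.
        f"OPTIMIZE {target}",
        # Delete files no longer referenced by snapshots newer than the window.
        f"VACUUM {target} RETAIN {retain_hours} HOURS",
    ]


if __name__ == "__main__":
    # The account and container names here are placeholders.
    path = "abfss://lake@exampleaccount.dfs.core.windows.net/sales"
    for statement in delta_maintenance_sql(path):
        print(statement)
```

Running OPTIMIZE before VACUUM matters here: compaction writes new files and leaves the old small ones behind, and those only go away once a later VACUUM finds them outside the retention window.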


Thoughts on the Data Lakehouse Concept

Teo Lachev shares some thoughts:

There has been a lot of noise surrounding a data lakehouse nowadays, so I felt the urge to chime in. In fact, the famous guy in cube, Patrick LeBlanc, gave a great presentation on this subject to our Atlanta Power BI Group and you can find the recording here (I have to admit we could have done a better job with the recording quality, but we are still learning in the post-COVID era).

A lot of Teo’s thoughts relate to the data lake portion of the “lakehouse” metaphor and they’re worth reading. One small comment I’ll make is regarding the following statement:

And should you stage the data 1:1 from the source? In some cases, like the Get Data First aforementioned scenario, it might make sense. But in most cases, it would be much more efficient to stage the data in the shape you need it, which may necessitate joining multiple tables at the source (by the way, a relational server is the best place to handle joins).

The biggest benefit to getting data as “raw” as possible comes from the expectation that upstream data is ephemeral. What I mean is that, if you pull data on 2023-02-08, you’ll get some dataset. On 2023-04-08, you won’t be able to re-create that original dataset because your source will have changed, with inserts, updates, and deletes modifying the source.

Therefore, you want your copy of data into the Raw/Bronze layer to be as “opinion-free” as possible. This is because, if your data access code has a bug in it, you’ll never be able to re-create the old data. By contrast, if you have that raw data, you can start from there, re-apply your transformations, and replace your downstream results with corrected details. This is similar to the way a write-once service like Apache Kafka can benefit: you get history where there was none before.

So that’s the trade-off on the other side. Yeah, it’s typically more efficient to create queries and operate on the data before it goes into Raw, and you might need to do so for practical reasons, but there’s a risk of that code causing an irrevocable data problem and the risk goes up as you get more complicated operations.
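
To make the replay argument concrete, here is a toy sketch in plain Python. Everything in it is hypothetical: the in-memory dictionary stands in for a Raw/Bronze layer, and the "bug" is a transformation that silently drops refunds. The point is only that untouched raw snapshots let you re-run a corrected transformation over history.

```python
import json


def land_raw(raw_store: dict, pull_date: str, rows: list) -> None:
    """Store source rows exactly as pulled, keyed by pull date (write-once)."""
    if pull_date in raw_store:
        raise ValueError(f"raw snapshot for {pull_date} already exists")
    # Serialize so later changes to `rows` cannot alter the snapshot.
    raw_store[pull_date] = json.dumps(rows)


def rebuild(raw_store: dict, transform) -> dict:
    """Re-derive every downstream result from the untouched raw snapshots."""
    return {day: transform(json.loads(payload)) for day, payload in raw_store.items()}


def buggy_total(rows: list) -> int:
    # Bug: refunds (negative amounts) are silently dropped.
    return sum(r["amount"] for r in rows if r["amount"] > 0)


def fixed_total(rows: list) -> int:
    return sum(r["amount"] for r in rows)


raw = {}
land_raw(raw, "2023-02-08", [{"amount": 100}, {"amount": -30}])
land_raw(raw, "2023-04-08", [{"amount": 50}])

before = rebuild(raw, buggy_total)  # {'2023-02-08': 100, '2023-04-08': 50}
after = rebuild(raw, fixed_total)   # {'2023-02-08': 70, '2023-04-08': 50}
```

Had the buggy filter run before landing the data, the refund row would be gone for good and the February total could never be corrected.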


Use Cases for Multiple Data Lakes

James Serra explains why you might want multiple data lakes in an organization:

A question I get asked frequently from customers when discussing Data lake architecture is “Should I use one data lake for all my data, or multiple lakes?”. Ideally, you would use just one data lake, but I have seen many valid use cases where customers are using multiple data lakes. Here are some of those reasons:

I’d quibble with a couple of these (and given James’s intro, I’m not sure he’s fully on board with all of the reasons) but this is a good list of reasons why you might see several data lakes in an organization.


Reading the Data Lake with the Serverless Pool via OPENROWSET

Ryan Adams begins a series on reading data from the data lake:

There are two ways to read data inside Data Lake using the Synapse Serverless engine. In this article, we’ll look at the first method which uses OPENROWSET to query a path within the lake.

Synapse is a collection of tools with four different analytical engines (Dedicated Pool, Spark Pool, Serverless Pool, Data Explorer Pool). This gives you a lot of options for ingesting, transforming, storing, and querying your data. The article will focus on how you can use the Synapse Serverless Pool to query the data in your ADLS account.

Click through for a primer on the topic, as well as a demo video.
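
For a rough idea of the shape of such a query, here is a minimal sketch that builds one as a string. This is not Ryan's code. The helper name, storage account, container, and path are all placeholders, and it assumes Parquet files. You would paste the result into a Serverless SQL pool session that has access to the storage account.

```python
def openrowset_query(account: str, container: str, path: str, top: int = 10) -> str:
    """Build a Synapse Serverless OPENROWSET query over Parquet files in ADLS.

    Hypothetical helper; it only formats text and performs no validation of
    the storage account, container, or path.
    """
    url = f"https://{account}.dfs.core.windows.net/{container}/{path.lstrip('/')}"
    return (
        f"SELECT TOP {top} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [result];"
    )


if __name__ == "__main__":
    # Placeholder names; the wildcard reads every Parquet file in the folder.
    print(openrowset_query("exampleaccount", "lake", "sales/*.parquet", top=5))
```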


External Objects in Databricks Unity Catalog

Meagan Longoria adds external tables and views to an Azure Databricks Unity Catalog:

I’ve been busy defining objects in my Unity Catalog metastore to create a secure exploratory environment for analysts and data scientists. I’ve found a lack of examples for doing this in Azure with file types other than delta (maybe you’re reading this in the future and this is no longer a problem, but it was when I wrote this). So I wanted to get some more examples out there in case it helps others.

I’m not storing any data in Databricks – I’m leaving my data in the data lake and using Unity Catalog to put a tabular schema on top of it (hence the use of external tables vs managed tables). In order to reference an ADLS account, you need to define a storage credential and an external location.

Read on for examples of what you can do with this.
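
As a hedged sketch of the last step Meagan describes, the snippet below builds the Databricks SQL for an external location and for an external table on top of it. It is not taken from her post. Every name in it (the credential, location, catalog, schema, table, and storage URL) is a placeholder, and it assumes the storage credential already exists in the metastore.

```python
def external_location_sql(name: str, url: str, credential: str) -> str:
    """Build SQL registering an ADLS path as a Unity Catalog external location."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {name}\n"
        f"URL '{url}'\n"
        f"WITH (STORAGE CREDENTIAL {credential})"
    )


def external_table_sql(full_name: str, file_format: str, url: str) -> str:
    """Build SQL for an external (unmanaged) table over existing files."""
    return (
        f"CREATE TABLE IF NOT EXISTS {full_name}\n"
        f"USING {file_format}\n"
        f"LOCATION '{url}'"
    )


if __name__ == "__main__":
    # All identifiers below are hypothetical.
    root = "abfss://lake@exampleaccount.dfs.core.windows.net/raw"
    print(external_location_sql("raw_zone", root, "example_credential"))
    print(external_table_sql("analytics.raw.orders", "PARQUET", root + "/orders"))
```

Because the table carries a LOCATION clause, dropping it later removes only the catalog entry and leaves the files in the lake.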


Capturing Event Hubs Data in Delta Lake Format with Stream Analytics

Xu Jiang announces a public preview:

The Stream Analytics no-code editor is a drag and drop design tool that helps customers to develop the Stream Analytics jobs without writing a single line of code. The experience provides a canvas that allows you to connect to input sources to quickly see your streaming data. Then you can transform and preview it before writing to your destination of choice in Azure. To learn more, see No-code stream processing through Azure Stream Analytics | Microsoft Learn.

Read on to see how you can capture and process data into Delta Lake format via their designer.


Isolated Spark Testing with lakeFS

Adi Polak demonstrates lakeFS:

This tutorial demonstrates how to build a development and testing environment for validating your logic on a full-blown production data volume and variety, working with lakeFS and Spark. You will walk through the journey of creating a repository and building a Spark application while using lakeFS capabilities. You will learn how to commit data changes, revert them in case of mistakes or other hiccups, and finally merge separate branches to reflect data changes from the isolated environments.

Not too long ago, I had a couple conversations with developers and data engineers about decentralized development and devs having their own environments and data. This seems like it would be a good approach to that common problem, and it works for Azure Synapse Analytics as well.


Data Lake Exploration in AWS with Athena for Spark

Pathik Shah and Raj Devnath jetski the data lake:

Amazon Athena now enables data analysts and data engineers to enjoy the easy-to-use, interactive, serverless experience of Athena with Apache Spark in addition to SQL. You can now use the expressive power of Python and build interactive Apache Spark applications using a simplified notebook experience on the Athena console or through Athena APIs. For interactive Spark applications, you can spend less time waiting and be more productive because Athena instantly starts running applications in less than a second. And because Athena is serverless and fully managed, analysts can run their workloads without worrying about the underlying infrastructure.

Data lakes are a common mechanism to store and analyze data because they allow companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository. Apache Spark is a popular open-source, distributed processing system optimized for fast analytics workloads against data of any size. It’s often used to explore data lakes to derive insights. For performing interactive data explorations on the data lake, you can now use the instant-on, interactive, and fully managed Apache Spark engine in Athena. It enables you to be more productive and get started quickly, spending almost no time setting up infrastructure and Spark configurations.

In this post, we show how you can use Athena for Apache Spark to explore and derive insights from your data lake hosted on Amazon Simple Storage Service (Amazon S3).

This feels a lot like the Spark pool in Azure Synapse Analytics, as well as some of what Databricks does.
