
Category: Spark

Right to Be Forgotten in Delta Lake

Milos Colic, et al., tackle a tricky problem:

With Delta, we have one more tool at our disposal to address GDPR compliance and, in particular, “the right to be forgotten” – VACUUM. The VACUUM operation removes the files that are no longer needed and that are older than a predefined retention period. The default retention period is 30 days to align with the GDPR definition of undue delay. Our earlier blog on a similar topic explains in detail how you can find and delete personal information related to a consumer by running two commands:

The part I find tricky here is how this handles “time travel” scenarios, in which you’re looking at prior iterations of the data. I haven’t run through all of the scenarios, so this is just speculation, but it seems that even with all of these changes, you’d still have to worry about historical data containing that sensitive information.
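
For context, the general shape of those two commands is something like the following, a minimal sketch with placeholder table and column names rather than the article's exact code, assuming a Delta-enabled (e.g. Databricks) Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Logically delete the consumer's records from the Delta table
# (table and column names are placeholders).
spark.sql("DELETE FROM customer_data WHERE customer_id = 'c-12345'")

# Physically remove data files that are no longer referenced by the table and
# are older than the retention window; 720 hours = 30 days, matching the
# retention period discussed in the quote above.
spark.sql("VACUUM customer_data RETAIN 720 HOURS")
```

As for the time-travel question: once VACUUM has removed the underlying files, time travel to table versions that referenced those files no longer works, so the historical copies do eventually disappear, but only after the retention period has elapsed.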


Writing a Single JSON File in Databricks

Falek Miah performs a surprisingly difficult task:

When writing to a JSON destination using the DataFrameWriter, the dataset is split into multiple files to reflect the number of RDD partitions in the DataFrame when in memory – this is the most efficient way for Spark to write data out.

However, this creates a directory containing the data files, as well as Spark metadata files…but what if you just wanted a single JSON file? It’s a scenario that comes up a lot with our clients and, despite it not being the most efficient way to use Spark, we need to implement it all the same.

Click through to see how to do this, including the removal of all metadata files (committed, started, and success files).
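
One common way to do it, sketched below with my own naming and not necessarily the article's exact approach, is to coalesce to a single partition, write to a staging directory, then copy out the lone part file and drop the directory along with the metadata files:

```python
# Assumes a Databricks environment, where `spark` and `dbutils` are provided.
df = spark.range(10)  # stand-in for the real DataFrame

tmp_dir = "dbfs:/tmp/single_json_staging"   # placeholder paths
target = "dbfs:/output/data.json"

# Force a single partition so Spark emits exactly one part file.
df.coalesce(1).write.mode("overwrite").json(tmp_dir)

# Find the single part file, copy it to the final name, then remove the staging
# directory, which also disposes of the _committed, _started, and _SUCCESS files.
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, target)
dbutils.fs.rm(tmp_dir, True)
```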


Intelligent Cache for Spark in Synapse

Avinanda Chattapadday makes an announcement:

Traditionally, when querying a file or table from your data lake, the Apache Spark engine in Synapse makes a call to your remote ADLS Gen2 storage for each read of the data. For workloads with frequent repeat queries, this process can be redundant and add latency to the overall processing time. Although Apache Spark provides a great caching feature, it must be manually set and released to minimize the latency and improve overall performance. It can also result in queries of stale data if the underlying data changes. This is where the intelligent cache in Azure Synapse can simplify the process; by automatically detecting changes to the underlying files and automatically refreshing them in the cache, you ensure you have access to the most recent data. When the cache reaches its size limit, it will automatically release the least-read data to make space for more recent data.

Click through to see how you can enable this, as well as a few more details on the process.
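
For what it's worth, the cache size is configured on the Spark pool itself in Synapse Studio, and my recollection is that it can also be toggled per session with a Spark configuration setting; treat the key below as an assumption to verify against the documentation:

```python
# Toggling the intelligent cache for the current Synapse Spark session.
# The configuration key "spark.synapse.vegas.useCache" is an assumption based on
# memory of the Synapse docs; `spark` is the session provided in the notebook.
spark.conf.set("spark.synapse.vegas.useCache", "false")  # disable for this session
spark.conf.set("spark.synapse.vegas.useCache", "true")   # re-enable it
```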


Building a Recommender in Spark

Avinash Sooriyarachchi makes a recommendation:

There has been an exponential increase in the volume and variety of data at our disposal to build recommenders and notable advances in compute and algorithms to utilize in the process. Particularly, the means to store, process and learn from image data has dramatically increased in the past several years. This allows retailers to go beyond simple collaborative filtering algorithms and utilize more complex methods, such as image classification and deep convolutional neural networks, that can take into account the visual similarity of items as an input for making recommendations. This is especially important given online shopping is a largely visual experience and many consumer goods are judged on aesthetics.

In this article, we’ll change the script and show the end-to-end process for training and deploying an image-based similarity model that can serve as the foundation for a recommender system. Furthermore, we’ll show how the underlying distributed compute available in Databricks can help scale the training process and how foundational components of the Lakehouse, Delta Lake and MLflow, can make this process simple and reproducible.

Click through for the process.
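
To make the idea concrete, here is a rough sketch of the embedding-plus-similarity core. This is my own illustration rather than the article's code; the model choice and helper names are assumptions:

```python
import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained backbone (torchvision 0.13+) with the classification head removed,
# so the network outputs a 2048-dimensional embedding per image.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> np.ndarray:
    """Return an L2-normalised embedding for one product image."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        v = backbone(x).squeeze(0).numpy()
    return v / np.linalg.norm(v)

def recommend(query_path: str, catalog: dict[str, np.ndarray], k: int = 5):
    """Rank catalog items by cosine similarity to the query image."""
    q = embed(query_path)
    scores = {item: float(v @ q) for item, v in catalog.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

To scale the embedding step across a large catalog, the same function could be wrapped in a pandas UDF and distributed on Databricks, which is where the cluster compute mentioned in the quote comes in.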


Getting Started with the Databricks Feature Store

Gavita Regunath gives us an introduction to a useful Databricks feature:

Databricks announced the launch of the Databricks Feature Store last year, in May 2021. It is the first of its kind that has been co-designed with Delta Lake and MLflow to accelerate ML deployments.

In this article, we will leverage Databricks Feature Store to store features, create a training dataset by looking up relevant features, and subsequently train an ML model. Follow this step-by-step guide to get started on Databricks Feature Store.

Click through to learn more.
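
A minimal sketch of that workflow with the databricks.feature_store client looks roughly like this; table, column, and DataFrame contents are placeholders, and the API has shifted somewhat across releases:

```python
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# Placeholder feature and label data; `spark` is the Databricks-provided session.
customer_features_df = spark.createDataFrame(
    [("c1", 120.5, 4), ("c2", 30.0, 1)],
    ["customer_id", "total_spend", "visits_last_30d"],
)
labels_df = spark.createDataFrame([("c1", 0), ("c2", 1)], ["customer_id", "churned"])

# 1. Register a feature table keyed on customer_id (names are placeholders).
fs.create_table(
    name="ml.customer_features",
    primary_keys=["customer_id"],
    df=customer_features_df,
    description="Demo customer features",
)

# 2. Build a training set by looking up features for each labelled customer.
training_set = fs.create_training_set(
    df=labels_df,
    feature_lookups=[
        FeatureLookup(
            table_name="ml.customer_features",
            feature_names=["total_spend", "visits_last_30d"],
            lookup_key="customer_id",
        )
    ],
    label="churned",
)
training_df = training_set.load_df()  # Spark DataFrame ready for model training
```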


Apache Flink and Delta Lake

Max Fisher and Dylan Gessner use Flink to load data into Delta Lake:

As with all parts of our platform, we are constantly raising the bar and adding new features to enhance developers’ abilities to build the applications that will make their Lakehouse a reality. Building real-time applications on Databricks is no exception. Features like asynchronous checkpointing, session windows, and Delta Live Tables allow organizations to build even more powerful, real-time pipelines on Databricks using Delta Lake as the foundation for all the data that flows through the Lakehouse.

However, for organizations that leverage Flink for real-time transformations, it might appear that they are unable to take advantage of some of the great Delta Lake and Databricks features, but that is not the case. In this blog we will explore how Flink developers can build pipelines to integrate their Flink applications into the broader Lakehouse architecture.

Click through for two methods of doing so.
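
For a flavour of what the Databricks side of such an integration can look like, here is my own sketch (with placeholder paths and table names, and not necessarily either of the article's two methods): the Flink job lands Parquet files in cloud storage via its file sink, and Auto Loader streams them into a Delta table.

```python
# Databricks side of a Flink-to-Lakehouse hand-off. `spark` is the session
# provided by the Databricks runtime; paths and table names are placeholders.
landing_path = "abfss://landing@mystorage.dfs.core.windows.net/flink-output/"
checkpoint_path = "dbfs:/tmp/checkpoints/flink_ingest"

# Incrementally pick up new files written by Flink.
stream = (
    spark.readStream
         .format("cloudFiles")                       # Auto Loader
         .option("cloudFiles.format", "parquet")
         .load(landing_path)
)

# Append them to a Delta table as they arrive.
(
    stream.writeStream
          .format("delta")
          .option("checkpointLocation", checkpoint_path)
          .toTable("lakehouse.flink_events")
)
```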


Run Spark within Azure ML Compute

James Nguyen makes an announcement:

Following the blog post on Turning AML compute into Ray and Dask, we’ve added a new exciting capability to run Spark within AML compute, where Spark shares the same context as your ML code. The Spark version is 3.2.1, with support for Delta Lake and Synapse SQL read/write. This enables users of AML to perform powerful data transformation, and even Spark ML, within an AML interactive notebook or in a job run.

Traditionally, Azure ML integrates with Spark Synapse or external compute services via a pipeline step or, better, via a magic command like %synapse, but the computing context is separate from your AML logic, so you still need to run Spark in a separate step, persist the output to some storage, and load it in your AML script.

With this approach, Spark is available right within your AML code, whether that’s an AML notebook, a Python script, or a pipeline step. It shares the common computing context, and in most cases you can directly convert the Spark DataFrame to a pandas or Dask DataFrame without first persisting it to intermediate storage.

I’ll have to try this out to see if it makes up for their getting rid of the Spark-based curated environments last year.
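
The appeal is easiest to see in a small sketch; the paths are placeholders, and the steps for provisioning Spark on the AML compute are in the linked post and omitted here:

```python
# Once Spark is running inside the AML job, the same script can mix Spark
# transformations with pandas-based ML code, with no intermediate write to storage.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Heavy lifting in Spark (placeholder Delta path and filter).
features = (
    spark.read.format("delta").load("/data/curated/features")
         .filter("event_date >= '2022-01-01'")
)

# Hand the (now reasonably sized) result straight to pandas-based training code.
pdf: pd.DataFrame = features.toPandas()
print(pdf.describe())
```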
