Press "Enter" to skip to content

Category: Spark

Named Entity Encryption in Spark

Arshad Ali wants to secure some data being used in a Synapse Spark pool:

As data engineers, we often get requirements to encrypt, decrypt, mask, or anonymize certain columns of data in files sitting in the data lake when preparing and transforming data with Apache Spark. The extensibility of Spark allows us to leverage a library which is not native to Spark. One such library is Microsoft Presidio, which provides fast identification and anonymization modules for private entities in text, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more. It facilitates both fully automated and semi-automated PII (Personally Identifiable Information) de-identification and anonymization flows on multiple platforms.

In this blog post, I am going to demonstrate step by step how to download and use this library to meet the above requirements with a Spark pool in Azure Synapse Analytics.

Read on to see how it works.
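
As a rough illustration of the idea (not the exact code from Arshad's walkthrough), Presidio can be wrapped in a PySpark UDF. This sketch assumes the presidio-analyzer and presidio-anonymizer packages are installed on the Spark pool, and the path and column names are placeholders:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def anonymize_text(text):
    # Engines are created inside the function so they exist on the executors;
    # in a real job you would initialize them once per partition
    # (e.g. via mapPartitions) to amortize the model-loading cost.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    if text is None:
        return None
    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()
    results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=results).text

anonymize_udf = udf(anonymize_text, StringType())

# spark is the session provided by the Synapse notebook; the path and the
# notes column are placeholders for illustration only.
df = spark.read.parquet("abfss://<container>@<account>.dfs.core.windows.net/customers")
df = df.withColumn("notes_masked", anonymize_udf("notes"))
```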

Spark Query Optimization in Synapse

Daniel Coelho lays out a few optimizations in Azure Synapse Analytics Spark pools:

The Azure Synapse Analytics team has prominent engineers enhancing and contributing back to the Apache Spark project. One of our focus areas is Spark query optimization techniques, where Microsoft has decades of experience and is making significant contributions to the Apache Spark open source engine.

The attachment at the bottom of this blog post will be presented at the 48th International Conference on Very Large Databases (#VLDB2022) and covers the latest developments in query optimization for Apache Spark 3. Those optimizations were developed by Microsoft engineers and are available today in the Azure Synapse runtime for Apache Spark versions 3.1 and 3.2.

Check out the high-level updates as well as a complete technical paper laying out the changes.

Creating Multiple Output Files per Spark Task

Dmitry Tolpeko has a quick but helpful post:

It is highly recommended that you try to evenly distribute the work among multiple tasks, so that every task produces a single output file and the job completes in parallel.

But sometimes it may still be useful for a task to generate multiple output files with a limited number of records in each file […]

I had to cut it off right there to keep from spilling the beans here. Click through for Dmitry’s post to see what setting controls records per file, allowing you to keep opening those Spark output files in Excel.
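
If you would rather experiment before clicking through, one knob in this area is the DataFrameWriter's maxRecordsPerFile option (there is also a matching spark.sql.files.maxRecordsPerFile session setting). A hedged sketch with placeholder path and threshold:

```python
# Caps the number of records Spark writes per output file, so a single task
# can produce several smaller files instead of one large one.
(df.write
   .option("maxRecordsPerFile", 100000)   # at most 100k rows per file
   .mode("overwrite")
   .parquet("/data/out/orders"))
```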

Overly Large Executors in ElasticMapReduce

Dmitry Tolpeko notes a change to Amazon ElasticMapReduce:

So 50 executors were initially requested with the required memory of 22528 MB and 4 vcores, as expected, but 9 executors were actually created with 112640 MB of memory and 20 cores, which is 5x larger. It should have created 10 executors, but my cluster does not have the resources to run more containers.

Note: the second log row specifies allocated vCores:5; this is because my YARN cluster uses DefaultResourceCalculator, which ignores CPU and considers only the memory resource. Do not pay attention to this; the Spark executor will still use 20 cores, as reported in the third log record above.

Click through for the reason.
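
For context, the kind of request described in the quote would be configured roughly like this; the split of the 22528 MB container between executor memory and overhead is my assumption, not something stated in the post:

```python
from pyspark.sql import SparkSession

# Request 50 executors with 4 cores each and a 22528 MB YARN container
# (20 GB executor memory plus 2 GB overhead, assumed here).
spark = (
    SparkSession.builder
        .appName("executor-sizing-example")
        .config("spark.executor.instances", "50")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "20g")
        .config("spark.executor.memoryOverhead", "2g")
        .getOrCreate()
)
```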

Watermarking in Spark Structured Streaming

Max Fisher takes us through an important feature for Spark streaming:

When building real-time pipelines, one of the realities that teams have to work with is that distributed data ingestion is inherently unordered. Additionally, in the context of stateful streaming operations, teams need to be able to properly track event time progress in the stream of data they are ingesting for the proper calculation of time-window aggregations and other stateful operations. We can solve for all of this using Structured Streaming.

For example, let’s say we are a team working on building a pipeline to help our company do proactive maintenance on our mining machines that we lease to our customers. These machines always need to be running in top condition so we monitor them in real-time. We will need to perform stateful aggregations on the streaming data to understand and identify problems in the machines.

This is where we need to leverage Structured Streaming and Watermarking to produce the necessary stateful aggregations that will help inform decisions around predictive maintenance and more for these machines.

Read on to see how watermarking works in various scenarios, including when you join together streams.
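
As a minimal sketch of the idea (using made-up column names and a rate source rather than the article's data), a watermark plus a windowed aggregation in PySpark looks like this:

```python
from pyspark.sql.functions import avg, col, window

# Stand-in streaming source; machine_id and temperature are derived columns
# used purely for illustration.
readings = (
    spark.readStream.format("rate").load()
        .withColumnRenamed("timestamp", "event_time")
        .withColumn("machine_id", col("value") % 10)
        .withColumn("temperature", col("value") * 1.5)
)

agg = (
    readings
        .withWatermark("event_time", "10 minutes")   # tolerate 10 minutes of lateness
        .groupBy(window(col("event_time"), "5 minutes"), col("machine_id"))
        .agg(avg("temperature").alias("avg_temperature"))
)

query = (
    agg.writeStream
        .outputMode("append")   # append emits a window once the watermark passes it
        .format("console")
        .start()
)
```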

Useful Design Patterns for Apache Spark Projects

Alexander Eleseev applies some design patterns:

When I participated in a big data project, I needed to program Spark applications to move and transform data from/to relational and distributed databases, like Apache Hive. I found that such applications have a number of pitfalls, so problems like “hard to read code” and “a method too large to fit on a single screen” need to be avoided so we can focus on deeper issues. Also, Spark jobs are similar: data is loaded from one or more databases, gets transformed, then is saved to one or more databases. So it seems reasonable to try to use GoF patterns to program Spark applications.

Specifically, this covers Spark code written in Java (or Python). I’d argue that Scala-based code would profit by following a different set of functional patterns rather than Gang of Four object-oriented design patterns.
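
For instance, one GoF pattern that maps naturally onto the load-transform-save shape is Template Method. A hedged sketch in PySpark, with hypothetical table names:

```python
from abc import ABC, abstractmethod
from pyspark.sql import DataFrame, SparkSession

class SparkJob(ABC):
    """Template Method: run() fixes the extract -> transform -> load skeleton."""

    def __init__(self, spark: SparkSession):
        self.spark = spark

    def run(self) -> None:
        self.load(self.transform(self.extract()))

    @abstractmethod
    def extract(self) -> DataFrame: ...

    @abstractmethod
    def transform(self, df: DataFrame) -> DataFrame: ...

    @abstractmethod
    def load(self, df: DataFrame) -> None: ...

class OrdersJob(SparkJob):
    """One concrete job; the table names are hypothetical."""

    def extract(self) -> DataFrame:
        return self.spark.table("raw.orders")

    def transform(self, df: DataFrame) -> DataFrame:
        return df.filter("amount > 0")

    def load(self, df: DataFrame) -> None:
        df.write.mode("overwrite").saveAsTable("curated.orders")
```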

Anomaly Detection over Delta Live Tables

Avinash Sooriyarachchi and Sathish Gangichetty show off an interesting scenario:

Anomaly detection poses several challenges. The first is the data science question of what an ‘anomaly’ looks like. Fortunately, machine learning has powerful tools to learn how to distinguish usual from anomalous patterns in data. In the case of anomaly detection, it is impossible to know what all anomalies look like, so it’s impossible to label a data set for training a machine learning model, even if resources for doing so are available. Thus, unsupervised learning has to be used to detect anomalies, where patterns are learned from unlabelled data.

Even with the perfect unsupervised machine learning model for anomaly detection figured out, in many ways, the real problems have only begun. What is the best way to put this model into production such that each observation is ingested, transformed and finally scored with the model, as soon as the data arrives from the source system? That too, in a near real-time manner or at short intervals, e.g. every 5-10 minutes? This involves building a sophisticated extract, load, and transform (ELT) pipeline and integrating it with an unsupervised machine learning model that can correctly identify anomalous records. Also, this end-to-end pipeline has to be production-grade, always running while ensuring data quality from ingestion to model inference, and the underlying infrastructure has to be maintained.

Click through to see their solution using Databricks and delta lake.
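
As a simplified sketch of the shape of such a pipeline (not the authors' code), a Delta Live Tables table can apply a scoring function to a streaming source. This assumes it runs inside a DLT pipeline; the table names are hypothetical, and a simple z-score stands in for a real unsupervised model such as an isolation forest:

```python
import dlt
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def anomaly_score(amount: pd.Series) -> pd.Series:
    # Placeholder scoring: per-batch z-score; a real pipeline would load a
    # pre-trained unsupervised model and call its decision function here.
    std = amount.std()
    return (amount - amount.mean()).abs() / (std if std else 1.0)

@dlt.table(comment="Streaming records scored for anomalies")
def scored_events():
    # raw_events is a hypothetical upstream DLT table.
    return dlt.read_stream("raw_events").withColumn("score", anomaly_score("amount"))
```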

Creating Identity Columns in Databricks

Franco Patano generates some identity integers:

Identity columns solve the issues mentioned above and provide a simple, performant solution for generating surrogate keys. Delta Lake is the first data lake protocol to enable identity columns for surrogate key generation.

Delta Lake now supports creating IDENTITY columns that can automatically generate unique, auto-incrementing ID numbers when new rows are loaded. While these ID numbers may not be consecutive, Delta makes the best effort to keep the gap as small as possible. You can use this feature to create surrogate keys for your data warehousing workloads easily.

This is a bit light on explanation, unfortunately. With distributed systems, generating identities is historically tricky (especially with several independent nodes generating values) so I’d be curious to see how it works: do they allocate blocks of IDs to worker nodes or do something else? And are the IDs guaranteed to be monotonically increasing? Or is there some other service which “labels” the data upon insert and provides those IDs?
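
For reference, the surface syntax itself is a one-liner in the table DDL; a minimal sketch with placeholder table and column names:

```python
# Creates a Delta table whose surrogate key is populated automatically
# using the GENERATED ... AS IDENTITY identity column syntax.
spark.sql("""
    CREATE TABLE dim_customer (
        customer_sk BIGINT GENERATED ALWAYS AS IDENTITY,
        customer_name STRING
    ) USING DELTA
""")
```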

Serverless Compute for Databricks SQL

Nikhil Jethava and Shankar Sivadasan make an announcement:

We are excited to announce the preview of Serverless compute for Databricks SQL (DBSQL) on Azure Databricks. DBSQL Serverless makes it easy to get started with data warehousing on the lakehouse. Serverless compute for DBSQL helps address challenges customers face with cluster startup time, capacity management, and infrastructure costs […]

Click through for more details and a short video. Azure Synapse Analytics and Databricks are definitely going head-to-head in the modern data warehousing space and I’m fine with that—hopefully it makes both products better as a result.

Security Practices for Delta Sharing

Andrew Weaver, et al, share some advice:

When you enable Delta Sharing, you configure the token lifetime for recipient credentials. If you set the token lifetime to 0, recipient tokens never expire.

Setting the appropriate token lifetime is critically important from a regulatory, compliance, and reputational standpoint. Having a token that never expires is a huge risk; therefore, using short-lived tokens is recommended as a best practice. It is far easier to grant a new token to a recipient whose token has expired than it is to investigate the use of a token whose lifetime has been improperly set.

Click through for eight such tips.
