Press "Enter" to skip to content

Category: Spark

Spark RDD Transformations

Meenakshi Goyal walks us through the transformation functions available to you when using a Spark RDD:

The role of transformation in Spark is to create a new dataset from an existing one. Lazy transformations are those that are computed only when an action requires a result to be returned to the driver programme.

When we call an action, transformations are executed since they are inherently lazy. Not right away are they carried out. There are two primary types of transformations: map() and filter ().
The outcome RDD is always distinct from the parent RDD after the transformation. It could be smaller (filter, count, distinct, sample, for example), bigger (flatMap(), union(), Cartesian()), or the same size (e.g. map).

Read on to learn more about transformations, including examples of how each works. Even if you’re using the DataFrames API for Spark, it’s still important to understand that transformations are lazy.

Comments closed

REST APIs for Synapse Spark Pools

Abid Nazir Guroo looks at some endpoints:

Azure Synapse Analytics Representational State Transfer (REST) APIs are secure HTTP service endpoints that support creating and managing Azure Synapse resources using Azure Resource Manager and Azure Synapse web endpoints. This article provides instructions on how to setup and use Synapse REST endpoints and describe the Apache Spark Pool operations supported by REST APIs.

Read on to see some of the Spark pool management options are available to you via the REST API.

Comments closed

Time Travel with Delta Tables in Synapse

Liliam Leme reverses the clock:

Scenario

While working with a customer, they had a requirement to restore modified files to a specific point in time. They had built their architecture on top of a Data lake.

Looking for options

While working on this scenario, we explored some storage options available without any side customization, for example, Soft delete for blobs – Azure Storage | Microsoft Docs.

Read on to see what they landed on.

Comments closed

Azure Synapse Analytics R Language Support

Ryan Majidimehr has a short list of updates for Azure Synapse Analytics but it includes a big one:

Azure Synapse Analytics provides built-in R support for Apache Spark. As part of this, data scientists can leverage Azure Synapse Analytics notebooks to write and run their R code. This also includes support for SparkR and SparklyR, which allows users to interact with Spark using familiar Spark or R interfaces. To learn more read the official how-to Use R for Apache Spark with Azure Synapse Analytics (Preview).

That it took this long for R support was a bit weird, but I’m glad it’s there now.

Comments closed

Choosing between Synapse Spark Notebooks or Job Definitions

Arun Sethia and Arshad Ali explain when you might use a Spark notebook versus a job definition:

Synapse Spark Notebook is a web-based (HTTP/HTTPS) interactive interface to create files that contain live code, narrative text, and visualizes output with rich libraries for spark based applications. Data engineers can collaborate, schedule, run, and test their spark application code using Notebooks. Notebooks are a good place to validate ideas and do quick experiments to get insight into the data. You can integrate the Synapse Notebook into Synapse pipeline.

The Notebook allows you to combine programming code with markdown text and perform simple visualizations (using Synapse Notebook chart options and open-source libraries). In addition, running code will supply immediate feedback, output, and progress tracking within Notebook.

Click through for the comparison.

Comments closed

Transferring Data between Dedicated SQL and Spark Pools in Synapse

Sidney Cirqueria shows off a connector available to us in Azure Synapse Analytics:

Usually, customers do this kind of operation using Synapse Apache Spark to load data to Dedicated Pool within Azure Synapse Workspace, but today, I would like to reproduce a different scenario that I was working on one of my support cases.  Consider a scenario where you are trying to load data from Synapse Spark to Dedicated pool (formerly SQL DW) using Synapse Pipelines, and additionally you are using Synapse Workspace deployed with Managed Virtual Network.

The intention of this guide is to help you with which configuration will be required if you need to load data from Azure Synapse Apache Spark to Dedicated SQL Pool (formerly SQL DW). If you prefer take advantage of the new feature-rich capabilities now available via the Synapse workspace and Studio and load data directly from Azure Apache Spark to Dedicated Pool in Azure Synapse Workspace is recommended that you enable Synapse workspace features on an existing dedicated SQL pool (formerly SQL DW).

Read on for a few tips a nd a step-by-step walkthrough of the process.

Comments closed

Named Entity Encryption in Spark

Arshad Ali wants to secure some data being used in a Synapse Spark pool:

As a data engineer, we often get requirements to encrypt, decrypt, mask, or anonymize certain columns of data in files sitting in the data lake when preparing and transforming data with Apache Spark. The extensibility feature of Spark allows us to leverage a library which is not native to Spark. One such library is Microsoft Presidio, which provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more. It facilitates both fully automated and semi-automated PII (Personal Identifiable Information) de-identification and anonymization flows on multiple platforms.

In this blog post, I am going to demonstrate step by step how to download and use this library to meet the above requirements with Spark pool of Azure Synapse Analytics.

Read on to see how it works.

Comments closed

Spark Query Optimization in Synapse

Daniel Coelho lays out a few optimizations in Azure Synapse Analytics Spark pools:

The Azure Synapse Analytics team has prominent engineers enhancing and contributing back to the Apache Spark project. One of our focus areas is Spark query optimization techniques, where Microsoft has decades of experience and is making significant contributions to the Apache Spark open source engine.

The attachment at the bottom of this blog post will be presented at the 48th International Conference on Very Large Databases (#VLDB2022) and covers the latest developments in query optimization for Apache Spark 3. Those optimizations were developed by Microsoft engineers and are available today in the Azure Synapse runtime for Apache Spark versions 3.1 and 3.2.

Check out the high-level updates as well as a complete technical paper laying out the changes.

Comments closed

Creating Multiple Output Files per Spark Task

Dmitry Tolpeko has a quick but helpful post:

It is highly recommended that you try to evenly distribute the work among multiple tasks so every task produces a single output file and job is completed in parallel.

But sometimes it still may be useful when a task generates multiple output files with the limited number of records in each file […]

I had to cut it off right there to keep from spilling the beans here. Click through for Dmitry’s post to see what setting controls records per file, allowing you to keep opening those Spark output files in Excel.

Comments closed