Press "Enter" to skip to content

Category: Spark

Dotnet-Spark UDFs and Missing Shared State

Ed Elliott uncovers a mystery:

To understand this, we need to take a look at how we can create a UDF in .NET that is called by the Java VM-based Apache Spark code, because that is, logically, what happens. In our application we call into Apache Spark and ask it to do things like read from a file, run some transformations, and write files back out again. With UDFs, we ask Spark to run a UDF, and Spark comes back to our UDF, passing it some data and asking it to execute, but the Java VM does not understand how to execute .NET code.

Read the whole thing.
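If you have only ever used Python UDFs, the callback pattern Ed describes will look familiar. Here is a minimal PySpark sketch of the same idea: the JVM ships rows out to an external worker process (Python here, the .NET worker in Ed's case) and reads the results back. The data and column names below are illustrative, not taken from Ed's post.

```python
# Minimal PySpark analogue of the callback pattern: Spark (on the JVM) cannot
# run our code directly, so it serializes rows to an external worker process
# and collects the results. In dotnet-spark the worker is a .NET process.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

@udf(returnType=StringType())
def shout(value):
    # This function body executes in the worker process, not inside the JVM.
    return None if value is None else value.upper()

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
df.select(shout("name").alias("name_upper")).show()
```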

Comments closed

Using Key Vault in Azure Databricks

Arun Sirpal shows us how easy it is to tie Azure Key Vault into Azure Databricks:

The key vault should always be a core component of your Azure design because we can store keys, secrets, and certificates, and thus abstract/hide the true connection string within files. When working with Databricks to mount storage to ingest your data and query it, ideally you should be leveraging this to create secrets and secret scopes.

Click through for a demo.
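As a quick illustration of the pattern (not Arun's exact demo), here is a hypothetical Databricks notebook snippet that pulls a storage key from a Key Vault-backed secret scope instead of hard-coding it; the scope, key, container, and account names are placeholders, and dbutils/spark are provided by the notebook runtime.

```python
# Hypothetical Databricks notebook sketch: fetch a storage key from a
# Key Vault-backed secret scope and use it to read from ADLS Gen2.
storage_account = "mystorageaccount"   # placeholder
access_key = dbutils.secrets.get(scope="kv-backed-scope", key="storage-access-key")

# Configure the session to authenticate with the secret, never a literal key.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    access_key,
)

# Read directly from the lake using that configuration.
df = spark.read.csv(
    f"abfss://raw@{storage_account}.dfs.core.windows.net/sales/",
    header=True,
)
df.show(5)
```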

Comments closed

Outlier Identification Using Spark 3.0

Tori Tompkins takes us through principles of anomaly detection in Apache Spark 3.0:

To calculate Median Absolute Deviation (MAD) you need to calculate the difference between the value and the median. In simpler terms, you will need to calculate the median of the entire dataset, the difference between each value and this median, then take another median of all the differences.

In Spark you can use the SQL expression ‘percentile()’ to calculate any medians or quartiles in a dataframe. ‘percentile()’ expects a column and an array of percentiles to calculate (for the median we can provide ‘array(0.5)’ because we want the 50% value, i.e. the median) and will return an array of results.

Like standard deviation, to use MAD to identify the outliers a value needs to be a certain number of MADs away. This number is also referred to as the threshold and defaults to 3.

Read on for three measures and their implementations in PySpark.
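To give a flavor of the MAD approach in PySpark, here is a sketch based on Tori's description, with made-up data, a column named value, and the default threshold of 3 assumed.

```python
# Median Absolute Deviation outlier flagging using Spark's percentile() SQL expression.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("mad-outliers").getOrCreate()
df = spark.createDataFrame([(1.0,), (1.2,), (0.9,), (1.1,), (15.0,)], ["value"])

# Median of the whole column (50th percentile); percentile() returns an array.
median = df.select(F.expr("percentile(value, array(0.5))")[0].alias("m")).first()["m"]

# MAD = median of the absolute differences from that median.
mad = (
    df.withColumn("abs_diff", F.abs(F.col("value") - F.lit(median)))
      .select(F.expr("percentile(abs_diff, array(0.5))")[0].alias("mad"))
      .first()["mad"]
)

# Flag anything more than 3 MADs away from the median as an outlier.
threshold = 3
outliers = df.filter(F.abs(F.col("value") - F.lit(median)) > threshold * mad)
outliers.show()
```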

Comments closed

Features and Improvements in Spark 3.0

Manoj Pandey summarizes some of the improvements in Apache Spark 3.0:

With the Spark 3.0 release (in June 2020) there are some major improvements over the previous releases. Some of the main and exciting features for Spark SQL & Scala developers are AQE (Adaptive Query Execution), Dynamic Partition Pruning, and other performance optimizations and enhancements.

Below I’ve listed out these new features and enhancements all together in one page for better understanding and future reference.

Click through for the summary.
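As a quick reference, here is a sketch of enabling a couple of the headline settings. Note that in Spark 3.0, AQE is off by default while dynamic partition pruning is on by default; the settings are shown explicitly for clarity.

```python
# Turning on Adaptive Query Execution and dynamic partition pruning in Spark 3.0.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark3-features")
    .config("spark.sql.adaptive.enabled", "true")                     # AQE (off by default in 3.0)
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # coalesce shuffle partitions
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")  # on by default
    .getOrCreate()
)

print(spark.conf.get("spark.sql.adaptive.enabled"))
```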

Comments closed

Adaptive Query Execution in Databricks

MaryAnn Xue and Allison Wang explain how Adaptive Query Execution works with Databricks:

One of the most important cost-based decisions made in the Spark optimizer is the selection of join strategies, which is based on the size estimation of the join relations. But since this estimation can go wrong in both directions, it can either result in a less efficient join strategy because of overestimation, or even worse, out-of-memory errors because of underestimation.

AQE offers a trouble-free solution here by switching to the faster broadcast hash join during execution time.

This is pretty similar to Adaptive Query Processing in SQL Server.
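For context, this is roughly the manual broadcast hint that AQE can now apply on its own at runtime based on actual sizes; the tables and columns below are made up.

```python
# Manual broadcast hint vs. letting AQE pick the join strategy at runtime.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("aqe-join-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

orders = spark.createDataFrame([(1, "US"), (2, "DE"), (3, "US")], ["order_id", "country"])
countries = spark.createDataFrame([("US", "United States"), ("DE", "Germany")], ["country", "name"])

# Explicit hint: forces the small side to be broadcast.
hinted = orders.join(broadcast(countries), "country")
hinted.explain()

# Plain join: with AQE enabled, Spark may convert this to a broadcast hash
# join at execution time if `countries` turns out to be small enough.
plain = orders.join(countries, "country")
plain.explain()
```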

Comments closed

The Spark Starter Guide

Landon Robinson has some good news for us:

If you visit hadoopsters.com/spark or thesparkguide.com, you’ll see something new and exciting from us. It’s official: we’ve written and are publishing a comprehensive guide to Apache Spark.

This guide will be completely online and completely free. A book’s worth of content, containing exercises in Python and Scala to teach you Spark, at your fingertips. Again, free.

Landon has posted chapter 1, section 1 already:

This section introduces the concept of data pipelines – how data is processed from one form into another. It’s also the generic term used to describe how data moves from one location or form, and is consumed, altered, transformed, and delivered to another location or form.

You’ll be introduced to Spark functions like join, filter, and aggregate to process data in a variety of forms. You’ll learn it all through interactive Spark exercises in Scala and Python.

This is very early in the process but I’m excited.
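In that spirit, here is a tiny, made-up PySpark pipeline that filters, joins, and aggregates: the same building blocks the guide starts with.

```python
# Toy pipeline: filter, join, and aggregate some sample data.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

sales = spark.createDataFrame(
    [("s1", "widget", 3), ("s2", "widget", 5), ("s3", "gadget", 2)],
    ["sale_id", "product", "quantity"],
)
products = spark.createDataFrame(
    [("widget", 2.50), ("gadget", 10.00)],
    ["product", "price"],
)

result = (
    sales.filter(F.col("quantity") > 2)   # keep larger sales
         .join(products, "product")       # enrich with prices
         .groupBy("product")
         .agg(F.sum(F.col("quantity") * F.col("price")).alias("revenue"))
)
result.show()
```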

Comments closed

The Main Components of Apache Spark

Manoj Pandey walks us through the key components in Apache Spark:

1. Spark Driver:

– The Driver program can run various operations in parallel on a Spark cluster.

– It is responsible for communicating with the Cluster Manager to allocate resources for launching Spark Executors.

– In parallel, it instantiates a SparkSession for the Spark Application.

– The Driver program splits the Spark Application into one or more Spark Jobs, and each Job is transformed into a DAG (Directed Acyclic Graph, aka Spark execution plan). Each DAG internally has various Stages based upon different operations to perform, and finally each Stage gets divided into multiple Tasks such that each Task maps to a single partition of data.

– Once the Cluster Manager allocates resources, the Driver program works directly with the Executors by assigning them Tasks.
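As a rough sketch of that driver-side flow, the snippet below creates a SparkSession and runs a single action; the application name and the local master setting are placeholders for a real cluster manager.

```python
# The driver program creates the SparkSession; each action becomes a job
# that the driver breaks into stages and tasks for the executors.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-demo")
    .master("local[4]")   # placeholder; a real cluster manager URL goes here
    .getOrCreate()
)

df = spark.range(1_000_000)                    # transformation: builds the DAG lazily
total = df.selectExpr("sum(id)").first()[0]    # action: driver schedules tasks on executors
print(total)
```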
 

Click through for additional elements and how they fit together.

Comments closed

Optical Character Recognition with Tesseract and Databricks

Alex Aleksandrov takes a look at optical character recognition with the Tesseract library:

The topic of Optical Character Recognition (OCR) is not an unexplored field for the Adatis audience. Some Adati like Kalina Ivanova (link 1, link 2) and Francesco Sbrescia (link 3) have already explored this topic from the perspective of Azure Cognitive Services and Azure Data Lake. In my first blog, I would like to explore this topic from a different perspective: using Tesseract and Databricks.

Click through for instructions.

Comments closed

Querying Multiple Data Sources in Azure Synapse Analytics

James Serra walks us through querying Data Lake Storage Gen2, Cosmos DB, and a table created in an Azure Synapse serverless Apache Spark pool:

As I was finishing up a demo script for my presentation at the SQL PASS Virtual Summit on 11/13 (details on my session here), I wanted to blog about part of the demo that shows a feature in the public preview of Synapse that is, frankly, very cool. It is the ability to query data as it sits in ADLS Gen2, a Spark table, and Cosmos DB and join the data together with one T-SQL statement using SQL on-demand (also called SQL serverless), hence making it a federated query (also known as data virtualization). The beauty of this is you don’t have to first write ETL to collect all the data into a relational database in order to be able to query it all together, and you don’t have to provision a SQL pool, saving costs. Further, you are using T-SQL to query all of those data sources, so you are able to use a reporting tool like Power BI to see the results.

Click through to see how.
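If you want to poke at the serverless endpoint from Python rather than a reporting tool, something like the following pyodbc sketch covers the ADLS Gen2 piece of the demo; the server, database, credentials, and storage path are all placeholders, and the Cosmos DB and Spark-table joins are left to James's post.

```python
# Query the Synapse serverless (SQL on-demand) endpoint from Python via pyodbc.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"  # placeholder workspace endpoint
    "Database=mydb;"
    "UID=myuser;PWD=mypassword;"  # placeholder; Azure AD authentication is also an option
)

# OPENROWSET reads the Parquet files in place, no ETL or provisioned pool needed.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""

for row in conn.cursor().execute(query):
    print(row)
```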

Comments closed