
Category: Spark

Adaptive Query Execution in Databricks

MaryAnn Xue and Allison Wang explain how Adaptive Query Execution works with Databricks:

One of the most important cost-based decisions made in the Spark optimizer is the selection of join strategies, which is based on the size estimation of the join relations. But since this estimation can go wrong in both directions, it can either result in a less efficient join strategy because of overestimation, or even worse, out-of-memory errors because of underestimation.

AQE offers a trouble-free solution here by switching to the faster broadcast hash join during execution time.

This is pretty similar to Adaptive Query Processing in SQL Server.
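The runtime switch described in the excerpt can be sketched as a toy model in plain Python. This is not Spark's optimizer code: the `JoinPlan` class and `choose_join` function are hypothetical names used only for illustration. The 10 MB figure mirrors Spark's documented default for `spark.sql.autoBroadcastJoinThreshold`, and in a real session AQE is controlled by `spark.sql.adaptive.enabled`.

```python
from dataclasses import dataclass

# Mirrors Spark's documented default for spark.sql.autoBroadcastJoinThreshold (10 MB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(smaller_side_bytes: int,
                threshold: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a join strategy from the size of the smaller join relation."""
    if smaller_side_bytes <= threshold:
        return "broadcast_hash_join"
    return "sort_merge_join"


@dataclass
class JoinPlan:
    """Hypothetical stand-in for a planned join (not a Spark class)."""
    estimated_bytes: int  # the planner's guess before execution
    actual_bytes: int     # the size observed once the upstream stage has run

    def static_choice(self) -> str:
        # Without AQE, the strategy is fixed from the estimate.
        return choose_join(self.estimated_bytes)

    def adaptive_choice(self) -> str:
        # With AQE, the strategy is re-planned from runtime statistics.
        return choose_join(self.actual_bytes)


# Overestimation: the planner expected 500 MB, but only 2 MB survived the filters.
plan = JoinPlan(estimated_bytes=500 * 1024 * 1024,
                actual_bytes=2 * 1024 * 1024)
print(plan.static_choice())
print(plan.adaptive_choice())
```

In this sketch the static plan stays on the slower sort-merge join, while the adaptive plan switches to a broadcast hash join once the real size is known.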


The Spark Starter Guide

Landon Robinson has some good news for us:

If you visit hadoopsters.com/spark or thesparkguide.com, you’ll see something new and exciting from us. It’s official: we’ve written and are publishing a comprehensive guide to Apache Spark.

This guide will be completely online and completely free. A book’s worth of content, containing exercises in Python and Scala to teach you Spark, at your fingertips. Again, free.

Landon has posted chapter 1, section 1 already:

This section introduces the concept of data pipelines – how data is processed from one form into another. It’s also the generic term used to describe how data moves from one location or form, and is consumed, altered, transformed, and delivered to another location or form.

You’ll be introduced to Spark functions like join, filter, and aggregate to process data in a variety of forms. You’ll learn it all through interactive Spark exercises in Scala and Python.

This is very early in the process but I’m excited.


The Main Components of Apache Spark

Manoj Pandey walks us through the key components in Apache Spark:

1. Spark Driver:

– The Driver program can run various operations in parallel on a Spark cluster.

– It is responsible for communicating with the Cluster Manager to allocate resources for launching Spark Executors.

– In parallel, it instantiates the SparkSession for the Spark Application.

– The Driver program splits the Spark Application into one or more Spark Jobs, and each Job is transformed into a DAG (Directed Acyclic Graph, aka Spark execution plan). Each DAG internally has various Stages based upon different operations to perform, and finally each Stage gets divided into multiple Tasks such that each Task maps to a single partition of data.

– Once the Cluster Manager allocates resources, the Driver program works directly with the Executors by assigning them Tasks.
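
The "one Task per partition" idea from the list above can be sketched in plain Python. This is a toy model, not Spark code: `split_into_partitions` and `run_stage` are hypothetical helpers, and the tasks run one after another here, whereas Spark would schedule them in parallel on Executors.

```python
from typing import Callable, List


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Divide the data into contiguous partitions of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_stage(partitions: List[List[int]],
              task: Callable[[List[int]], int]) -> List[int]:
    """Run one task per partition and collect the per-partition results."""
    return [task(partition) for partition in partitions]


data = list(range(1, 11))
partitions = split_into_partitions(data, 3)

# Stage 1: three tasks, each summing a single partition.
partial_sums = run_stage(partitions, sum)

# Stage 2: combine the partial results into the final answer.
total = sum(partial_sums)
print(partitions, partial_sums, total)
```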
 

Click through for additional elements and how they fit together.


Optical Character Recognition with Tesseract and Databricks

Alex Aleksandrov takes a look at optical character recognition with the Tesseract library:

The topic of Optical Character Recognition (OCR) is not an unexplored field to the Adatis audience. Some Adati like Kalina Ivanova (link1, link2) and Francesco Sbrescia (link3) have already explored this topic from the perspective of Azure Cognitive Services and Azure Data Lake. In my first blog, I would like to explore this topic from a different perspective: using Tesseract and Databricks.

Click through for instructions.


Querying Multiple Data Sources in Azure Synapse Analytics

James Serra walks us through querying Data Lake Storage Gen2, Cosmos DB, and a table created in an Azure Synapse serverless Apache Spark pool:

As I was finishing up a demo script for my presentation at the SQL PASS Virtual Summit on 11/13 (details on my session here), I wanted to blog about part of the demo that shows a feature in the public preview of Synapse that is, frankly, very cool. It is the ability to query data as it sits in ADLS Gen2, a Spark table, and Cosmos DB and join the data together with one T-SQL statement using SQL on-demand (also called SQL serverless), hence making it a federated query (also known as data virtualization). The beauty of this is you don’t have to first write ETL to collect all the data into a relational database in order to be able to query it all together, and don’t have to provision a SQL pool, saving costs. Further, you are using T-SQL to query all of those data sources so you are able to use a reporting tool like Power BI to see the results.

Click through to see how.


Spark Infer Schema vs ADF Get Metadata

Paul Andrew compares two techniques for retrieving metadata:

For file types that don’t contain their own metadata (CSV, Text, etc.) we typically have to go and figure out their structure, including attributes and data types, before doing any actual transformation work. Often I’ve used the Data Factory Metadata Activity to do this with its structure option. However, while playing around with Azure Synapse Analytics, specifically creating Notebooks in C# to run against the Apache Spark compute pools, I’ve discovered that in most cases the Data Frame infer schema option basically does a better job here.

Now, I’m sure some Spark people will probably read the above and think, well der, obviously Paul! Spark is better than Data Factory. And sure, I accept for this specific situation it certainly is. I’m simply calling that out as it might not be obvious to everyone.
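
To give a feel for what schema inference involves, here is a minimal sketch in plain Python. It is not how Spark implements the CSV reader's `inferSchema` option; the `infer_type` and `infer_schema` helpers are hypothetical, and the rule of trying int, then double, then falling back to string is a simplifying assumption.

```python
import csv
import io
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Return the narrowest of int, double, string that fits every value."""
    for type_name, parser in (("int", int), ("double", float)):
        try:
            for value in values:
                parser(value)
            return type_name
        except ValueError:
            continue  # at least one value did not parse; try a wider type
    return "string"


def infer_schema(csv_text: str) -> Dict[str, str]:
    """Read a CSV with a header row and guess a type for each column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {name: infer_type([row[name] for row in rows])
            for name in reader.fieldnames}


# Made-up sample data for illustration only.
sample = "id,price,city\n1,9.99,Leeds\n2,12.5,York\n"
schema = infer_schema(sample)
print(schema)
```

The sketch scans every row, which hints at why inference costs an extra pass over the data compared with supplying a schema up front.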

Read on for a comparison of the two techniques.


MLOps with Azure Databricks and MLflow

Oliver Koernig walks us through some of the basics of MLOps using MLflow and Azure Databricks:

Most organizations today have a defined process to promote code (e.g. Java or Python) from development to QA/Test and production. Many are using Continuous Integration and/or Continuous Delivery (CI/CD) processes and oftentimes are using tools such as Azure DevOps or Jenkins to help with that process. Databricks has provided many resources to detail how the Databricks Unified Analytics Platform can be integrated with these tools (see Azure DevOps Integration, Jenkins Integration). In addition, there is a Databricks Labs project – CI/CD Templates – as well as a related blog post that provides automated templates for GitHub Actions and Azure DevOps, which makes the integration much easier and faster.

When it comes to machine learning, though, most organizations do not have the same kind of disciplined process in place.

Read on for a demonstration of the process.


Measuring Advertising Effectiveness

Layla Yang and Hector Leano walk us through measuring how effective an advertising campaign was:

At a high level we are connecting a time series of regional sales to regional offline and online ad impressions over the trailing thirty days. By using ML to compare the different kinds of measurements (TV impressions or GRPs versus digital banner clicks versus social likes) across all regions, we then correlate the type of engagement to incremental regional sales in order to build attribution and forecasting models. The challenge comes in merging advertising KPIs such as impressions, clicks, and page views from different data sources with different schemas (e.g., one source might use day parts to measure impressions while another uses exact time and date; location might be by zip code in one source and by metropolitan area in another).

As an example, we are using a SafeGraph rich dataset for foot traffic data to restaurants from the same chain. While we are using mocked offline store visits for this example, you can just as easily plug in offline and online sales data provided you have region and date included in your sales data. We will read in different locations’ in-store visit data, explore the data in PySpark and Spark SQL, and make the data clean, reliable and analytics ready for the ML task. For this example, the marketing team wants to find out which of the online media channels is the most effective channel to drive in-store visits.
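
The schema-merging challenge mentioned in the excerpt (day parts versus exact timestamps, zip codes versus metropolitan areas) can be illustrated with a small plain-Python sketch. Everything here is hypothetical and not taken from the article's notebooks: the day-part boundaries, the zip-to-metro lookup, the field names, and the sample records are all made up for illustration.

```python
from datetime import datetime
from typing import Dict, Tuple

# Hypothetical day-part boundaries: (start hour inclusive, end hour exclusive, label).
DAY_PARTS = [(0, 6, "overnight"), (6, 12, "morning"),
             (12, 18, "afternoon"), (18, 24, "evening")]

# Hypothetical zip code to metropolitan area lookup.
ZIP_TO_METRO = {"10001": "New York", "94105": "San Francisco"}


def day_part(ts: datetime) -> str:
    """Map an exact timestamp onto a coarse day part."""
    for start, end, label in DAY_PARTS:
        if start <= ts.hour < end:
            return label
    raise ValueError(f"hour out of range: {ts.hour}")


def normalise_digital(record: Dict) -> Dict:
    """Convert a zip + timestamp record to the metro + date + day-part schema."""
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "metro": ZIP_TO_METRO[record["zip"]],
        "date": ts.date().isoformat(),
        "day_part": day_part(ts),
        "impressions": record["impressions"],
    }


def join_key(record: Dict) -> Tuple[str, str, str]:
    """Key on which the two sources can be merged once normalised."""
    return (record["metro"], record["date"], record["day_part"])


# Made-up sample records, one per source.
tv = {"metro": "New York", "date": "2020-10-01",
      "day_part": "evening", "impressions": 1200}
digital = {"zip": "10001", "timestamp": "2020-10-01T19:30:00",
           "impressions": 300}

normalised = normalise_digital(digital)
print(join_key(tv) == join_key(normalised))
```

Once both sources share the same key, they can be joined or aggregated together, which is the kind of cleanup the excerpt describes doing before the ML step.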

Click through for the article as well as notebooks.


Persisting an RDD in Spark

Sarfaraz Hussain takes us through caching / persisting RDDs in Apache Spark:

Spark RDD persistence is an optimization technique which saves the result of RDD evaluation in cache memory. Using this we save the intermediate result so that we can use it further if required. It reduces the computation overhead.

When we persist an RDD, each node stores the partitions of it that it computes in memory and reuses them in other actions on that RDD (or RDD derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

Read on to see how you can do this and some of the options available to you when caching. This is extremely useful when working with external data sources, as then you don’t risk hitting the external source multiple times.
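
The effect of persistence on repeated actions can be sketched with a toy model in plain Python. `LazyDataset` is a hypothetical stand-in for an RDD, not Spark's implementation; in PySpark the corresponding calls would be `rdd.persist()` or `rdd.cache()`. The counter tracks how often the pretend external source gets read.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD: recomputed on every action unless persisted."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute
        self._persisted = False
        self._cache: Optional[List[int]] = None

    def persist(self) -> "LazyDataset":
        # Only marks the dataset; nothing is computed until the first action.
        self._persisted = True
        return self

    def _materialize(self) -> List[int]:
        if not self._persisted:
            return self._compute()
        if self._cache is None:
            self._cache = self._compute()  # the first action fills the cache
        return self._cache

    def count(self) -> int:
        return len(self._materialize())

    def sum(self) -> int:
        return sum(self._materialize())


reads = {"count": 0}


def read_external_source() -> List[int]:
    """Pretend external read; counts how many times it is hit."""
    reads["count"] += 1
    return [1, 2, 3, 4]


# Two actions without persistence: the source is read twice.
uncached = LazyDataset(read_external_source)
uncached.count()
uncached.sum()
reads_without_persist = reads["count"]

# Two actions with persistence: the source is read once.
reads["count"] = 0
cached = LazyDataset(read_external_source).persist()
cached.count()
cached.sum()
reads_with_persist = reads["count"]

print(reads_without_persist, reads_with_persist)
```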
