In the upcoming Apache Spark 2.0 release, we have substantially expanded the SQL standard capabilities. In this brief blog post, we will introduce subqueries in Apache Spark 2.0, including their limitations, potential pitfalls, and future expansions, and through a notebook, we will explore both the scalar and predicate types of subqueries, with short examples that you can try yourself.
A subquery is a query that is nested inside of another query. A subquery as a source (inside a SQL FROM clause) is technically also a subquery, but it is beyond the scope of this post. There are basically two kinds of subqueries: scalar and predicate subqueries. Within those, scalar subqueries come in uncorrelated and correlated forms, and predicate subqueries can be nested.
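To make the distinction concrete, here is a minimal Spark SQL sketch of both flavors; the table and column names (sales, customers, and so on) are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("subqueries").getOrCreate()

// Uncorrelated scalar subquery: the inner query returns a single value.
spark.sql("""
  SELECT name, amount
  FROM sales
  WHERE amount > (SELECT avg(amount) FROM sales)
""").show()

// Predicate subquery: IN tests each row against the inner result set.
spark.sql("""
  SELECT name
  FROM customers
  WHERE id IN (SELECT customer_id FROM sales WHERE amount > 100)
""").show()
```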
They also link to a Notebook which you can use to follow along. If you’re interested in window functions, here are notes from Spark 1.4.
We’ll look at both zTot and nTot, and consider the player’s age and experience. The latter is potentially important because there have been shifts in the ages at which players joined the league over the timespan we are considering. It used to be rare for players to skip college; then it wasn’t; now they are required to play at least one year. It will be interesting to see whether the numbers look different when viewed by age versus by experience.
We start with the RDD containing all the raw stats, z-scores, and normalized z-scores. Another piece of data to consider is how a player’s z-score and normalized z-score change each year, so we’ll calculate the change in both from year to year. We’ll save off two sets of data: one a key-value pairing of age and values, and one a key-value pairing of experience and values. (Note that in this analysis, we disregard all players who played in 1980, as we don’t have sufficient data to determine their experience level.)
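A rough sketch of that calculation, assuming a simple record shape; the case class and all field names here are my own stand-ins, not the post’s code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class Stat(player: String, year: Int, age: Int, zTot: Double, nTot: Double)

val sc = new SparkContext(new SparkConf().setAppName("zscore-deltas").setMaster("local[*]"))

val stats = sc.parallelize(Seq(
  Stat("player-a", 1990, 27, 12.1, 0.95),
  Stat("player-a", 1991, 28, 13.0, 0.97)
))

// Pair each season with the previous one via a self-join on (player, year).
val byYear     = stats.map(s => ((s.player, s.year), s))
val byNextYear = stats.map(s => ((s.player, s.year + 1), s))

val deltas = byNextYear.join(byYear).map { case (_, (prev, cur)) =>
  (cur, cur.zTot - prev.zTot, cur.nTot - prev.nTot)
}

// Key the changes two ways: by age, and by experience (seasons since a
// player's first appearance; the post drops 1980 players for this reason).
val debuts       = stats.map(s => (s.player, s.year)).reduceByKey(math.min)
val byAge        = deltas.map { case (s, dz, dn) => (s.age, (dz, dn)) }
val byExperience = deltas
  .map { case (s, dz, dn) => (s.player, (s.year, dz, dn)) }
  .join(debuts)
  .map { case (_, ((year, dz, dn), debut)) => (year - debut, (dz, dn)) }
```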
Jordan also looks at player performance over time and makes data analysis look pretty easy.
We are excited to announce the General Availability (GA) of Databricks Community Edition (DCE). As a free version of the Databricks service, DCE enables everyone to learn and explore Apache Spark, by providing access to a simple, integrated development environment for data analysts, data scientists and engineers with high quality training materials and sample application notebooks.
Less than four months ago, at Spark Summit New York, we introduced the Databricks Community Edition (DCE) beta. Its introduction generated tremendous interest, with thousands of people requesting accounts. Today, we are delighted to report that more than 8,000 users have signed up for DCE, many of them using the service heavily. The top 10% of active users average over 6 hours per week and have each executed over 10,000 commands.
They also just launched an introductory Spark course on edX yesterday. If you’re interested in Spark but haven’t had the time to learn it, this might be a good course to take.
The Databricks just-in-time data platform takes a holistic approach to solving the enterprise security challenge by building all the facets of security — encryption, identity management, role-based access control, data governance, and compliance standards — natively into the data platform with DBES.
- Encryption: Provides strong encryption at rest and in flight with best-in-class standards such as SSL and keys stored in AWS Key Management Service (KMS).
- Integrated Identity Management: Facilitates seamless integration with enterprise identity providers via SAML 2.0 and Active Directory.
- Role-Based Access Control: Enables fine-grained access management for every component of the enterprise data infrastructure, including files, clusters, code, application deployments, dashboards, and reports.
- Data Governance: Guarantees the ability to monitor and audit all actions taken in every aspect of the enterprise data infrastructure.
- Compliance Standards: Achieves security compliance standards that exceed the high standards of FedRAMP as part of Databricks’ ongoing DBES strategy.
In short, DBES will provide holistic security in every aspect of the entire big data platform.
As enterprises come to depend on technologies like Spark and Hadoop, they need to have techniques and technologies to ensure that data remains secure. This is a sign of a maturing platform.
This problem is easy to solve, right? You can write scripts that run jobs in sequence, and use the output of one program as the input to another—no problem. But what if your workflow is complex and requires specific triggers, such as specific data volumes or resource constraints, or must meet strict SLAs? What if parts of your workflow don’t depend on each other and can be run in parallel?
Building your own infrastructure around this problem can seem like an attractive idea, but doing so can quickly become laborious. If, or rather when, those requirements change, modifying such a tool isn’t easy. And what if you need monitoring around these jobs? Monitoring requires another set of tools and headaches.
This is a pretty detailed look at the basics of Oozie.
The basic idea is to create a lookup table of distinct categories indexed by unique integer identifiers. The approach to avoid is to collect the unique categories to the driver, loop through them to assign an index to each to create the lookup table (as a Map or equivalent), and then broadcast the lookup table to all executors. The amount of data that can be collected at the driver is controlled by the spark.driver.maxResultSize configuration, which by default is set at 1 GB for Spark 1.6.1. Both collect and broadcast will eventually run into the physical memory limits of the driver and the executors respectively beyond a certain number of distinct categories, resulting in a non-scalable solution.
The solution is pretty interesting: build out a new RDD of unique results, and then join that set back. If you’re using SQL (including Spark SQL), I would use the DENSE_RANK() window function.
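A sketch of that approach using zipWithIndex, assuming simple (category, value) records; the data and names are mine, not the article’s:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("category-indexing").setMaster("local[*]"))

val records = sc.parallelize(Seq(("red", 1.0), ("blue", 2.0), ("red", 3.0)))

// Each distinct category gets a stable Long id without leaving the cluster.
val lookup = records.keys.distinct().zipWithIndex()

// Join the ids back onto the full dataset: (value, categoryId) pairs.
val indexed = records.join(lookup).values
```

The join replaces the driver-side collect and broadcast, so memory use stays distributed as the number of categories grows.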
At this point, an interesting question came up for us: How can we keep the data partitioned and sorted?
That’s a challenge. When we sort the entire data set, we shuffle in order to get sorted RDDs and create new partitions, which are different from the partitions we got from Step 1. And what if we do the opposite?
Sort first by creation time and then partition the data? We’ll encounter the same problem. The re-partitioning will cause a shuffle and we’ll lose the sort. How can we avoid that?
Partition→sort = losing the original partitioning
Sort→partition = losing the original sort
There’s a solution for that in Spark. In order to partition and sort in Spark, you can use repartitionAndSortWithinPartitions.
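As a hedged sketch of how that call fits together, using the common composite-key trick (all names and the data here are my own illustration, not the article’s code):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partition-and-sort").setMaster("local[*]"))

// Hypothetical (userId, creationTime) events: we want each user's events
// in one partition, sorted by time.
val events = sc.parallelize(Seq((1, 300L), (2, 100L), (1, 100L), (2, 200L)))

// Use a composite (userId, creationTime) key for sorting, but partition
// only on the userId part of the key.
val keyed = events.map { case (user, time) => ((user, time), ()) }

val partitioner = new HashPartitioner(4) {
  override def getPartition(key: Any): Int = key match {
    case (user: Int, _) => super.getPartition(user)
    case other          => super.getPartition(other)
  }
}

// One shuffle: data lands partitioned by user and sorted by (user, time).
val sorted = keyed.repartitionAndSortWithinPartitions(partitioner)
```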
This is an interesting solution to an ever-more-common problem.
Although the data involved is not large in volume, the types of data processing, data analytics, and machine-learning techniques used in this area are common to many Apache Hadoop use cases. So, fantasy sports analytics provides a good (and fun) use case for exploring the Hadoop ecosystem.
Apache Spark is a natural fit in this environment. As a data processing platform with embedded SQL and machine-learning capabilities, Spark gives programmatic access to data while still providing an easy SQL access point and simple APIs to churn through the data. Users can write code in Python, Java, or Scala, and then use Apache Hive, Apache Impala (incubating), or even Cloudera Search (Apache Solr) for exploratory analysis.
Baseball was my introduction to statistics, and I think that fantasy sports is a great way of driving interest in stats and machine learning. I’m looking forward to the other two parts of this series.
With the emergence of Spark as a unified computing engine, developers can perform ETL and advanced analytics in both continuous (streaming) and batch mode either programmatically (using Scala, Java, Python, or R) or with procedural SQL (using Spark SQL or Hive QL).
With MapR converging the data management platform, you can now take a preferential Spark-first approach. This differs from the traditional approach of starting with extended Hadoop tools and then adding Spark as part of your big data technology stack. As a unified computing engine, Spark can be used for faster batch ETL and analytics (with Spark core instead of MapReduce and Hive), machine learning (with Spark MLlib instead of Mahout), and streaming ETL and analytics (with Spark Streaming instead of Storm).
MapReduce is so 2012…
Spark is built around the concept of Resilient Distributed Datasets. If you have not read Matei Zaharia et al.’s paper on the topic, I highly recommend it:
Spark exposes RDDs through a language-integrated API similar to DryadLINQ and FlumeJava, where each dataset is represented as an object and transformations are invoked using methods on these objects.
Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter). They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system. Examples of actions include count (which returns the number of elements in the dataset), collect (which returns the elements themselves), and save (which outputs the dataset to a storage system). Like DryadLINQ, Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations.
In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM. Users can also request other persistence strategies, such as storing the RDD only on disk or replicating it across machines, through flags to persist. Finally, users can set a persistence priority on each RDD to specify which in-memory data should spill to disk first.
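To ground the terminology, here is a tiny sketch of that model; the file paths and the filter predicate are invented:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

// Transformations (filter, map) are lazy: nothing runs yet.
val lines  = sc.textFile("hdfs:///logs/app.log")
val errors = lines.filter(_.contains("ERROR"))

// Mark the RDD for reuse; MEMORY_AND_DISK spills to disk when RAM is short.
errors.persist(StorageLevel.MEMORY_AND_DISK)

// Actions trigger the pipelined computation.
val n = errors.count()                      // returns a value to the driver
errors.saveAsTextFile("hdfs:///out/errors") // exports to a storage system
```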
The link also has a video of their initial presentation at NSDI. Check it out.