Category: Spark

SCD2 Dimensions on Spark with Apache Hudi

David Greenshtein shows how we can build type-2 slowly changing dimensions using Apache Hudi:

Implementing SCD2 in a data lake without using an additional framework like Apache Hudi introduces the challenge of updating data stored on immutable Amazon S3 storage, and as a result requires the implementor to create multiple copies of intermediate results. This situation may lead to a significant maintenance effort and potential data loss or data inconsistency.

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. Hudi enables you to manage data at the record level in Amazon S3 and helps to handle data privacy use cases requiring record-level updates and deletes. Hudi is supported by Amazon EMR starting from version 5.28 and is automatically installed when you choose Spark, Hive, or Presto when deploying your EMR cluster.

Click through for the process.
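
To give a flavor of the record-level updates involved, here is a minimal PySpark sketch of a Hudi upsert; the table name, key columns, and S3 paths below are hypothetical, the Hudi Spark bundle is assumed to be on the classpath (it ships with EMR 5.28+), and the linked post walks through the full SCD2 logic of expiring old rows and inserting current ones.

```python
# Minimal sketch of a Hudi upsert from PySpark; names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd2-hudi").getOrCreate()

# Incoming changed records for the dimension (hypothetical staging path)
updates = spark.read.parquet("s3://example-bucket/staging/customer_updates/")

hudi_options = {
    "hoodie.table.name": "dim_customer",
    # The record key identifies which row to update in place
    "hoodie.datasource.write.recordkey.field": "customer_id",
    # The precombine field picks the latest record when keys collide
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/warehouse/dim_customer/"))
```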

Ordering and Sorting Data in Spark

Landon Robinson shows how to sort data in Spark RDDs and DataFrames:

In the analysis section of Spark Starter Guide 4.6: How to Aggregate Data, we asked these questions: “Who is the youngest cat in the data? Who is the oldest?”

Let’s use ordering in Spark as an alternative method to answer those same questions, and achieve the same result. Specifically, let’s again find the youngest and oldest cats in the data.

Click through for plenty of examples.
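
To give the idea, here is a small, self-contained sketch of ordering in PySpark; the cat data below is made up rather than being the guide's actual dataset.

```python
# Find the youngest and oldest cats by sorting; the sample data is made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ordering").getOrCreate()

cats = spark.createDataFrame(
    [("Whiskers", 2), ("Tom", 12), ("Mittens", 5)],
    ["name", "age"],
)

cats.orderBy(F.col("age").asc()).show(1)   # youngest cat
cats.orderBy(F.col("age").desc()).show(1)  # oldest cat
```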

sparklyr 1.6 Released

Carly Driggers announces a new release of sparklyr:

Sparklyr, an LF AI & Data Foundation Incubation Project, has released version 1.6! Sparklyr is an R Language package that lets you analyze data in Apache Spark, the well-known engine for big data processing, while using familiar tools in R. The R Language is widely used by data scientists and statisticians around the world and is known for its advanced features in statistical computing and graphics. 

Click through to see the changes.

Caching versus Persisting in Spark

The Hadoop in Real World team explains a subtle difference:

The cache() and persist() functions are used to cache the intermediate results of an RDD, DataFrame, or Dataset. You can mark an RDD, DataFrame, or Dataset to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, the objects behind the RDD, DataFrame, or Dataset on which cache() or persist() was called will be kept in memory or at the configured storage level on the nodes.

That’s the similarity, but click through for the difference.
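
For reference, the two calls look like this in PySpark; the short version, per the Spark documentation, is that cache() is persist() with the default storage level, while persist() lets you choose one.

```python
# cache() vs. persist(): same mechanism, different control over the storage level.
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()

df = spark.range(1_000_000)
df.cache()       # uses the default storage level
df.count()       # the first action materializes the cached data

df2 = spark.range(1_000_000)
df2.persist(StorageLevel.DISK_ONLY)  # an explicitly chosen storage level
df2.count()
```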

Spark Performance in Azure Synapse Analytics

Euan Garden shares some numbers around Apache Spark performance in Azure Synapse Analytics:

To compare the performance, we derived queries from TPC-DS with 1TB scale and ran them on an 8-node Azure E8V3 cluster (15 executors, each with 28 GB of memory and 4 cores). Even though our version running inside Azure Synapse today is a derivative of Apache Spark™ 2.4.4, we compared it with the latest open-source release of Apache Spark™ 3.0.1 and saw that Azure Synapse was 2x faster in total runtime for the Test-DS comparison.

Click through for several techniques the Azure Synapse Analytics team has implemented to make some significant performance improvements. It’s still slower than Databricks, but considerably faster than the open-source Apache Spark baseline.

Spark on Windows Subsystem for Linux 2

Gavin Campbell tries out Spark on Linux on Windows:

I’m not a frequent user of Windows, but I understand getting dependencies installed for local development can sometimes be a bit of a pain. I’m using an Azure VM, but these instructions should work on a regular Windows 10 installation. Since I’m not a “Windows Insider”, I followed the manual steps here to get WSL installed, then upgraded to WSL2. The steps are reproduced here for convenience:

Click through for the installation steps.

Apache Spark 3.1 Released

Hyukjin Kwon, et al, announce Apache Spark 3.1:

Various new SQL features are added in this release. The widely used standard CHAR/VARCHAR data types are added as variants of the supported String types. More built-in functions (e.g., width_bucket (SPARK-21117) and regexp_extract_all (SPARK-24884)) were added. The current number of built-in operators/functions has now reached 350. More DDL/DML/utility commands have been enhanced, including INSERT (SPARK-32976), MERGE (SPARK-32030) and EXPLAIN (SPARK-32337). Starting from this release, in the Spark WebUI, the SQL plans are presented in a simpler and structured format (i.e., using EXPLAIN FORMATTED).

There have been quite a few advancements around the SQL side.
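
The two new built-in functions mentioned in the announcement are easy to try from spark.sql, assuming a Spark 3.1+ session:

```python
# A quick look at two functions new in Spark 3.1: width_bucket and regexp_extract_all.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark31-sql").getOrCreate()

# width_bucket(value, min, max, numBuckets): which of 5 equal-width
# buckets between 0.0 and 10.0 does the value 5.3 fall into?
spark.sql("SELECT width_bucket(5.3, 0.0, 10.0, 5) AS bucket").show()

# regexp_extract_all returns every match of a capture group, not just the first
spark.sql(
    r"SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1) AS starts"
).show()
```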

Installing Spark on Windows Subsystem for Linux

David Alcock wants Spark, but not Windows Spark:

The post won’t cover any instructions for installing Ubuntu; instead, I’ll assume you’ve already installed it and downloaded the tgz file from the Apache Spark download page (Step 3 in the above link).

Let’s go straight into the terminal window and get going! I’ve put the commands in bold text (don’t include the $) so anyone who prefers to ignore my gibberish can pick them out a bit more easily!

Click through for the instructions.

Power BI Connector for Databricks

Stefania Leone, et al, announce general availability of the Power BI connector for Databricks:

We are excited to announce General Availability (GA) of the Microsoft Power BI connector for Databricks for Power BI Service and Power BI Desktop 2.85.681.0. Following the public preview, we have already seen strong customer adoption, so we are pleased to extend these capabilities to our entire customer base. The native Power BI connector for Databricks in combination with the recently launched SQL Analytics service provides Databricks customers with a first-class experience for performing BI workloads directly on their Delta Lake. SQL Analytics allows customers to operate a multi-cloud lakehouse architecture that provides data warehousing performance at data lake economics for up to 4x better price/performance than traditional cloud data warehouses.

This is easier to work with than the Apache Spark connector, and it looks like it should be faster than that connector as well.
