Press "Enter" to skip to content

Category: Spark

MMLSpark Is Now SynapseML

Mark Hamilton has an announcement:

Today, we’re excited to announce the release of SynapseML (previously MMLSpark), an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. Building production-ready distributed ML pipelines can be difficult, even for the most seasoned developer. Composing tools from different ecosystems often requires considerable “glue” code, and many frameworks aren’t designed with thousand-machine elastic clusters in mind. SynapseML resolves this challenge by unifying several existing ML frameworks and new Microsoft algorithms in a single, scalable API that’s usable across Python, R, Scala, and Java.

Read on to learn more about the library.
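To give a flavor of that unified API, here is a minimal sketch in PySpark using SynapseML's LightGBM wrapper, assuming the SynapseML package is attached to your cluster; the DataFrame and column names are mine, not from the announcement.

```python
# A minimal sketch, assuming the SynapseML package is installed on the cluster.
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier

# Assemble raw columns into the single vector column Spark ML expects.
# 'raw_df', 'f1'..'f3', and 'label' are illustrative names.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(raw_df)

# LightGBM trains distributed across the cluster, but through the same
# Estimator/Transformer interface as any other Spark ML stage.
model = LightGBMClassifier(featuresCol="features", labelCol="label").fit(train)
predictions = model.transform(train)
```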


Benchmarking Databricks vs Snowflake

Mostafa Mokhtar, et al., respond to some benchmarking claims:

On Nov 2, 2021, we announced that we set the official world record for the fastest data warehouse with our Databricks SQL lakehouse platform. These results were audited and reported by the official Transaction Processing Performance Council (TPC) in a 37-page document available online at tpc.org. We also shared a third-party benchmark by the Barcelona Supercomputing Center (BSC) outlining that Databricks SQL is significantly faster and more cost effective than Snowflake.

A lot has happened since then: many congratulations, some questions, and some sour grapes. We take this opportunity to reiterate that we stand by our blog post and the results: Databricks SQL provides superior performance and price performance over Snowflake, even on data warehousing workloads (TPC-DS).

Posts like this are exactly why getting rid of the DeWitt clause is important. I’d rather have Snowflake and Databricks duking it out with publicly available and testable processes. To me, the most important part of this post was its several exhortations to try it out yourself, both for the Databricks test and the Snowflake test. Make benchmarking public, including hardware choices, configuration choices, and the testing process; then I can tell for sure whether your benchmark makes sense for my use case.


GPU-Accelerated Analysis on Databricks using PyTorch + Huggingface

Srijith Rajamohan walks us through an example of sentiment analysis using the PyTorch and Huggingface libraries on Databricks:

Sentiment analysis is commonly used to analyze the sentiment present within a body of text, which could be anything from a review to an email to a tweet. Deep learning-based techniques are one of the most popular ways to perform such an analysis. However, these techniques tend to be very computationally intensive and often require the use of GPUs, depending on the architecture and the embeddings used. Huggingface (https://huggingface.co) has put together a framework with the transformers package that makes accessing these embeddings seamless and reproducible. In this work, I illustrate how to perform scalable sentiment analysis by using the Huggingface package within PyTorch and leveraging the ML runtimes and infrastructure on Databricks.

Click through for a description of the process, as well as a link to a notebook you can walk through yourself.
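For a rough idea of the pattern (a sketch, not the notebook's exact code), here is how you might wrap a Huggingface pipeline in an iterator-style pandas UDF so Spark can fan the scoring out across a cluster; the DataFrame and column names are illustrative.

```python
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf
from transformers import pipeline

@pandas_udf("string")
def sentiment_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # The iterator form loads the model once per task rather than once per
    # batch. device=0 targets the first GPU; drop it to run on CPU.
    classifier = pipeline("sentiment-analysis", device=0)
    for texts in batches:
        results = classifier(texts.tolist(), truncation=True)
        yield pd.Series([r["label"] for r in results])

# 'reviews' and its 'text' column are illustrative names.
scored = reviews.withColumn("sentiment", sentiment_udf("text"))
```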


Push-Based Shuffle in Apache Spark 3.2 via Project Magnet

Venkata Krishnan Sowrirajan and Min Shen announce that Project Magnet will be in Apache Spark 3.2:

Push-based shuffle is an implementation of shuffle where the shuffle blocks are pushed to the remote shuffle services from the mapper tasks in order to address shuffle scalability and reliability issues. In a nutshell, with push-based shuffle, a large number of small, random reads is converted into a small number of large, sequential reads, which significantly improves disk I/O efficiency and shuffle data locality.

This is explained in greater detail in an earlier blog post, Magnet: A scalable and performant shuffle architecture for Apache Spark, which you can read for more information about how we achieve push-based shuffle.

Read on to see when this matters and how you can make use of it once you’re in Spark 3.2 (whose first release was exactly two weeks ago, October 13th).
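If you want to try it, push-based shuffle is off by default. Here is a sketch of the relevant settings; as of Spark 3.2, my understanding is that this requires YARN plus the external shuffle service, so treat it as a starting point rather than a complete recipe.

```python
from pyspark.sql import SparkSession

# Push-based shuffle is opt-in. As of Spark 3.2 it is supported on YARN with
# the external shuffle service enabled; the shuffle service side needs
# additional merge-manager configuration not shown here.
spark = (SparkSession.builder
         .appName("push-based-shuffle-demo")
         .config("spark.shuffle.service.enabled", "true")  # external shuffle service
         .config("spark.shuffle.push.enabled", "true")     # enable the client-side push
         .getOrCreate())
```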


SQL User-Defined Functions in Spark SQL

Serge Rielau and Allison Wang announce a new type of user-defined function in Spark SQL:

SQL UDFs are simple yet powerful extensions to Spark SQL. As functions, they provide a layer of abstraction to simplify query construction – making SQL queries more readable and modularized. Unlike UDFs that are written in a non-SQL language, SQL UDFs are more lightweight for SQL users to create. SQL function bodies are transparent to the query optimizer thus making them more performant than external UDFs. SQL UDFs can be created as either temporary or permanent functions, be reused across multiple queries, sessions and users, and be access-controlled via Access Control Language (ACL). In this blog, we will walk you through some key use cases of SQL UDFs with examples.

I look forward to dealing with cardinality issues and performance tuning these things in 5 years.
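In the meantime, here is the shape of the thing: a hypothetical SQL UDF created and called through spark.sql, with the table and column names made up for the example.

```python
# A hypothetical SQL UDF; the body is plain SQL, so the optimizer can
# inline it rather than treating it as an opaque black box.
spark.sql("""
    CREATE OR REPLACE FUNCTION to_fahrenheit(celsius DOUBLE)
    RETURNS DOUBLE
    COMMENT 'Converts a Celsius temperature to Fahrenheit'
    RETURN celsius * 9.0 / 5.0 + 32.0
""")

# 'readings' and 'temp_c' are illustrative names.
spark.sql("SELECT temp_c, to_fahrenheit(temp_c) AS temp_f FROM readings").show()
```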


Session Windows in Spark Structured Streaming

Jungtaek Lim, et al., announce support for session windows in Spark Structured Streaming:

Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. An input can only be bound to a single window.

Sliding windows are similar to tumbling windows in that they are fixed-sized, but windows can overlap if the duration of the slide is smaller than the duration of the window, and in that case an input can be bound to multiple windows.

Session windows have a different characteristic compared to the previous two types. A session window has a dynamic length, depending on the inputs. A session window starts with an input and expands if the following input is received within the gap duration. It closes when no input is received within the gap duration after the latest input. This enables you to group events until there are no new events for a specified time duration (inactivity).

Click through for more details. You could implement session windows when querying existing data using a gaps and islands approach (where you increment the island count when you have a lagged difference greater than the cutoff point), but for streaming scenarios, it’s very handy to have this as a native window type.
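For reference, here is a sketch of what the new window type looks like in PySpark 3.2; the rate source and column names are stand-ins for a real event stream.

```python
from pyspark.sql import functions as F

# 'events' is an illustrative streaming DataFrame; the rate source emits
# 'timestamp' and 'value' columns, renamed here to play the role of user events.
events = (spark.readStream
          .format("rate")
          .load()
          .withColumnRenamed("value", "userId"))

# Group events into per-user sessions that close after 5 minutes of inactivity.
sessions = (events
            .withWatermark("timestamp", "10 minutes")
            .groupBy(F.session_window("timestamp", "5 minutes"), "userId")
            .count())
```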


Creating Delta Lake Tables in Azure Databricks

Gauri Mahajan takes us through creating new tables in a Delta Lake using Azure Databricks:

Delta lake is an open-source data format that provides ACID transactions, data reliability, query performance, data caching and indexing, and many other benefits. Delta lake can be thought of as an extension of existing data lakes and can be configured per the data requirements. Azure Databricks has a delta engine as one of the core components that facilitates delta lake format for data engineering and performance. Delta lake format is used to create modern data lake or lakehouse architectures. It is also used to build a combined streaming and batch architecture popularly known as lambda architecture.

Click through for the process.
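For reference, the short version of the mechanics in PySpark; the table and column names are illustrative.

```python
# Write an existing DataFrame out as a managed Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("sales_orders")

# The equivalent declarative form, defining an empty Delta table via SQL.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_orders_empty (
        order_id BIGINT,
        amount   DOUBLE
    ) USING DELTA
""")
```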


Architecting a Jenkins Replacement

Li Haoyi takes us through an internal Databricks tool for continuous integration:

Runbot is a bespoke continuous integration (CI) solution developed specifically for Databricks’ needs. Originally developed in 2019, Runbot incrementally replaces our aging Jenkins infrastructure with something more performant, scalable, and user friendly for both users and maintainers of the service. This blog post will explore the motivations behind developing Runbot, the core design decisions that went into it, and how we used it to greatly improve the experience of all the developers within the Databricks engineering organization.

It doesn’t look like the tool is available externally, but it’s an interesting read and helps understand some of the “why” behind the solution.
