Press "Enter" to skip to content

Category: Spark

Push-Based Shuffle in Apache Spark 3.2 via Project Magnet

Venkata Krishnan Sowrirajan and Min Shen announce that Project Magnet will be in Apache Spark 3.2:

Push-based shuffle is an implementation of shuffle where the shuffle blocks are pushed to the remote shuffle services from the mapper tasks in order to address shuffle scalability and reliability issues. In a nutshell, with push-based shuffle, a large number of small, random reads is converted into a small number of large, sequential reads, which significantly improves disk I/O efficiency and shuffle data locality.

This is explained in greater detail in an earlier blog post, Magnet: A scalable and performant shuffle architecture for Apache Spark, which you can read for more information about how we achieve push-based shuffle.

Read on to see when this matters and how you can make use of it once you’re in Spark 3.2 (whose first release was exactly two weeks ago, October 13th).
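If you want to try it out, enabling push-based shuffle is mostly a configuration exercise. Here's a minimal sketch in PySpark, assuming a YARN cluster (the only supported resource manager for this in 3.2) whose external shuffle services have already been set up for block merging; the application name is illustrative:

```python
from pyspark.sql import SparkSession

# Minimal sketch: opting in to push-based shuffle on the client side.
# Assumes the node managers' external shuffle services were restarted with
# spark.shuffle.push.server.mergedShuffleFileManagerImpl set to
# org.apache.spark.network.shuffle.RemoteBlockPushResolver.
spark = (
    SparkSession.builder
    .appName("push-shuffle-demo")  # illustrative name
    .config("spark.shuffle.service.enabled", "true")  # external shuffle service is a prerequisite
    .config("spark.shuffle.push.enabled", "true")     # enables push-based shuffle for this app
    .getOrCreate()
)
```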


SQL User-Defined Functions in Spark SQL

Serge Rielau and Allison Wang announce a new type of user-defined function in Spark SQL:

SQL UDFs are simple yet powerful extensions to Spark SQL. As functions, they provide a layer of abstraction to simplify query construction – making SQL queries more readable and modularized. Unlike UDFs that are written in a non-SQL language, SQL UDFs are more lightweight for SQL users to create. SQL function bodies are transparent to the query optimizer, thus making them more performant than external UDFs. SQL UDFs can be created as either temporary or permanent functions, be reused across multiple queries, sessions and users, and be access-controlled via Access Control Language (ACL). In this blog, we will walk you through some key use cases of SQL UDFs with examples.
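To give a sense of the syntax from the announcement, here's a minimal sketch of a scalar SQL UDF, wrapped in spark.sql() calls so the example is self-contained and assuming a runtime that supports the new syntax; the function name and body are my own illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The function body is plain SQL, so the optimizer can see straight through it.
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION to_fahrenheit(celsius DOUBLE)
    RETURNS DOUBLE
    RETURN celsius * 9.0 / 5.0 + 32.0
""")

# Reuse it like any built-in function.
spark.sql("SELECT to_fahrenheit(20.0) AS fahrenheit").show()
```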

I look forward to dealing with cardinality issues and performance tuning these things in 5 years.


Session Windows in Spark Structured Streaming

Jungtaek Lim, et al., announce support for session windows in Spark Structured Streaming:

Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. An input can only be bound to a single window.

Sliding windows are similar to tumbling windows in that they are fixed-sized, but windows can overlap if the duration of the slide is smaller than the duration of the window, and in that case an input can be bound to multiple windows.

Session windows have a different characteristic compared to the previous two types. A session window has a dynamic window length, depending on the inputs. A session window starts with an input and expands itself if a following input is received within the gap duration. A session window closes when no input is received within the gap duration after the latest input. This enables you to group events until there are no new events for a specified time duration (inactivity).
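The new API slots in next to the existing tumbling/sliding window function. A minimal sketch, assuming a streaming DataFrame named events with an eventTime timestamp column and a userId column (both names are illustrative):

```python
from pyspark.sql import functions as F

# Group rows into per-user sessions that close after 5 minutes of inactivity.
sessionized = (
    events
    .withWatermark("eventTime", "10 minutes")
    .groupBy(F.session_window("eventTime", "5 minutes"), F.col("userId"))
    .count()
)
```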

Click through for more details. You could implement session windows when querying existing data using a gaps and islands approach (where you increment the island count when you have a lagged difference greater than the cutoff point), but for streaming scenarios, it’s very handy to have this as a native window type.
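For comparison, here's a sketch of that gaps-and-islands approach over batch data, using the same illustrative events DataFrame and a 5-minute (300-second) cutoff:

```python
from pyspark.sql import Window, functions as F

w = Window.partitionBy("userId").orderBy("eventTime")
prev = F.lag("eventTime").over(w)

islands = (
    events
    # Flag a new island when there is no prior event or the gap exceeds the cutoff.
    .withColumn(
        "new_session",
        F.when(
            prev.isNull()
            | ((F.col("eventTime").cast("long") - prev.cast("long")) > 300),
            1,
        ).otherwise(0),
    )
    # A running sum of the flags numbers the islands, i.e. the sessions.
    .withColumn("session_id", F.sum("new_session").over(w))
)
```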


Creating Delta Lake Tables in Azure Databricks

Gauri Mahajan takes us through creating new tables in a Delta Lake using Azure Databricks:

Delta Lake is an open-source data format that provides ACID transactions, data reliability, query performance, data caching and indexing, and many other benefits. Delta Lake can be thought of as an extension of existing data lakes and can be configured per the data requirements. Azure Databricks has a Delta engine as one of its core components, which facilitates the Delta Lake format for data engineering and performance. The Delta Lake format is used to create modern data lake or lakehouse architectures. It is also used to build a combined streaming and batch architecture, popularly known as the lambda architecture.
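As a quick taste of what the article walks through, here's a minimal sketch of creating a Delta table via SQL and via the DataFrame API; the table and column names are illustrative, and spark and df are assumed to exist (as they would in a Databricks notebook):

```python
# SQL: create an empty managed table stored in Delta format.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        id INT,
        amount DOUBLE,
        sale_date DATE
    ) USING DELTA
""")

# DataFrame API: write an existing DataFrame out as a Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("sales_from_df")
```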

Click through for the process.


Architecting a Jenkins Replacement

Li Haoyi takes us through an internal Databricks tool for continuous integration:

Runbot is a bespoke continuous integration (CI) solution developed specifically for Databricks’ needs. Originally developed in 2019, Runbot incrementally replaces our aging Jenkins infrastructure with something more performant, scalable, and user-friendly for both users and maintainers of the service. This blog post will explore the motivations behind developing Runbot, the core design decisions that went into it, and how we used it to greatly improve the experience of all the developers within the Databricks engineering organization.

It doesn’t look like the tool is available externally, but it’s an interesting read and helps understand some of the “why” behind the solution.


Databricks Integration with Git Repos

Ka-Hing Cheung and Vaibhav Sethi announce Databricks Repos is now generally available:

Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that Databricks Repos is now generally available.

Databricks Repos was created to solve a persistent problem for data teams: most tools used by data engineering/machine learning practitioners offer poor or no integration with Git version control systems, forcing them to navigate through multiple files, steps, and UIs simply to review and commit code. Not only is this time-consuming, but it’s also error-prone.

This has been a bit of a pain point with Databricks in the past, and they’ve come up with this solution. Given that Azure Synapse Analytics has some of the same pain points, I’d expect we’ll see something similar in time.


New in SQL Server Big Data Clusters

Daniel Coelho has an update on what’s available in SQL Server Big Data Clusters:

SQL Server Big Data Clusters (BDC) is a capability brought to market as part of the SQL Server 2019 release. Big Data Clusters extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. It runs exclusively on Linux containers, orchestrated by Kubernetes, and can be deployed across multiple cloud providers or on-premises.

Today, we’re proud to announce the release of the latest cumulative update, CU13, for SQL Server Big Data Clusters which includes important changes and capabilities:

Updating to the most recent production-ready version of Spark (as of today) is a nice upgrade.


pyspark.pandas in Apache Spark 3.2

Hyukjin Kwon and Xinrong Meng announce a built-in pandas API for Apache Spark 3.2:

We’re thrilled to announce the pandas API as part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible library and has grown rapidly to become one of the standard data science libraries. Now pandas users can leverage the pandas API on their existing Spark clusters.

A few years ago, we launched Koalas, an open source project that implements the pandas DataFrame API on top of Spark, which became widely adopted among data scientists. Recently, Koalas was officially merged into PySpark by SPIP: Support pandas API layer on PySpark as part of Project Zen (see also Project Zen: Making Data Science Easier in PySpark from Data + AI Summit 2021).

pandas users can now scale their workloads with one simple line change in the upcoming Spark 3.2 release:
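That one-line change is swapping the pandas import for its Spark-backed counterpart; the file path below is illustrative:

```python
# Before: import pandas as pd
import pyspark.pandas as ps  # the pandas API on Spark, new in 3.2

# Familiar pandas calls now execute distributed across the cluster.
psdf = ps.read_csv("data.csv")
print(psdf.head())
```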

Click through to see more details on the change.
