Press "Enter" to skip to content

Category: Spark

Architecting a Jenkins Replacement

Li Haoyi takes us through an internal Databricks tool for continuous integration:

Runbot is a bespoke continuous integration (CI) solution developed specifically for Databricks’ needs. Originally developed in 2019, Runbot incrementally replaces our aging Jenkins infrastructure with something more performant, scalable, and user friendly for both users and maintainers of the service. This blog post will explore the motivations behind developing Runbot, the core design decisions that went into it, and how we used it to greatly improve the experience of all the developers within the Databircks engineering organization.

It doesn’t look like the tool is available externally, but it’s an interesting read and helps understand some of the “why” behind the solution.

Comments closed

Databricks Integration with Git Repos

Ka-Hing Chueng and Vaibhav Sethi announce Databricks Repos is now generally available:

Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that Databricks Repos is now generally available.

Databricks Repos was created to solve a persistent problem for data teams: most tools used by data engineering/machine learning practitioners offer poor or no integration with Git version control systems, forcing them to navigate through multiple files, steps and UIs to simply review and commit code. Not only is this time-consuming, but it’s also error-prone.

This has been a bit of a pain point with Databricks in the past, and they’ve come up with this solution. Given that Azure Synapse Analytics has some of the same pain points, I’d expect we’ll see something similar in time.

Comments closed

New in SQL Server Big Data Clusters

Daniel Coelho has an update on what’s available in SQL Server Big Data Clusters:

SQL Server Big Data Clusters (BDC) is a capability brought to market as part of the SQL Server 2019 release. Big Data Clusters extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. It is available exclusively to run on Linux containers, orchestrated by Kubernetes, and can be deployed in multiple-cloud providers or on-premises.

Today, we’re proud to announce the release of the latest cumulative update, CU13, for SQL Server Big Data Clusters which includes important changes and capabilities:

Updating to the most recent production-ready version of Spark (as of today) is a nice upgrade.

Comments closed

pyspark.pandas in Apache Spark 3.2

Hyukjin Kwon and Xinrong Meng announce a built-in pandas API for Apache Spark 3.2:

We’re thrilled to announce the pandas API as part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible library and has grown rapidly to become one of the standard data science libraries. Now pandas users can leverage the pandas API on their existing Spark clusters.

A few years ago, we launched Koalas, an open source project that implements the pandas DataFrame API on top of Spark, which became widely adopted among data scientists. Recently, Koalas was officially merged into PySpark by SPIP: Support pandas API layer on PySpark as part of Project Zen (see also Project Zen: Making Data Science Easier in PySpark from Data + AI Summit 2021).

pandas users can now scale their workloads with one simple line change in the upcoming Spark 3.2 release:

Click through to see more details on the change.

Comments closed

Adaptive Query Execution in Spark 3

Amarjeet Singh explains what Adaptive Query Execution is in Apache Spark:

As we all know optimization plays an important role in the success of spark SQL. Therefore, a lot of work has been done in this direction. Before spark 3.0, cost-based optimization was a major hit in which different stages related to cost (based on time efficiency and estimated CPU and I/O usage) are compared and executes the strategy which minimizes the cost. But, because of outdated statistics, it has become a sub-optimal technique. Therefore in spark 3.0, Adaptive Query Execution was introduced which aims to solve this by reoptimizing and adjusts the query plans based on runtime statistics collected during query execution. Thus re-optimization of the execution plan occurs after every stage as each stage gives the best place to do the re-optimization.

Item number 2 from the list is also available in SQL Server, giving you an idea that this is an active battleground for query processing in data platform technologies.

Comments closed

Why 200 Tasks for a Spark Execution?

The Hadoop in Real World team explains why you might see 200 tasks when running a Spark job:

It is quite common to see 200 tasks in one of your stages and more specifically at a stage which requires wide transformation. The reason for this is, wide transformations in Spark requires a shuffle. Operations like join, group by etc. are wide transform operations and they trigger a shuffle.

Read on to learn why 200, and whether 200 is the right number for you.

Comments closed

SCD Type 2 with Delta Lake

Chris Williams continues a series on slowly changing dimensions in Delta Lake:

Type 2 SCD is probably one of the most common examples to easily preserve history in a dimension table and is commonly used throughout any Data Warehousing/Modelling architecture. Active rows can be indicated with a boolean flag or a start and end date. In this example from the table above, all active rows can be displayed simply by returning a query where the end date is null.

Read on to see how you can implement this pattern using Delta Lake’s capabilities.

Comments closed

Databricks Notebook Discovery via Notebooks

Darin McBeth creates a meta-noterbook to keep track of notebooks:

Elsevier has been a customer of Databricks for about six years. There are now hundreds of users and tens of thousands of notebooks across their workspace. To some extent, Elsevier’s Databricks users have been a victim of their own success, as there are now too many notebooks to search through to find some earlier work.

The Databricks workspace does provide a keyword search, but we often find the need to define advanced search criteria, such as creator, last updated, programming language, notebook commands and results.

Interestingly, we managed to achieve this functionality using a 100% notebook-based solution with Databricks functionalities. As you will see, this makes it easy to set up in a customer’s Databricks environment.

Read on to see how.

Comments closed