Databricks UDF Performance Testing

Tristan Robinson shares some performance comps for different Azure Databricks scenarios:

I’ve recently been spending quite a bit of time on the Azure Databricks platform, and while learning decided it was worth using it to experiment with some common data warehousing tasks in the form of data cleansing. As Databricks provides us with a platform to run a Spark environment on, it offers options to use cross-platform APIs that allow us to write code in Scala, Python, R, and SQL within the same notebook. As with most things in life, not everything is equal and there are potential differences in performance between them. In this blog, I will explain the tests I produced with the aim of outlining best practice for Databricks implementations for UDFs of this nature.

Scala is the native language for Spark – and without going into too much detail here, it will compile down faster to the JVM for processing. Under the hood, Python on the other hand provides a wrapper around the code but in reality is a Scala program telling the cluster what to do, and being transformed by Scala code. Converting these objects into a form Python can read is called serialisation / deserialisation, and its expensive, especially over time and across a distributed dataset. This most expensive scenario occurs through UDFs (functions) – the runtime process for which can be seen below. The overhead here is in (4) and (5) to read the data and write into JVM memory.

Click through for the results.  Looks like Python barely beat out Scala for the #1 position, but Scala was a little faster than Python in-class (e.g., the Scala program with a Scala SQL UDF was a little bit faster than the Python equivalent).

Related Posts

Working with Columns in Spark

Achilleus has a two-parter on working with columns in Spark. Part 1 covers some of the basic syntax and several functions: Also, we can have typed columns which is basically a column with an expression encoder specified for the expected input and return type. scala> val name = $"name".as[String]name: org.apache.spark.sql.TypedColumn[Any,String] = namescala> val name = […]

Read More

Creating Threadpools with ExecutorService in Kafka

Prasanth Nair shows how we can use Java’s ExecutorService to create threadpools for Kafka consumers: Apache Kafka is one of today’s most commonly used event streaming platforms. While using the Kafka platform, quite often, we run into a scenario where we have to process a large number of events/messages that are placed on a broker. […]

Read More


September 2018
« Aug Oct »