Vectorized UDFs For PySpark

Li Jin talks about a performance optimization coming in Apache Spark 2.3:

To enable data scientists to leverage the value of big data, Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead. As a result, many data pipelines define UDFs in Java and Scala, and then invoke them from Python.

Vectorized UDFs built on top of Apache Arrow bring you the best of both worlds—the ability to define low-overhead, high performance UDFs entirely in Python.

This looks like a good performance improvement coming to PySpark, bringing it closer to Scala/Java performance with respect to UDFs.

Related Posts

Using Convolutional Neural Networks To Recognize Features In Images

Michael Grogan shows how you can use Keras to perform image recognition with a convolutional neural network: VGG16 is a built-in neural network in Keras that is pre-trained for image recognition. Technically, it is possible to gather training and test data independently to build the classifier. However, this would necessitate at least 1,000 images, with […]

Read More

Working With Skewed Data In Pig

Dmitry Tolpeko explains how you can use the Weighted Range Partitioner in Apache Pig to work with highly skewed data: The problem is that there are 3,000 map tasks are launched to read the daily data and there are 250 distinct event types, so the mappers will produce 3,000 * 250 = 750,000 files per day. That’s […]

Read More


November 2017
« Oct Dec »