Vectorized UDFs For PySpark

Li Jin talks about a performance optimization coming in Apache Spark 2.3:

To enable data scientists to leverage the value of big data, Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead. As a result, many data pipelines define UDFs in Java and Scala, and then invoke them from Python.

Vectorized UDFs built on top of Apache Arrow bring you the best of both worlds—the ability to define low-overhead, high performance UDFs entirely in Python.

This looks like a good performance improvement coming to PySpark, bringing it closer to Scala/Java performance with respect to UDFs.

Related Posts

An Apache Sqoop Tutorial

Subham Sinha has an introductory-level tutorial on Apache Sqoop: For Hadoop developer, the actual game starts after the data is being loaded in HDFS. They play around this data in order to gain various insights hidden in the data stored in HDFS. So, for this analysis the data residing in the relational database management systems […]

Read More

Housing Prices In Ames, Iowa: A Kaggle Competition

Kathryn Bryant and M. Aaron Owen share their Kaggle experiences.  First, Kathryn, et al: The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and […]

Read More


November 2017
« Oct