Li Jin talks about a performance optimization coming in Apache Spark 2.3:
To enable data scientists to leverage the value of big data, Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead. As a result, many data pipelines define UDFs in Java and Scala, and then invoke them from Python.
Vectorized UDFs built on top of Apache Arrow bring you the best of both worlds—the ability to define low-overhead, high performance UDFs entirely in Python.
This looks like a good performance improvement coming to PySpark, bringing it closer to Scala/Java performance with respect to UDFs.
Comments closed