Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystems, especially for Kafka-based data pipelines. Starting with the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module originated from Databricks’ open source project Avro Data Source for Apache Spark (referred to as spark-avro from now on). In addition, it provides:
- New functions from_avro() and to_avro() to read and write Avro data within a DataFrame instead of just files.
- Avro logical types support, including Decimal, Timestamp, and Date types. See the related schema conversions for details.
- 2X read throughput improvement and 10% write throughput improvement.
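To make the logical types bullet above more concrete, here is a minimal sketch in plain Python (no Spark or Avro library involved) of how the Avro specification layers Date, Timestamp, and Decimal on top of primitive types: a date is an int counting days since the Unix epoch, a timestamp-millis is a long counting milliseconds since the epoch in UTC, and a decimal is a byte string holding the two's-complement, big-endian unscaled value, with the scale stored in the schema. The function names are hypothetical and chosen only for illustration. This is not how spark-avro implements the conversion internally.

```python
from datetime import date, datetime, timezone
from decimal import Decimal

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_date(d: date) -> int:
    """Avro `date`: number of days since 1970-01-01, stored as an int."""
    return (d - EPOCH_DATE).days


def encode_timestamp_millis(ts: datetime) -> int:
    """Avro `timestamp-millis`: milliseconds since the epoch (UTC), stored as a long.

    Expects a timezone-aware datetime; sub-millisecond precision is dropped.
    """
    delta = ts - EPOCH_TS
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


def encode_decimal(value: Decimal, scale: int) -> bytes:
    """Avro `decimal`: two's-complement big-endian bytes of the unscaled integer."""
    scaled = value.scaleb(scale)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {scale} fractional digits")
    unscaled = int(scaled)
    # Smallest number of bytes that still leaves room for the sign bit.
    magnitude_bits = (unscaled if unscaled >= 0 else ~unscaled).bit_length()
    length = (magnitude_bits + 1 + 7) // 8
    return unscaled.to_bytes(length, "big", signed=True)


def decode_decimal(data: bytes, scale: int) -> Decimal:
    """Inverse of encode_decimal: rebuild the Decimal from bytes plus the schema's scale."""
    unscaled = int.from_bytes(data, "big", signed=True)
    return Decimal(unscaled).scaleb(-scale)


if __name__ == "__main__":
    print(encode_date(date(2000, 1, 1)))
    print(encode_timestamp_millis(datetime(2000, 1, 1, tzinfo=timezone.utc)))
    print(encode_decimal(Decimal("12.34"), scale=2).hex())
    print(decode_decimal(encode_decimal(Decimal("-12.34"), scale=2), scale=2))
```

Because the scale lives in the schema rather than in the encoded bytes, a reader needs the schema to interpret a decimal value correctly, which is one reason the schema conversion rules matter when moving data between Avro and Spark SQL types.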
In this blog, we examine each of the above features through examples, giving you a flavor of their easy API usage, performance improvements, and merits.
Avro is one of the better row-oriented data formats in the Hadoop world, so it’s good to see built-in support here.