Processing Fixed-Width Files with Spark

Subhasish Guha shows how you can read a fixed-width file with Apache Spark:

A fixed-width file is a very common flat file format when working with SAP, mainframes, and web logs. Converting the data into a DataFrame using metadata is always a challenge for Spark developers. This particular article talks about all kinds of typical scenarios that a developer might face while working with a fixed-width file. This solution is generic to any fixed-width file and very easy to implement. This also takes care of the tail-safe stack as the RDD gets into the foldLeft operator.

It’s a little more complicated than with R, where stringr can handle fixed-width formats. But it’s not bad.
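To make the idea concrete, here is a minimal sketch of the approach in Scala. The layout tuples, file path, and column names are hypothetical stand-ins for the metadata the article derives; the core move is reading each line as a single text column, then folding the layout over the DataFrame and slicing one field per step with substring:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring, trim}

object FixedWidthDemo extends App {
  val spark = SparkSession.builder()
    .appName("FixedWidthDemo")
    .master("local[*]")
    .getOrCreate()

  // Hypothetical layout: (column name, 1-based start position, field length).
  // In practice this would come from your metadata source.
  val layout = Seq(("id", 1, 5), ("name", 6, 20), ("amount", 26, 10))

  // Read each line of the file as a single string column named "value".
  val raw = spark.read.text("/path/to/fixed_width_file.txt") // hypothetical path

  // Fold the layout over the DataFrame, carving out one field per step.
  // substring() is 1-based, matching how fixed-width specs are usually written.
  val parsed = layout.foldLeft(raw) { case (df, (name, start, len)) =>
    df.withColumn(name, trim(substring(col("value"), start, len)))
  }.drop("value")

  parsed.show()
}
```

Because the fields are cut by position rather than by delimiter, padding is the main hazard, which is why each slice is trimmed before it becomes a column.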
