Divyansh Jain shows three techniques for handling invalid input data with Apache Spark:
Handling corrupt records is often one of the most expensive parts of writing ETL jobs, so pipelines need a good strategy for dealing with them: the larger the ETL pipeline, the more complex it becomes to handle bad records along the way. Corrupt data includes:
– Missing information
– Incomplete information
– Schema mismatch
– Differing formats or data types

Since ETL pipelines are built to be automated, production-oriented solutions must ensure pipelines behave as expected. This means that data engineers must both expect and systematically handle corrupt records.
This is the seedy underbelly of semi-structured data: you don’t have control over the data as it comes in, so you have to control the data coming out.
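As a rough illustration of what that control can look like, here is a minimal Spark (Scala) sketch of the read modes commonly used for bad records: PERMISSIVE with a corrupt-record column, DROPMALFORMED, and FAILFAST. These may not map one-to-one onto the article's three techniques, and the file path, schema, and column names below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object BadRecordsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bad-records-sketch")
      .master("local[*]")
      .getOrCreate()

    // Explicit schema with a _corrupt_record column to capture unparseable rows.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("name", StringType, nullable = true),
      StructField("_corrupt_record", StringType, nullable = true)
    ))

    // 1. PERMISSIVE (the default): keep every row; malformed input lands
    //    in the column named by columnNameOfCorruptRecord.
    val permissive = spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .json("data/input.json") // hypothetical path

    // 2. DROPMALFORMED: silently discard rows that do not match the schema.
    val dropped = spark.read
      .option("mode", "DROPMALFORMED")
      .schema(schema)
      .json("data/input.json")

    // 3. FAILFAST: throw an exception as soon as a malformed row is read.
    val strict = spark.read
      .option("mode", "FAILFAST")
      .schema(schema)
      .json("data/input.json")

    // Inspect the rows Spark could not parse (cache first, since Spark
    // restricts queries that touch only the corrupt-record column).
    permissive.cache()
    permissive.filter("_corrupt_record IS NOT NULL").show(truncate = false)

    dropped.show()
    // strict.show() // would fail the job if any record is malformed

    spark.stop()
  }
}
```

A common pattern is to load with PERMISSIVE mode, route the corrupt rows to a quarantine location for inspection, and let the clean rows continue through the pipeline.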