Gilad Moscovitch walks us through a common data cleansing problem with Spark data frames:
A problem can arise when one of the inner fields of the json, has undesired non-json values in some of the records.
For instance, an inner field might contains HTTP errors, that would be interpreted as a string, rather than as a struct.
As a result, our schema would look like:
root
|– headers: struct (nullable = true)
| |– items: array (nullable = true)
| | |– element: struct (containsNull = true)
|– requestBody: string (nullable = true)
Instead of
root
|– headers: struct (nullable = true)
| |– items: array (nullable = true)
| | |– element: struct (containsNull = true)
|– requestBody: struct (nullable = true)
| |– items: array (nullable = true)
| | |– element: struct (containsNull = true)
When trying to explode a “string” type, we will get a miss type error:
org.apache.spark.sql.AnalysisException: Can’t extract value from requestBody#10
Click through to see how to handle this scenario cleanly.