Gilad Moscovitch walks us through a common data cleansing problem with Spark data frames:
A problem can arise when one of the inner fields of the JSON has undesired non-JSON values in some of the records. For instance, an inner field might contain HTTP errors, which would be interpreted as a string rather than as a struct. As a result, our schema would look like:

root
 |-- headers: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |-- requestBody: string (nullable = true)

Instead of:

root
 |-- headers: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |-- requestBody: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)

When trying to explode a “string” type, we will get a type mismatch error:

org.apache.spark.sql.AnalysisException: Can't extract value from requestBody#10
Click through to see how to handle this scenario cleanly.
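
To make the problem in the excerpt concrete, here is a minimal sketch of one possible pre-cleaning step. It is not necessarily the approach the linked article takes. It assumes newline-delimited JSON records, and the helper name `normalize_request_body` and the sample records are hypothetical. The idea is to replace any `requestBody` value that is not a JSON object with null before the lines reach schema inference.

```python
import json


def normalize_request_body(line, field="requestBody"):
    """Return the JSON line with `field` set to null unless it holds a JSON object.

    Hypothetical helper for illustration; a missing field is also written as null.
    """
    record = json.loads(line)
    if not isinstance(record.get(field), dict):
        record[field] = None
    return json.dumps(record)


# Hypothetical sample records: the second carries an HTTP error string
# where a nested object is expected.
raw_lines = [
    '{"headers": {"items": [{"k": "v"}]}, "requestBody": {"items": [{"k": "v"}]}}',
    '{"headers": {"items": [{"k": "v"}]}, "requestBody": "HTTP 500 Internal Server Error"}',
]

cleaned = [normalize_request_body(line) for line in raw_lines]

for line in cleaned:
    print(line)
```

Once every `requestBody` is either an object or null, schema inference no longer has string values to reconcile with the nested structure, so the field should come out as a struct. The cleaned lines can then be handed to Spark's JSON reader in place of the raw ones.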