Spark: DataFrame To RDD For Data Cleansing

Gilad Moscovitch walks us through a common data cleansing problem with Spark data frames:

A problem can arise when one of the inner fields of the json, has undesired non-json values in some of the records.

For instance, an inner field might contains HTTP errors, that would be interpreted as a string, rather than as a struct.

As a result, our schema would look like:

root

|– headers: struct (nullable = true)

| |– items: array (nullable = true)

| | |– element: struct (containsNull = true)

|– requestBody: string (nullable = true)

Instead of

root

|– headers: struct (nullable = true)

| |– items: array (nullable = true)

| | |– element: struct (containsNull = true)

|– requestBody: struct (nullable = true)

| |– items: array (nullable = true)

| | |– element: struct (containsNull = true)

When trying to explode a “string” type, we will get a miss type error:

org.apache.spark.sql.AnalysisException: Can’t extract value from requestBody#10

Click through to see how to handle this scenario cleanly.

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31