Ivan Vazharov gives us a Databricks notebook to parse and flatten JSON using PySpark:
With Databricks you get:
- An easy way to infer the JSON schema and avoid creating it manually
- Resilience to subtle changes in the JSON schema, which won't break things
- The ability to explode nested lists into rows easily (see the notebook below)
- Speed!
The following is an example Databricks notebook (Python) demonstrating the above claims. The sample is an imaginary JSON result set containing a list of people, each with a list of car vendors, each of which has a list of car models. We want to flatten this result into a DataFrame.
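To make the target shape concrete, here is a minimal plain-Python sketch of that flattening. It is not the notebook's code: the sample document and the key names (`people`, `vendors`, `models`, `name`) are assumptions made up for illustration. Each nested loop plays the role that one `pyspark.sql.functions.explode` call would play in PySpark, turning every element of a nested list into its own row.

```python
import json

# Hypothetical sample data; the key names are assumptions, not the notebook's schema.
SAMPLE = """
{
  "people": [
    {"name": "John", "vendors": [
        {"name": "Ford", "models": ["Fiesta", "Focus"]},
        {"name": "BMW", "models": ["320", "X3"]}
    ]},
    {"name": "Peter", "vendors": [
        {"name": "Fiat", "models": ["500"]}
    ]}
  ]
}
"""


def flatten(doc):
    """Return one (person, vendor, model) tuple per innermost list element."""
    rows = []
    # One loop per nesting level: people -> vendors -> models.
    for person in doc.get("people", []):
        for vendor in person.get("vendors", []):
            for model in vendor.get("models", []):
                rows.append((person["name"], vendor["name"], model))
    return rows


rows = flatten(json.loads(SAMPLE))
for row in rows:
    print(row)
```

A person with no vendors, or a vendor with no models, contributes no rows here. PySpark's `explode` likewise produces no rows for an empty array.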
Click through for the notebook.