Vida Ha has an article on troubleshooting when writing code using the Spark APIs:
When working with large datasets, you will have bad input that is malformed or not what you expect. I recommend deciding proactively, for your use case, whether you can drop bad input, whether you want to try fixing and recovering it, or whether you need to investigate why your input data is bad.
A filter() operation is a great way to keep only your good input records, or to isolate the bad ones if you want to investigate and debug them further. If you want to fix your input data, or drop the records you cannot fix, then a flatMap() operation is a great way to accomplish that.
This is a good set of tips.
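For concreteness, here is a minimal PySpark sketch of the filter()/flatMap() pattern she describes. The input format, the parsing helper, and the file path are assumptions made up for illustration, not taken from the article.

```python
# Minimal sketch of the filter / flatMap pattern for handling bad input.
# The "id,value" record format, parse helper, and input path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bad-input-handling").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")  # hypothetical input path

def is_well_formed(line):
    # Assume each record should look like "id,value" with a numeric value.
    parts = line.split(",")
    return len(parts) == 2 and parts[1].strip().isdigit()

# filter(): keep only the good records, or invert the predicate to
# collect the bad ones for debugging.
good = lines.filter(is_well_formed)
bad = lines.filter(lambda line: not is_well_formed(line))

def fix_or_drop(line):
    # flatMap(): return a one-element list for records we can parse
    # (possibly after cleaning them up) and an empty list to drop the rest.
    parts = [p.strip() for p in line.split(",")]
    if len(parts) == 2 and parts[1].isdigit():
        return [(parts[0], int(parts[1]))]
    return []  # drop unrecoverable records

parsed = lines.flatMap(fix_or_drop)

print("bad records:", bad.count())
print("sample parsed records:", parsed.take(5))
```

The point of flatMap() here is that each input record can yield zero or one output records, so fixing and dropping are handled in a single pass.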