Recently, my company faced the serious challenge of loading a 10 million rows of CSV-formatted geographic data to MongoDB in real-time.
We first tried to make a simple Python script to load CSV files in memory and send data to MongoDB. Processing 10 million rows this way took 26 minutes!
26 minutes for processing a dataset in real-time is unacceptable so we decided to proceed differently.
I’m not sure the test was totally fair, but the results comport to my biases… There is some good advice here: storing data in optimized formats (Parquet in this instance) can make a big difference, Spark is useful for ETL style operations, and Scala is generally the fastest language in the Spark world.