Load Testing Spark To MongoDB

Abdelghani Tassi has a quick load test to see how fast Spark can load data into MongoDB:

Recently, my company faced the serious challenge of loading a 10 million rows of CSV-formatted geographic data to MongoDB in real-time.

We first tried to make a simple Python script to load CSV files in memory and send data to MongoDB. Processing 10 million rows this way took 26 minutes!

26 minutes for processing a dataset in real-time is unacceptable so we decided to proceed differently.

I’m not sure the test was totally fair, but the results comport to my biases…  There is some good advice here:  storing data in optimized formats (Parquet in this instance) can make a big difference, Spark is useful for ETL style operations, and Scala is generally the fastest language in the Spark world.

Related Posts

What’s New In Ambari 2.7

Paul Codding and Kat Petre share some of the new features in Ambari 2.7: With this release, we wanted to make Ambari more enjoyable to use every day, simplify finding and using our API, and unblock teams managing very large clusters.  Here is a preview of a few features we’re excited to share with you:Revamped […]

Read More

Working With Images In Spark 2.4

Tomas Nykodym and Weichen Xu give us an update on working with images in the most recent version of Apache Spark: An image data source addresses many of these problems by providing the standard representation you can code against and abstracts from the details of a particular image representation.Apache Spark 2.3 provided the ImageSchema.readImages API (see Microsoft’s post […]

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Categories

December 2018
MTWTFSS
« Nov  
 12
3456789
10111213141516
17181920212223
24252627282930
31