Auto-Scaling Amazon ElasticMapReduce

Brandon Schller gives us some tips on sizing and scaling ElasticMapReduce:

EMR scaling is more complex than simply adding or removing nodes from the cluster. One common misconception is that scaling in Amazon EMR works exactly like Amazon EC2 scaling. With EC2 scaling, you can add/remove nodes almost instantly and without worry, but EMR has more complexity to it, especially when scaling a cluster down. This is because important data or jobs could be running on your nodes.

To prevent data loss, Amazon EMR scaling ensures that your node has no running Apache Hadoop tasks or unique data that could be lost before removing your node. It is worth considering this decommissioning delay when resizing your EMR cluster. By understanding and accounting for how this process works, you can avoid issues that have plagued others, such as slow cluster resizes and inefficient automatic scaling policies.

If you’re using EMR today or think you might use it in the future, you should read this.

Related Posts

Auto ML With SQL Server 2019 Big Data Clusters

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem: H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and Both APIs use the same underlying algorithm implementations, […]

Read More

Erasure Coding In Hadoop

Guy Shilo explains erasure coding, a new feature in Hadoop 3: The benefits are, of course, space-saving, and for large files also improved performance (blocks striped across datanodes can be read in parallel, and less blocks are written because there is no x3 replication). The larger the file the more notable is the performance gain. […]

Read More


July 2018
« Jun Aug »