ElasticMapReduce And RStudio

Tanzir Musabbir demonstrates how to set up Amazon ElasticMapReduce to include an RStudio edge node:

RStudio Server provides a browser-based interface for R and a popular tool among data scientists. Data scientist use Apache Spark cluster running on  Amazon EMR to perform distributed training. In a previous blog post, the author showed how you can install RStudio Server on Amazon EMR cluster. However, in certain scenarios you might want to install it on a standalone Amazon EC2 instance and connect to a remote Amazon EMR cluster. Benefits of running RStudio on EC2 include the following:

  • Running RStudio Server on an EC2 instance, you can keep your scientific models and model artifacts on the instance. You might have to relaunch your EMR cluster to meet your application requirements. By running RStudio Server separately, you have more flexibility and don’t have to depend entirely on an Amazon EMR cluster.
  • Installing RStudio on the master node of Amazon EMR requires sharing of resources with the applications running on the same node. By running RStudio on a standalone Amazon EC2 instance, you can use resources as you need without having to share the resources with other applications.
  • You might have multiple Amazon EMR clusters in your environment. With RStudio on Edge node, you have the flexibility to connect to any EMR clusters in your environment.

There is one major difference between running RStudio Server on an Amazon EMR cluster vs. running it on a standalone Amazon EC2 instance. In the latter case, the instance needs to be configured as an Amazon EMR client (or edge node). By doing so, you can submit Apache Spark jobs and other Hadoop-based jobs from an instance other than EMR master node.

Click through for detailed, step-by-step instructions on how to do this.

Related Posts

Working With The Databricks API Via Powershell

Gerhard Brueckl has a Powershell module for interacting with Databricks, either Azure or AWS: As most of our deployments use PowerShell I wrote some cmdlets to easily work with the Databricks API in my scripts. These included managing clusters (create, start, stop, …), deploying content/notebooks, adding secrets, executing jobs/notebooks, etc. After some time I ended […]

Read More

Migrating A Database To Managed Instances

Frank Gill shows how to migrate a database from on-premises to an Azure SQL Managed Instance: If you have run through my last Managed Instance blog post, you have a Managed Instance at your disposal.  The PowerShell script for creating the network requirements also contains steps to create an Azure VM in a different subnet in […]

Read More


September 2018
« Aug Oct »