Building A Multi-Node Hadoop Cluster With Spark

Rao Swati has a step-by-step guide to setting up a multi-node cluster with Hadoop 2.7.3 and Spark 1.6.2:

Important Notes:

  1. start-dfs.sh will start the NameNode, the SecondaryNameNode, and a DataNode on the master, plus a DataNode on every slave node.
  2. start-yarn.sh will start the ResourceManager and a NodeManager on the master node, plus a NodeManager on each slave.
  3. Run hadoop namenode -format only once; otherwise you will get an incompatible clusterIDs exception. To resolve this error, clear the temporary data location for the DataNode, i.e., remove the files present in the $HADOOP_HOME/dfs/name/data folder.
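The clusterID mismatch in the third note can be checked before starting the daemons. HDFS storage directories keep a properties-style VERSION file under a current/ subdirectory, and that file contains a clusterID entry. The following is a minimal sketch, not part of the original guide: the function names are hypothetical, and the directory arguments are assumed to be whatever dfs.namenode.name.dir and dfs.datanode.data.dir point to on your nodes.

```python
from pathlib import Path
from typing import Optional


def read_cluster_id(version_file: Path) -> Optional[str]:
    """Return the clusterID value from an HDFS VERSION file, or None if missing."""
    if not version_file.is_file():
        return None
    for line in version_file.read_text().splitlines():
        # VERSION is a Java properties file: key=value, with '#' comment lines.
        key, sep, value = line.partition("=")
        if sep and key.strip() == "clusterID":
            return value.strip()
    return None


def cluster_ids_match(namenode_dir: Path, datanode_dir: Path) -> bool:
    """True only if both directories record the same, non-empty clusterID.

    Both arguments are assumptions: pass the storage directories that your
    own hdfs-site.xml configures for the NameNode and the DataNode.
    """
    nn_id = read_cluster_id(namenode_dir / "current" / "VERSION")
    dn_id = read_cluster_id(datanode_dir / "current" / "VERSION")
    return nn_id is not None and nn_id == dn_id
```

If cluster_ids_match returns False after a second format, the DataNode directory still holds the old ID, which is the situation the note resolves by clearing that directory.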

If you’d like to set up your own Hadoop cluster rather than using one of the big vendors (Hortonworks, Cloudera, MapR) or a PaaS solution like HDInsight or Elastic MapReduce, this will give you a head start.
