YARN Fundamentals

Kevin Feasel

2018-06-25

Hadoop

Anushree Subramaniam gives us a primer on Apache YARN, the resource manager which drives Hadoop:

In Hadoop version 1.0 which is also referred to as MRV1(MapReduce Version 1), MapReduce performed both processing and resource management functions. It consisted of a Job Tracker which was the single master. The Job Tracker allocated the resources, performed scheduling and monitored the processing jobs. It assigned map and reduce tasks on a number of subordinate processes called the Task Trackers. The Task Trackers periodically reported their progress to the Job Tracker.

This design resulted in scalability bottleneck due to a single Job Tracker. IBM mentioned in its article that according to Yahoo!, the practical limits of such a design are reached with a cluster of 5000 nodes and 40,000 tasks running concurrently. Apart from this limitation, the utilization of computational resources is inefficient in MRV1. Also, the Hadoop framework became limited only to MapReduce processing paradigm.

To overcome all these issues, YARN was introduced in Hadoop version 2.0 in the year 2012 by Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over the responsibility of Resource Management and Job Scheduling. YARN started to give Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.

There’s a lot of depth to YARN.

Related Posts

Testing Kafka Streams Applications

Yeva Byzek continues her series on testing Kafka-based streaming applications: When you create a stream processing application with Kafka’s Streams API, you create a Topologyeither using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing though, connecting to a running […]

Read More

Auto ML With SQL Server 2019 Big Data Clusters

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem: H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, […]

Read More

Categories

June 2018
MTWTFSS
« May Jul »
 123
45678910
11121314151617
18192021222324
252627282930