Hadoop In The Trough Of Disillusionment

Kevin Feasel



Alex Woodie has an article about companies moving away from Hadoop:

Instead of trying to fit all the barnyard animals into the name (Cutting suggested Hadoop + Hive + Hbase + Spark + all the others omnivores, as well as “Cutting Con,” which maybe actually would have worked), the conference organizers went back to the roots of the Strata conference in 2011.

(Note to self: it’s ALL about the data.)

That doesn’t mean Hadoop is irrelevant. We will need a place to land unstructured and semi-structured data. But when the biggest Hadoop distributor removes the name of Hadoop from its flagship conference, it’s clearly an indicator that things haven’t gone quite as expected.

I’ve seen several articles along these lines lately and couldn’t resist the Gartner callout.  I consider this a helpful antidote to the “Technology X will solve all your problems!” marketing nonsense, which followed the “Technology X will solve all my problems!” developer nonsense as developers find new and shiny toys.  People are realizing where Hadoop is a great solution and where it’s a bad solution, and the same goes for other technologies; my hope is that after another 9-12 months of “Is Hadoop doomed?” types of articles, it’ll settle out into a long-term growth pattern where people understand its appropriate uses.

R On Athena

Kevin Feasel


Cloud, Hadoop, R

Gopal Wunnava shows how to run R scripts using Amazon Athena as a data source:

Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.

Create a table in Athena from R using the GDELT dataset. This step can also be performed from the AWS management console as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”

This is an interesting use case for Athena.

Knox With Active Directory

Jon Morisi shows how to configure Knox to work with Active Directory:

I’ve recently been doing some work with Hadoop using the Hortonworks distribution.  Most recently I configured Knox to integrate with Active Directory.  The end goal was to be able to authenticate with Active Directory via Knox (a REST API Gateway) and then on to other services like Hive.  I also configured Knox to point to Zookeeper (HA service discovery) vs. Hive directly, but that’s really more detail than we need for integrating Knox with AD.

The Knox documentation is really good and very helpful:

Worth the read if you’re putting together a Hadoop cluster.

HDInsight Basics: Nodes

Abdullah Al Mahmood explains some of the basics of Azure HDInsight, including what Hadoop means by nodes:

HDInsight clusters consist of several virtual machines (nodes) serving different purposes. The most common architecture of an HDInsight cluster is – two head nodes, one or more worker nodes, and three zookeeper nodes.

Head nodes: Hadoop services are installed and run on head nodes. There are two head nodes to ensure high availability by allowing master services and components to continue to run on the secondary node in the event of a failure on the primary. Both head nodes are active and running within the cluster simultaneously. Some services, such as HDFS or YARN, are only ‘active’ on one head node at any given time (and ‘standby’ on the other head node). Other services such as HiveServer2 or Hive Metastore are active on both head nodes at the same time. There are services like Application Timeline Server (ATS) and Job History Server (JHS) which are installed on both head nodes but should run only on the head node where Ambari server is running. If these components sound unfamiliar, please revisit the article on Hadoop ecosystem in HDInsight.

Read on to see the other classes of nodes HDInsight uses.

Dr. Elephant: Where Does My Hadoop Cluster Hurt?

Carl Steinbach looks back at Dr. Elephant one year later:

What we needed to introduce to the job-tuning equation was a series of questions like those asked by a physician making a diagnosis: a step-by-step process that guides the user through the problem-solving process, while also educating them at the same time.

So we created Dr. Elephant, a system that automatically detects under-performing jobs, diagnoses the root cause, and guides the owner of the job through the treatment process. Dr. Elephant makes it easy to identify jobs that are wasting resources, as well as jobs that can achieve better performance without sacrificing efficiency. Perhaps most importantly, Dr. Elephant makes it easy to act on these insights by making job-level performance tuning accessible to users regardless of their previous skill level. In the process, Dr. Elephant has helped to ease the tension that previously existed between user productivity on one side and cluster efficiency on the other.

LinkedIn has made this project open source if you want to check it out in your environment.

TensorFlow With YARN

Wangda Tan and Vinod Kumar Vavilapalli show how to control TensorFlow jobs with YARN:

YARN has been used successfully to run all sorts of data applications. These applications can all coexist on a shared infrastructure managed through YARN’s centralized scheduling.

With TensorFlow, one can get started with deep learning without much knowledge about advanced math models and optimization algorithms.

If you have GPU-equipped hardware, and you want to run TensorFlow, going through the process of setting up hardware, installing the bits, and optionally also dealing with faults, scaling the app up and down etc. becomes cumbersome really fast. Instead, integrating TensorFlow to YARN allows us to seamlessly manage resources across machine learning / deep learning workloads and other YARN workloads like MapReduce, Spark, Hive, etc.

Read on for more details, including a demo video.


Kevin Feasel


Hadoop, R, Spark

John Mount shows off replyr, which is dplyr for remote, distributed data sets (think SparkR or sparklyr):

Suppose we had a large data set hosted on a Spark cluster that we wished to work with using dplyr and sparklyr (for this article we will simulate such using data loaded into Spark from the nycflights13 package).

We will work a trivial example: taking a quick peek at your data. The analyst should always be able to and willing to look at the data.

It is easy to look at the top of the data, or any specific set of rows of the data.

Read on for more details.


Kevin Feasel



Jiang Mouren has a two-parter on WebHCat.  First, how it works:

SSH shell/Oozie hive action directly interact with YARN for HIVE execution where as Program using HdInsight Jobs SDK/ADF (Azure Data Factory) uses WebHCat REST interface to submit the jobs.

WebHCat is a REST interface for remote jobs (Hive, Pig, Scoop, MapReduce) execution. WebHCat translates the job submission requests into YARN applications and reports the status based on the YARN application status. WebHCat results are coming from YARN and troubleshooting some of them needs to go to YARN.

Then, how to debug issues:

2.1.2. WebHCat times out

HDInsight Gateway times out responses which take longer than 2Minutes resulting in “502 BadGateway”. WebHCat queries YARN services for job status and if they take longer than the request might timeout.

When this happens collect the following logs for further investigation:

/var/log/webchat. Typical contents of directory will be like

  • webhcat.log is the log4j log to which server writes logs
  • webhcat-console.log is stdout of server is started.
  • webhcat-console-error.log is stderr of server process

NOTE: webhcat.log will roll-over daily hence files like webhcat.log.YYYY-MM-DD will also present. For logs to a specific time range make sure that appropriate file is selected.

Because HDInsight doesn’t support WebHDFS, WebHCat is the primary method for cluster access, so it’s good to know.

Garbage Collection In Hadoop

Ranjan Banerjee explains how the Java garbage collector works, using Hadoop as an example:

The reason why we all love Java is due to the fact that we can be careless with memory creations and the work of cleaning the mess is performed by the JVM. On a high level, Java heap memory is classified into two phases:

1) Young (eden) space

2)Old space

The eden space is where newly created objects goto. There are various algorithms for garbage collection, but all of them try to first free memory from the young space and for those long lasting memory objects, they are transferred to the old space.

One common issue that can be noticed in running Map Reduce Applications are GC overhead limit exceeded.

Read on for more, including where you can find GC logs.

Data Lake 3.0

Vinod Kumar Vavilapalli describes the modern data lake:

During the past few years though, end-to-end business use-cases have evolved to another level.

  • The end-to-end business problems are now mostly solved by multiple applications working together.
  • As the platform matured, users have increasingly started wanting to solely focus on the business application layers, and getting impatient to get on with developing their main business-logic.
  • However, YARN, and for that matter any other related platform, hasn’t catered to this evolving need, leaving the users to unwillingly get involved in the painstaking details of wiring applications together, keeping them up, manually scaling them as need arises etc.

Manual plumbing of all these different colored services in tiresome! Further, there is a clear need for seamless aggregate deployment, lifecycle management and application wireup. This is the gap that needs to be bridged between what these end-to-end business use-cases need from the platform and what the platform offers today. If these features are provided, then the business use cases authors can singularly focus on the business logic.

This is a higher-level “where are we at?” kind of post which could be helpful if you’re new to the data lake concept.


May 2017
« Apr