Scala + Hadoop + HDInsight

Emmanouil Gkatziouras shows that you can run a Hadoop job written in Scala on Azure’s HDInsight:

Previously, we set up a Scala application in order to execute a simple word count on Hadoop.

What comes next is uploading our application to HDInsight. So, we shall proceed in creating a Hadoop cluster on HDInsight.

Read the whole thing, but the upshot is that Scala apps build jar files just like Java would, so there’s nothing special about running them.

Kinesis vs SQS

Kevin Sookocheff compares and contrasts Amazon’s Kinesis and SQS offerings:

Complicated Producer and Consumer Libraries

For maximum performance, Kinesis requires deploying producer and consumer libraries alongside your application. As a producer, you deploy a C++ binary with a Java interface for reading and writing data records to a Kinesis stream. As a consumer, you deploy a Java application that can communicate with other programming languages through an interface built on top of standard in and standard out. In either of these cases, adding new producers or consumers to a Kinesis stream presents some investment in development and maintenance.

Click through for the full comparison and figuring out where each fits.

Hadoop In The Trough Of Disillusionment

Alex Woodie has an article about companies moving away from Hadoop:

Instead of trying to fit all the barnyard animals into the name (Cutting suggested Hadoop + Hive + Hbase + Spark + all the others omnivores, as well as “Cutting Con,” which maybe actually would have worked), the conference organizers went back to the roots of the Strata conference in 2011.

(Note to self: it’s ALL about the data.)

That doesn’t mean Hadoop is irrelevant. We will need a place to land unstructured and semi-structured data. But when the biggest Hadoop distributor removes the name of Hadoop from its flagship conference, it’s clearly an indicator that things haven’t gone quite as expected.

I’ve seen several articles along these lines lately and couldn’t resist the Gartner callout.  I consider this a helpful antidote to the “Technology X will solve all your problems!” marketing nonsense, which followed the “Technology X will solve all my problems!” developer nonsense as developers find new and shiny toys.  People are realizing where Hadoop is a great solution and where it’s a bad solution, and the same goes for other technologies; my hope is that after another 9-12 months of “Is Hadoop doomed?” types of articles, it’ll settle out into a long-term growth pattern where people understand its appropriate uses.

R On Athena

Gopal Wunnava shows how to run R scripts using Amazon Athena as a data source:

Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.

Create a table in Athena from R using the GDELT dataset. This step can also be performed from the AWS management console as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”

This is an interesting use case for Athena.

Knox With Active Directory

Jon Morisi shows how to configure Knox to work with Active Directory:

I’ve recently been doing some work with Hadoop using the Hortonworks distribution.  Most recently I configured Knox to integrate with Active Directory.  The end goal was to be able to authenticate with Active Directory via Knox (a REST API Gateway) and then on to other services like Hive.  I also configured Knox to point to Zookeeper (HA service discovery) vs. Hive directly, but that’s really more detail than we need for integrating Knox with AD.

The Knox documentation is really good and very helpful:

Worth the read if you’re putting together a Hadoop cluster.

HDInsight Basics: Nodes

Abdullah Al Mahmood explains some of the basics of Azure HDInsight, including what Hadoop means by nodes:

HDInsight clusters consist of several virtual machines (nodes) serving different purposes. The most common architecture of an HDInsight cluster is – two head nodes, one or more worker nodes, and three zookeeper nodes.

Head nodes: Hadoop services are installed and run on head nodes. There are two head nodes to ensure high availability by allowing master services and components to continue to run on the secondary node in the event of a failure on the primary. Both head nodes are active and running within the cluster simultaneously. Some services, such as HDFS or YARN, are only ‘active’ on one head node at any given time (and ‘standby’ on the other head node). Other services such as HiveServer2 or Hive Metastore are active on both head nodes at the same time. There are services like Application Timeline Server (ATS) and Job History Server (JHS) which are installed on both head nodes but should run only on the head node where Ambari server is running. If these components sound unfamiliar, please revisit the article on Hadoop ecosystem in HDInsight.

Read on to see the other classes of nodes HDInsight uses.

Dr. Elephant: Where Does My Hadoop Cluster Hurt?

Carl Steinbach looks back at Dr. Elephant one year later:

What we needed to introduce to the job-tuning equation was a series of questions like those asked by a physician making a diagnosis: a step-by-step process that guides the user through the problem-solving process, while also educating them at the same time.

So we created Dr. Elephant, a system that automatically detects under-performing jobs, diagnoses the root cause, and guides the owner of the job through the treatment process. Dr. Elephant makes it easy to identify jobs that are wasting resources, as well as jobs that can achieve better performance without sacrificing efficiency. Perhaps most importantly, Dr. Elephant makes it easy to act on these insights by making job-level performance tuning accessible to users regardless of their previous skill level. In the process, Dr. Elephant has helped to ease the tension that previously existed between user productivity on one side and cluster efficiency on the other.

LinkedIn has made this project open source if you want to check it out in your environment.

TensorFlow With YARN

Wangda Tan and Vinod Kumar Vavilapalli show how to control TensorFlow jobs with YARN:

YARN has been used successfully to run all sorts of data applications. These applications can all coexist on a shared infrastructure managed through YARN’s centralized scheduling.

With TensorFlow, one can get started with deep learning without much knowledge about advanced math models and optimization algorithms.

If you have GPU-equipped hardware, and you want to run TensorFlow, going through the process of setting up hardware, installing the bits, and optionally also dealing with faults, scaling the app up and down etc. becomes cumbersome really fast. Instead, integrating TensorFlow to YARN allows us to seamlessly manage resources across machine learning / deep learning workloads and other YARN workloads like MapReduce, Spark, Hive, etc.

Read on for more details, including a demo video.


John Mount shows off replyr, which is dplyr for remote, distributed data sets (think SparkR or sparklyr):

Suppose we had a large data set hosted on a Spark cluster that we wished to work with using dplyr and sparklyr (for this article we will simulate such using data loaded into Spark from the nycflights13 package).

We will work a trivial example: taking a quick peek at your data. The analyst should always be able to and willing to look at the data.

It is easy to look at the top of the data, or any specific set of rows of the data.

Read on for more details.


Jiang Mouren has a two-parter on WebHCat.  First, how it works:

SSH shell/Oozie hive action directly interact with YARN for HIVE execution where as Program using HdInsight Jobs SDK/ADF (Azure Data Factory) uses WebHCat REST interface to submit the jobs.

WebHCat is a REST interface for remote jobs (Hive, Pig, Scoop, MapReduce) execution. WebHCat translates the job submission requests into YARN applications and reports the status based on the YARN application status. WebHCat results are coming from YARN and troubleshooting some of them needs to go to YARN.

Then, how to debug issues:

2.1.2. WebHCat times out

HDInsight Gateway times out responses which take longer than 2Minutes resulting in “502 BadGateway”. WebHCat queries YARN services for job status and if they take longer than the request might timeout.

When this happens collect the following logs for further investigation:

/var/log/webchat. Typical contents of directory will be like

  • webhcat.log is the log4j log to which server writes logs
  • webhcat-console.log is stdout of server is started.
  • webhcat-console-error.log is stderr of server process

NOTE: webhcat.log will roll-over daily hence files like webhcat.log.YYYY-MM-DD will also present. For logs to a specific time range make sure that appropriate file is selected.

Because HDInsight doesn’t support WebHDFS, WebHCat is the primary method for cluster access, so it’s good to know.


March 2017
« Feb