Press "Enter" to skip to content

Category: Hadoop

Tuning Kafka And Spark Data Pipelines

Larry Murdock explains the tuning options available to Kafka and Spark Streams:

Kafka is not the Ferrari of messaging middleware, rather it is the salt flats rocket car. It is fast, but don’t expect to find an AUX jack for your iPhone. Everything is stripped down for speed.

Compared to other messaging middleware, the core is simpler and handles fewer features. It is a transaction log and its job is to take the message you sent asynchronously and write it to disk as soon as possible, returning an acknowledgement once it is committed via an optional callback. You can force a degree of synchronicity by chaining a get to the send call, but that is kind of cheating Kafka’s intention. It does not send it on to a receiver. It only does pub-sub. It does not handle back pressure for you.

I like this as a high-level overview of the different options available.  Definitely gets a More Research Is Required tag, but this post helps you figure out where to go next.

Comments closed

The Hive Metastore In HDInsight

Ashish Thapliyal shows how to create a custom Hive metastore in HDInsight:

Custom Metastore – HDInsight lets you pick custom Metastore. It’s a recommended approach for production clusters due to number reasons such as

  • You bring your own Azure SQL database as Metastore

  • As lifecycle of Metastore is not tied to a cluster lifecycle, you can create and delete clusters without worrying about the metadata loss.

  • Custom Metastore lets you attach multiple clusters and cluster types to same Metastore. Example – Single Metastore can be shared across Interactive Hive, Hive and Spark clusters in HDInsight

  • You pay for the cost of Metastore (Azure SQL DB)

Read on to see how to do this.

Comments closed

Scala + Hadoop + HDInsight

Emmanouil Gkatziouras shows that you can run a Hadoop job written in Scala on Azure’s HDInsight:

Previously, we set up a Scala application in order to execute a simple word count on Hadoop.

What comes next is uploading our application to HDInsight. So, we shall proceed in creating a Hadoop cluster on HDInsight.

Read the whole thing, but the upshot is that Scala apps build jar files just like Java would, so there’s nothing special about running them.

Comments closed

Kinesis vs SQS

Kevin Sookocheff compares and contrasts Amazon’s Kinesis and SQS offerings:

Complicated Producer and Consumer Libraries

For maximum performance, Kinesis requires deploying producer and consumer libraries alongside your application. As a producer, you deploy a C++ binary with a Java interface for reading and writing data records to a Kinesis stream. As a consumer, you deploy a Java application that can communicate with other programming languages through an interface built on top of standard in and standard out. In either of these cases, adding new producers or consumers to a Kinesis stream presents some investment in development and maintenance.

Click through for the full comparison and figuring out where each fits.

Comments closed

Hadoop In The Trough Of Disillusionment

Alex Woodie has an article about companies moving away from Hadoop:

Instead of trying to fit all the barnyard animals into the name (Cutting suggested Hadoop + Hive + Hbase + Spark + all the others omnivores, as well as “Cutting Con,” which maybe actually would have worked), the conference organizers went back to the roots of the Strata conference in 2011.

(Note to self: it’s ALL about the data.)

That doesn’t mean Hadoop is irrelevant. We will need a place to land unstructured and semi-structured data. But when the biggest Hadoop distributor removes the name of Hadoop from its flagship conference, it’s clearly an indicator that things haven’t gone quite as expected.

I’ve seen several articles along these lines lately and couldn’t resist the Gartner callout.  I consider this a helpful antidote to the “Technology X will solve all your problems!” marketing nonsense, which followed the “Technology X will solve all my problems!” developer nonsense as developers find new and shiny toys.  People are realizing where Hadoop is a great solution and where it’s a bad solution, and the same goes for other technologies; my hope is that after another 9-12 months of “Is Hadoop doomed?” types of articles, it’ll settle out into a long-term growth pattern where people understand its appropriate uses.

Comments closed

R On Athena

Gopal Wunnava shows how to run R scripts using Amazon Athena as a data source:

Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.

Create a table in Athena from R using the GDELT dataset. This step can also be performed from the AWS management console as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”

This is an interesting use case for Athena.

Comments closed

Knox With Active Directory

Jon Morisi shows how to configure Knox to work with Active Directory:

I’ve recently been doing some work with Hadoop using the Hortonworks distribution.  Most recently I configured Knox to integrate with Active Directory.  The end goal was to be able to authenticate with Active Directory via Knox (a REST API Gateway) and then on to other services like Hive.  I also configured Knox to point to Zookeeper (HA service discovery) vs. Hive directly, but that’s really more detail than we need for integrating Knox with AD.

The Knox documentation is really good and very helpful:
https://knox.apache.org/books/knox-0-9-0/user-guide.html

Worth the read if you’re putting together a Hadoop cluster.

Comments closed

HDInsight Basics: Nodes

Abdullah Al Mahmood explains some of the basics of Azure HDInsight, including what Hadoop means by nodes:

HDInsight clusters consist of several virtual machines (nodes) serving different purposes. The most common architecture of an HDInsight cluster is – two head nodes, one or more worker nodes, and three zookeeper nodes.

Head nodes: Hadoop services are installed and run on head nodes. There are two head nodes to ensure high availability by allowing master services and components to continue to run on the secondary node in the event of a failure on the primary. Both head nodes are active and running within the cluster simultaneously. Some services, such as HDFS or YARN, are only ‘active’ on one head node at any given time (and ‘standby’ on the other head node). Other services such as HiveServer2 or Hive Metastore are active on both head nodes at the same time. There are services like Application Timeline Server (ATS) and Job History Server (JHS) which are installed on both head nodes but should run only on the head node where Ambari server is running. If these components sound unfamiliar, please revisit the article on Hadoop ecosystem in HDInsight.

Read on to see the other classes of nodes HDInsight uses.

Comments closed

Dr. Elephant: Where Does My Hadoop Cluster Hurt?

Carl Steinbach looks back at Dr. Elephant one year later:

What we needed to introduce to the job-tuning equation was a series of questions like those asked by a physician making a diagnosis: a step-by-step process that guides the user through the problem-solving process, while also educating them at the same time.

So we created Dr. Elephant, a system that automatically detects under-performing jobs, diagnoses the root cause, and guides the owner of the job through the treatment process. Dr. Elephant makes it easy to identify jobs that are wasting resources, as well as jobs that can achieve better performance without sacrificing efficiency. Perhaps most importantly, Dr. Elephant makes it easy to act on these insights by making job-level performance tuning accessible to users regardless of their previous skill level. In the process, Dr. Elephant has helped to ease the tension that previously existed between user productivity on one side and cluster efficiency on the other.

LinkedIn has made this project open source if you want to check it out in your environment.

Comments closed

TensorFlow With YARN

Wangda Tan and Vinod Kumar Vavilapalli show how to control TensorFlow jobs with YARN:

YARN has been used successfully to run all sorts of data applications. These applications can all coexist on a shared infrastructure managed through YARN’s centralized scheduling.

With TensorFlow, one can get started with deep learning without much knowledge about advanced math models and optimization algorithms.

If you have GPU-equipped hardware, and you want to run TensorFlow, going through the process of setting up hardware, installing the bits, and optionally also dealing with faults, scaling the app up and down etc. becomes cumbersome really fast. Instead, integrating TensorFlow to YARN allows us to seamlessly manage resources across machine learning / deep learning workloads and other YARN workloads like MapReduce, Spark, Hive, etc.

Read on for more details, including a demo video.

Comments closed