What’s New In Hadoop 3.1?

Kevin Feasel



Wangda Tan, et al., look at some of the new features in Apache Hadoop 3.1:

The diagram below captures the building blocks together at a high level. If you have to tie this back to a fictitious self-flying drone company, the company will collect tons of raw images from the test drones’ built-in cameras for computer vision. Those images can be stored in the Apache Hadoop data lake in a cost-effective (with erasure coding) yet highly available manner (multiple standby namenodes). Instead of providing GPU machines to each of the data scientists, GPU cards are pooled across the cluster for access by multiple data scientists. GPU cards in each server can be isolated for sharing between multiple users.
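To put a rough number on the "cost-effective" claim, here is a minimal sketch comparing raw disk usage under replication and under erasure coding. It assumes 3-way replication and a Reed-Solomon layout with 6 data and 3 parity blocks; the excerpt states neither figure, and the 100 TB data set is hypothetical.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte when every block is copied `replicas` times."""
    return float(replicas)


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte for a data+parity striped layout."""
    return (data_units + parity_units) / data_units


# Hypothetical size of the drone image archive, in terabytes.
logical_tb = 100

# Assumption: 3 full copies of every block.
replicated_tb = logical_tb * replication_overhead(3)

# Assumption: Reed-Solomon with 6 data blocks and 3 parity blocks.
erasure_coded_tb = logical_tb * erasure_coding_overhead(6, 3)

print(f"3x replication: {replicated_tb} TB raw")   # 300.0 TB raw
print(f"RS(6,3) coding: {erasure_coded_tb} TB raw")  # 150.0 TB raw
```

Under these assumptions, erasure coding halves the raw storage while still tolerating the loss of any three blocks in a stripe, at the cost of extra work to reconstruct data after a failure.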

Support of Docker containerized workloads means that data scientists/data engineers can bring the deep learning frameworks to the Apache Hadoop data lake and there is no need to have a separate compute/GPU cluster. GPU pooling allows the application of the deep learning neural network algorithms and the training of the data-intensive models using the data collected in the data lake at a speed almost 100x faster than regular CPUs.

If the customer wants to pool the FPGA (field programmable gate array) resources instead of GPUs, this is also possible in Apache Hadoop 3.1. Additionally, use of affinity and anti-affinity labels allows us to control how we deploy the microservices in the clusters — some of the components can be set to have anti-affinity so that they are always deployed in separate physical servers.
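The anti-affinity idea can be sketched with a toy placement routine. This is not the YARN scheduler or its API; the function, node names, and tags below are all hypothetical, and the only rule modeled is that two containers with the same tag never share a node.

```python
from typing import Dict, List, Set, Tuple


def place_with_anti_affinity(
    containers: List[Tuple[str, str]], nodes: List[str]
) -> Dict[str, str]:
    """Assign each (name, tag) container to the first node not already holding that tag."""
    tags_on_node: Dict[str, Set[str]] = {node: set() for node in nodes}
    placement: Dict[str, str] = {}
    for name, tag in containers:
        # Pick the first node that has no container with the same tag.
        target = next((n for n in nodes if tag not in tags_on_node[n]), None)
        if target is None:
            raise RuntimeError(f"no node satisfies anti-affinity for {name}")
        tags_on_node[target].add(tag)
        placement[name] = target
    return placement


# Hypothetical three-node cluster and a small set of tagged containers.
nodes = ["node1", "node2", "node3"]
containers = [("zk-0", "zk"), ("zk-1", "zk"), ("zk-2", "zk"), ("web-0", "web")]

placement = place_with_anti_affinity(containers, nodes)
print(placement)
```

Each "zk" container lands on a different server, while "web-0" is free to share node1 because it carries a different tag. A fourth "zk" container would raise an error, since no node could satisfy the constraint.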

It’s interesting to see Hadoop evolve over time as the ecosystem solves more real-time problems instead of focusing on giant batch problems.


