Category: Hadoop

YARN Fundamentals

Published 2018-06-25 by Kevin Feasel

Anushree Subramaniam gives us a primer on Apache YARN, the resource manager which drives Hadoop:

In Hadoop version 1.0, which is also referred to as MRv1 (MapReduce Version 1), MapReduce performed both processing and resource management functions. It consisted of a Job Tracker, which was the single master. The Job Tracker allocated the resources, performed scheduling, and monitored the processing jobs. It assigned map and reduce tasks to a number of subordinate processes called the Task Trackers. The Task Trackers periodically reported their progress to the Job Tracker.

This design resulted in a scalability bottleneck due to the single Job Tracker. IBM mentioned in its article that, according to Yahoo!, the practical limits of such a design are reached with a cluster of 5000 nodes and 40,000 tasks running concurrently. Apart from this limitation, the utilization of computational resources is inefficient in MRv1. Also, the Hadoop framework became limited to the MapReduce processing paradigm.

To overcome all these issues, YARN was introduced in Hadoop version 2.0 in the year 2012 by Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over the responsibility of Resource Management and Job Scheduling. YARN started to give Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.

There’s a lot of depth to YARN.

HDP 3.0 Released

Published 2018-06-19 by Kevin Feasel

Roni Fontaine and Saumitra Buragohain announce Hortonworks Data Platform version 3.0:

Other additional capabilities include:

Scalability and availability with NameNode federation, allowing customers to scale to thousands of nodes and a billion files. Higher availability with multiple NameNodes and standby capabilities allows for uninterrupted, continuous cluster operations if a NameNode goes down.
Lower total cost of ownership with erasure coding, providing a data protection method that up to this point has mostly been found in object stores. Hadoop 3 will no longer default to storing three full copies of each piece of data across its clusters. Instead of that 3x hit on storage, the erasure coding method in Hadoop 3 will incur an overhead of 1.5x while maintaining the same level of data recoverability from disk failure. The end result is that the same data consumes half as much raw storage.
Real-time database, delivering improved query optimization to process more data at a faster rate by eliminating the performance gap between low-latency and high-throughput workloads. Enabled via Apache Hive 3.0, HDP 3.0 offers the only unified SQL solution that can seamlessly combine real-time & historical data, making both available for deep SQL analytics. New features such as workload management enable fine-grained resource allocation, so there is no need to worry about resource competition. Materialized views pre-compute and cache intermediate tables as views that the query optimizer automatically leverages, drastically improving performance. The end result is faster time to insights.
Data science performance improvements around Apache Spark and Apache Hive integration. HDP 3.0 provides seamless Spark integration to the cloud. A containerized TensorFlow technical preview, combined with GPU pooling, delivers a deep learning framework that makes deep learning faster and easier.

Looks like it’s invite-only at the moment, but that should change pretty soon. It also looks like I’ve got a new weekend project…
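
The 3x versus 1.5x figures in the excerpt are easy to check by hand. The sketch below assumes a Reed-Solomon layout of six data blocks plus three parity blocks; the excerpt does not name the policy, so treat that layout and the function names as illustrative assumptions rather than a description of HDP itself.

```python
def storage_multiplier_replication(copies: int) -> float:
    """Raw bytes stored per logical byte when keeping full copies."""
    return float(copies)


def storage_multiplier_erasure(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte with data plus parity blocks."""
    return (data_units + parity_units) / data_units


replication = storage_multiplier_replication(3)  # three full copies
erasure = storage_multiplier_erasure(6, 3)       # assumed 6 data + 3 parity layout

# Fraction of raw storage saved by moving from replication to erasure coding.
savings = 1 - erasure / replication

print(f"replication: {replication:.1f}x, erasure coding: {erasure:.1f}x")
print(f"raw storage saved: {savings:.0%}")
```

Under these assumptions the same data takes half the raw disk space, which is the saving the announcement describes.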

Metacat: Federated Metadata Discovery

Published 2018-06-18 by Kevin Feasel

Ajoy Majumdar and Zhen Li walk us through Metacat:

The core architecture of the big data platform at Netflix involves three key services. These are the execution service (Genie), the metadata service, and the event service. These ideas are not unique to Netflix, but rather a reflection of the architecture that we felt would be necessary to build a system not only for the present, but for the future scale of our data infrastructure.

Many years back, when we started building the platform, we adopted Pig as our ETL language and Hive as our ad-hoc querying language. Since Pig did not natively have a metadata system, it seemed ideal for us to build one that could interoperate between both.

Thus Metacat was born, a system that acts as a federated metadata access layer for all data stores we support. It is a centralized service that our various compute engines could use to access the different data sets. In general, Metacat serves three main objectives:

Federated views of metadata systems

Unified API for metadata about datasets

Arbitrary business and user metadata storage of datasets

It is worth noting that other companies that have large and distributed data sets also have similar challenges. Apache Atlas, Twitter’s Data Abstraction Layer, and LinkedIn’s WhereHows (Data Discovery at LinkedIn), to name a few, are built to tackle similar problems, but in the context of the respective architectural choices of the companies.

If you’re interested, also check out their GitHub repo.

Understanding A Spark Streaming Workflow

Published 2018-06-18 by Kevin Feasel

Himanshu Gupta continues a series on structured streaming using Spark Streaming:

Here we can clearly see that if new data is pushed to the source, Spark will run the “incremental” query that combines the previous running counts with the new data to compute updated counts. The “Input Table” here is the lines DataFrame which acts as a streaming input for wordCounts DataFrame.

Now, the only unknown thing in the above diagram is “Complete Mode”. It is one of the 3 output modes available in Structured Streaming. Since they are an important part of Structured Streaming, let’s read about them in detail:

Complete Mode – This mode writes the entire updated Result Table to the sink.
Append Mode – In this mode, only the new rows appended to the Result Table since the last trigger are sent to the sink.
Update Mode – This mode outputs only the rows in the Result Table that have changed since the last trigger, and only those rows are sent to the sink. This is what distinguishes it from Complete Mode, which outputs the whole table. If the query doesn’t contain any aggregations, it is equivalent to Append Mode.

Check it out.
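
To make the difference between the modes concrete, here is a small simulation in plain Python. It is not Spark code: `run_word_count` and the sample batches are hypothetical, and each inner list stands in for the lines that arrive before one trigger. Append Mode is left out because a running count keeps changing rows that were already emitted, which is the case Spark's append mode is not designed for without watermarking.

```python
from collections import Counter


def run_word_count(batches, mode):
    """Simulate a running word count and return what each trigger emits."""
    result_table = Counter()
    emitted = []
    for lines in batches:
        before = dict(result_table)  # snapshot of the previous trigger
        for line in lines:
            result_table.update(line.split())
        if mode == "complete":
            # The whole Result Table goes to the sink on every trigger.
            out = dict(result_table)
        elif mode == "update":
            # Only rows whose count changed since the last trigger.
            out = {w: c for w, c in result_table.items() if before.get(w) != c}
        else:
            raise ValueError(f"unsupported mode: {mode}")
        emitted.append(out)
    return emitted


batches = [["cat dog", "dog dog"], ["owl cat"]]
complete_out = run_word_count(batches, "complete")
update_out = run_word_count(batches, "update")

print("complete:", complete_out)
print("update:  ", update_out)
```

In the second trigger, the "dog" row is unchanged, so the update-style output omits it while the complete-style output repeats it.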

Calculating TF-IDF Using Apache Spark

Published 2018-06-18 by Kevin Feasel

Arseniy Tashoyan shows us how to calculate Term Frequency-Inverse Document Frequency using Apache Spark:

TF-IDF is used in a large variety of applications. Typical use cases include:

Document search.

Document tagging.

Text preprocessing and feature vector engineering for Machine Learning algorithms.

There is a vast number of resources on the web explaining the concept itself and the calculation algorithm. This article does not repeat the information in these other Internet resources; it just illustrates TF-IDF calculation with the help of Apache Spark. Emml Asimadi, in his excellent article Understanding TF-IDF, shares an approach based on the old Spark RDD and the Python language. This article, on the other hand, uses the modern Spark SQL API and the Scala language.

Although Spark MLlib has an API to calculate TF-IDF, this API is not convenient for learning the concept. MLlib tools are intended to generate feature vectors for ML algorithms. There is no way to figure out the weight for a particular term in a particular document. So let’s build it from scratch; this will sharpen our skills.

Read on for the solution. It seems that there tend to be better options today than TF-IDF for natural language problems, but it’s an easy algorithm to understand, so it’s useful as a first go.
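
For a feel of what "the weight for a particular term in a particular document" means, here is a minimal from-scratch sketch in plain Python rather than Spark. It assumes whitespace tokenization, raw counts for term frequency, and a smoothed inverse document frequency of log((N + 1) / (df + 1)); the linked article may choose a different variant, and the sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} for a dict of doc_id -> text."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # raw term counts within this document
        weights[doc_id] = {
            term: count * math.log((n_docs + 1) / (df[term] + 1))
            for term, count in tf.items()
        }
    return weights


docs = {
    "d1": "spark makes big data simple",
    "d2": "hive runs sql on big data",
    "d3": "spark runs on yarn",
}
weights = tf_idf(docs)

for term, weight in sorted(weights["d1"].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Terms that appear in only one document, such as "simple", end up with a higher weight in that document than terms shared across several, such as "spark".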

Running Hive LLAP As A YARN Service

Published 2018-06-15 by Kevin Feasel

Gour Saha, et al., demonstrate running Apache Hive LLAP as a YARN service:

Making LLAP a first-class YARN service also enables us to use some of the other powerful features in YARN that were added in Apache Hadoop 3.0 / 3.1. Some of them are noted below.

Advanced container placement scheduling such as affinity and anti-affinity. What Slider used to handle in a custom way is now a core first-class feature (YARN-6592).
Rich APIs for users to fetch/query application details using Timeline Service V2 (YARN-2928 and YARN-5355).
New and improved Services UI in YARN UI2 improving debuggability and log access.
Continuous rolling log aggregation of long running containers (YARN-2443).
Auto-restart of containers by NodeManagers (YARN-4725).
Windowing and threshold based container health monitor (YARN-8122).
In the future, we can also leverage YARN level rolling upgrades for containers and the service as a whole (YARN-7512 and YARN-4726).

Looks like it’s been a fruitful transition.

Flattening JSON Data With Databricks

Published 2018-06-11 by Kevin Feasel

Ivan Vazharov gives us a Databricks notebook to parse and flatten JSON using PySpark:

With Databricks you get:

An easy way to infer the JSON schema and avoid creating it manually

Subtle changes in the JSON schema won’t break things

The ability to explode nested lists into rows in a very easy way (see the Notebook below)

Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe.

Click through for the notebook.
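
The shape of the problem, a list of car models within a list of vendors within a list of people, can be sketched without a cluster. The example below is plain Python, not the notebook's PySpark, and the field names and sample data are invented for illustration. Each nested loop plays the role that one level of exploding a list into rows would play in a dataframe.

```python
import json

# Hypothetical sample: people -> vendors -> models.
raw = """
{"people": [
  {"name": "Ann", "vendors": [
    {"vendor": "Acme", "models": ["A1", "A2"]},
    {"vendor": "Bolt", "models": ["B1"]}
  ]},
  {"name": "Bob", "vendors": [
    {"vendor": "Acme", "models": ["A3"]}
  ]}
]}
"""


def flatten(document):
    """Yield one flat row per (person, vendor, model) combination."""
    for person in document["people"]:
        for vendor in person["vendors"]:
            for model in vendor["models"]:
                yield {
                    "name": person["name"],
                    "vendor": vendor["vendor"],
                    "model": model,
                }


rows = list(flatten(json.loads(raw)))
for row in rows:
    print(row)
```

Three levels of nesting become one flat table with a row per model, which is the result the notebook is after.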

Apache Pulsar 2.0 Released

Published 2018-06-08 by Kevin Feasel

George Leopold reports on a new version of Apache Pulsar:

The startup’s Apache Pulsar 2.0 released on Wednesday (June 6) adds new functionality designed to move data users “beyond batch” processing. Among them is a “stream-native” processing capability called Pulsar Functions designed to apply analytics to data as it flows through the Pulsar platform. Processing functions can be written in either Java or Python, the company said.

Functions debuted earlier this year as a preview feature; Streamlio announced its general availability this week as part of the 2.0 release.

Another is a Pulsar enhancement developed in conjunction with Apache BookKeeper, a scalable storage system. Streamlio said the new feature, called Topic Compaction, delivers streaming data storage designed to improve the performance of applications consuming data from Pulsar. It serves as a “broker” that builds a snapshot of the latest value for each topic key, the startup said.

Read the whole thing.
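
The idea of "a snapshot of the latest value for each topic key" can be illustrated in a few lines. This is a conceptual sketch in plain Python, not the Pulsar API: `compact` and the sample messages are hypothetical, and a topic is modeled as an ordered list of (key, value) pairs.

```python
def compact(messages):
    """Return the latest value seen for each key in an ordered message log."""
    latest = {}
    for key, value in messages:
        latest[key] = value  # later messages overwrite earlier ones
    return latest


# Hypothetical topic contents, oldest message first.
topic = [
    ("sensor-1", 20.5),
    ("sensor-2", 18.0),
    ("sensor-1", 21.0),
    ("sensor-3", 19.2),
    ("sensor-2", 18.4),
]

snapshot = compact(topic)
print(snapshot)
```

A consumer that only needs current state can read the three-entry snapshot instead of replaying all five messages, which is where the performance benefit would come from.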

Updating Hive Tables

Published 2018-06-07 by Kevin Feasel

Carter Shanklin gives us a few patterns for updating tables in Hive:

Historically, keeping data up-to-date in Apache Hive required custom application development that is complex, non-performant, and difficult to maintain. HDP 2.6 radically simplifies data maintenance with the introduction of SQL MERGE in Hive, complementing existing INSERT, UPDATE, and DELETE capabilities.

This article shows how to solve common data management problems, including:

Hive upserts, to synchronize Hive data with a source RDBMS.
Update the partition where data lives in Hive.
Selectively mask or purge data in Hive.

This isn’t the Hive of 2013; it’s much closer to a real-time warehouse.
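
The upsert pattern behind MERGE is simple to state: rows that match on a key are updated, and rows that do not match are inserted. The sketch below models that logic in plain Python over lists of dictionaries. It is not Hive code, and the `merge` helper, the `id` key, and the sample rows are assumptions made for illustration.

```python
def merge(target, source, key="id"):
    """Apply upsert semantics: update matched rows, insert unmatched ones."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)  # analogous to WHEN MATCHED THEN UPDATE
        else:
            merged[row[key]] = dict(row)  # analogous to WHEN NOT MATCHED THEN INSERT
    return list(merged.values())


# Hypothetical warehouse table and a fresh extract from the source RDBMS.
target = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
source = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]

result = merge(target, source)
print(result)
```

Row 2 is updated in place and row 3 is added, while row 1 is left untouched, all in a single pass over the source.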

Event Hub Performance Tips

Published 2018-06-07 by Kevin Feasel

Vincent-Philippe Lauzon has a few tips for improving Azure Event Hub performance:

Here are some recommendations in the light of the performance and throughput results:

If we send many events: always reuse connections, i.e. do not create a connection only for one event. This is valid for both AMQP and HTTP. A simple Connection Pool pattern makes this easy.

If we send many events & throughput is a concern: use AMQP.

If we send few events and latency is a concern: use HTTP / REST.

If events naturally come in batches: use the batch API.

If events do not naturally come in batches: simply stream events. Do not try to batch them unless network IO is constrained.

If a latency of 0.1 seconds is a concern: move the call to Event Hubs away from your critical performance path.

Let’s now look at the tests we did to come up with those recommendations.

Read the whole thing.
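
The first recommendation, reusing connections through a simple pool, might look roughly like the sketch below. It is a generic pattern in plain Python, not the Event Hubs SDK: `FakeConnection` is a hypothetical stand-in for an AMQP or HTTP connection, and `ConnectionPool` is an illustrative helper.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical stand-in for an AMQP or HTTP connection."""

    opened = 0  # counts how many connections were ever created

    def __init__(self):
        FakeConnection.opened += 1
        self.sent = []

    def send(self, event):
        self.sent.append(event)


class ConnectionPool:
    """Create a fixed number of connections once and hand them out for reuse."""

    def __init__(self, factory, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it for reuse instead of closing


pool = ConnectionPool(FakeConnection, size=2)
for i in range(100):
    with pool.connection() as conn:
        conn.send({"id": i})

print("connections opened:", FakeConnection.opened)
```

One hundred events go out over the two connections created up front, rather than paying the setup cost once per event.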
