Category: Hadoop

You can configure Spot instances in Cloudera Director’s instance templates. These instance templates contain a flag indicating whether Spot instances should be used, as well as a field specifying the bid price for those instances.

Each instance group in the cluster template includes a field that indicates the minimum number of instances required in that group for the cluster to be considered successful. Cloudera Director will continue with bootstrapping or growing a cluster if the minimum count for each instance group is satisfied. Spot instances should not be used for instance groups that are required for the normal operation of the cluster, such as HDFS DataNodes. Instance groups configured to use Spot instances should set their minimum number to zero with the expectation that the instances may not be provisioned due to the Spot bid price being lower than the Spot price.

If you’re able to take advantage of spot instances, you can end up saving a pretty good amount of money.

Comments closed

Event Sourcing And Kafka

Published 2017-08-09 by Kevin Feasel

Ben Stopford gives an architectural overview of how Apache Kafka can act as an event sourcing system:

At a high level, event sourcing is really just the observation that in an event driven architecture, the events are facts. So if you keep them around, you can use them as a datasource.

One subtlety comes from the way the events are modelled. They can be values: whole facts (an Order, in its entirety) or they can be a set of ‘deltas’ that must be re-combined (a whole Order message, followed by messages denoting just the state changes: “amount updated to $5”, “order cancelled” etc).

As an analogy, imagine you are building a version control system. When a user commits a file for the first time, you save it. Subsequent commits might only save the ‘delta’: just the lines that were added, changed or removed. Then, when the user performs a checkout, you open the version-0 file and apply all the deltas, to derive to the current state.

The alternate approach is to simply store the whole file, exactly as it was at the time it was changed. This makes pulling out a version quick and easy, but to compare different versions you would have to perform a ‘diff’.

The series Ben has been working through is very helpful in wrapping your mind around what Kafka can do, and this post was no exception.

Comments closed

What’s New In Ambari 2.5

Published 2017-08-09 by Kevin Feasel

Paul Codding tells us what’s coming in the next version of Ambari:

Ambari Log Search (Tech Preview) has been one of our most popular features, and in this release has seen UI, and backend refreshes based on customer feedback. Log Search is planned for GA with the next major Ambari release, Ambari 3.0 in which the UI will be simplified, and the backend will have more robust log retention and scaling capabilities.

There are some interesting changes, so read the whole thing.

Comments closed

Using Hive LLAP On ElasticMapReduce

Published 2017-08-08 by Kevin Feasel

Jigar Mistry shows how to configure and use Hive LLAP on AWS’s ElasticMapReduce:

With many options available in the market (Presto, Spark SQL, etc.) for doing interactive SQL over data that is stored in Amazon S3 and HDFS, there are several reasons why using Hive and LLAP might be a good choice:

For those who are heavily invested in the Hive ecosystem and have external BI tools that connect to Hive over JDBC/ODBC connections, LLAP plugs in to their existing architecture without a steep learning curve.
It’s compatible with existing Hive SQL and other Hive tools, like HiveServer2, and JDBC drivers for Hive.
It has native support for security features with authentication and authorization (SQL standards-based authorization) using HiveServer2.
LLAP daemons are aware about of the columns and records that are being processed which enables you to enforce fine-grained access control.
It can use Hive’s vectorization capabilities to speed up queries, and Hive has better support for Parquet file format when vectorization is enabled.
It can take advantage of a number of Hive optimizations like merging multiple small files for query results, automatically determining the number of reducers for joins and groupbys, etc.
It’s optional and modular so it can be turned on or off depending on the compute and resource requirements of the cluster. This lets you to run other YARN applications concurrently without reserving a cluster specifically for LLAP.

Read on for more details, including the bootstrap action you need to take and how to use LLAP once you have it configured.

Comments closed

Unit Testing Kafka Streams

Published 2017-08-07 by Kevin Feasel

Anuj Saxena shows us how to build mocks for streams in Kafka Streams:

Here, we are using Kafka streams in our applications. We are done with the implementation but again, the most important thing left is testing. This blog is about how to test the application we have created. For this, I’ll be taking the sample app I created in my previous blog for both high-level DSL and low-level processor API.

Traditionally, we test our Kafka application with an integration test for which we need to create a ZooKeeper and a real Kafka broker. After that, we need a mock producer and mock consumer for our app to produce the inputs and receive the outputs. That makes it such a big hassle just to test our app. Testing it for real scenarios and for the actual integration test, this is needed without a doubt.

Click through for an example.

Comments closed

More On S3Guard

Published 2017-08-04 by Kevin Feasel

Aaron Fabbri describes how S3Guard works:

Although Apache Hadoop has support for using Amazon Simple Storage Service (S3) as a Hadoop filesystem, S3 behaves different than HDFS. One of the key differences is in the level of consistency provided by the underlying filesystem. Unlike HDFS, S3 is an eventually consistent filesystem. This means that changes made to files on S3 may not be visible for some period of time.

Many Hadoop components, however, depend on HDFS consistency for correctness. While S3 usually appears to “work” with Hadoop, there are a number of failures that do sometimes occur due to inconsistency:

FileNotFoundExceptions. Processes that write data to a directory and then list that directory may fail when the data they wrote is not visible in the listing. This is a big problem with Spark, for example.
Flaky test runs that “usually” work. For example, our root directory integration tests for Hadoop’s S3A connector occasionally fail due to eventual consistency. This is due to assertions about the directory contents failing. These failures occur more frequently when we run tests in parallel, increasing stress on the S3 service and making delayed visibility more common.
Missing data that is silently dropped. Multi-step Hadoop jobs that depend on output of previous jobs may silently omit some data. This omission happens when a job chooses which files to consume based on a directory listing, which may not include recently-written items.

Worth reading if you’re looking at using S3 to store data for Hadoop. Also check out an earlier post on the topic.

Comments closed

Chaining Exactly-Once Operations With Kafka

Published 2017-07-31 by Kevin Feasel

Ben Stopford shows how you can use Kafka to chain together services while maintaining exactly-once guarantees:

Any service-based architecture is itself a distributed system, a field renowned for being difficult, particularly when things go wrong. We have thought experiments like The Two Generals Problemand proofs like FLP which highlight that these systems are difficult to work with.

In practice we make compromises. We rely on timeouts. If one service calls another service and gets an error, or no response at all, it retries that call in the knowledge that it will get there in the end.

The problem is that retries can result in duplicate processing—which can cause very real problems. Taking a payment, twice, from someone’s account will lead to an incorrect balance. Adding duplicate tweets to a user’s feed will lead to a poor user experience. The list goes on.

I just had a discussion at SQL Saturday Albany about this exact thing, and the pain of rolling your own solutions.

Comments closed

Subscription Versus Assignment In Kafka

Published 2017-07-31 by Kevin Feasel

Paolo Patierno explains why you shouldn’t mix subscribe() and assign() in Kafka:

Another great advantage of consumers grouping is the rebalancing feature. When a consumer joins a group, if there are still enough partitions available (i.e. we haven’t reached the limit of one consumer per partition), a re-balancing starts and the partitions will be reassigned to the current consumers, plus the new one. In the same way, if a consumer leaves a group, the partitions will be reassigned to the remaining consumers.

What I have told so far it’s really true using the subscribe() method provided by the KafkaConsumerAPI. This method forces you to assign the consumer to a consumer group, setting the group.id property, because it’s needed for re-balancing. In any case, it’s not the consumer’s choice to decide the partitions it wants to read for. In general, the first consumer joins the group doing the assignment while other consumers join the group.

Read on to learn more.

Comments closed

Extracting Phone Numbers With Apache Tika

Published 2017-07-27 by Kevin Feasel

Unni Mana knows how to get your digits:

Last time, I had difficulties detecting phone numbers from different types of documents. The challenge was that I had to use different parsers to parse and extract the phone numbers. For example, to extract phone numbers from a Word document, I had to use a library that supports Word. Also, I cannot use the same library or logic to parse a PDF file. Ultimately, I need to maintain different libraries for different document types, which, as you can image, can lead to many issues.

It looks like this covers international phone numbers as well. Seems pretty interesting.

Comments closed

Kafka As A Backbone

Published 2017-07-26 by Kevin Feasel

Ben Stopford explains how to use Kafka as a backbone for a microservices architecture:

Taking a log-structured approach has an interesting side effect. Both reads and writes are sequential operations. This makes them sympathetic to the underlying media, leveraging pre-fetch, the various layers of caching and naturally batching operations together. This makes them efficient. In fact, when you read messages from Kafka, the server doesn’t even import them into the JVM. Data is copied directly from the disk buffer to the network buffer. An opportunity afforded by the simplicity of both the contract and the underlying data structure.

So batched, sequential operations help with overall performance. They also make the system well suited to storing messages longer term. Most traditional message brokers are built using index structures, hash tables or B-trees, used to manage acknowledgements, filter message headers, and remove messages when they have been read. But the downside is that these indexes must be maintained. This comes at a cost. They must be kept in memory to get good performance, limiting retention significantly. But the log is O(1) when either reading or writing messages to a partition, so whether the data is on disk or cached in memory matters far less.

This is a higher-level look and helps explain why I like Kafka so much as a message broker.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31