Category: Hadoop

Making Hadoop A Relational Database

Published 2018-03-27 by Kevin Feasel

Alex Woodie reports on an offering by a company named Splice Machine:

“Everybody is struggling to figure out how to expose what’s been put on the data lake to the business,” Splice Machine founder and CEO Monte Zweben told Datanami at the recent Strata Data Conference. “Our opinion is that you can take infrastructure that people understand, like relational database management systems, and run them directly on the data lake.”

That’s essentially the message that Splice has been pushing since the peak of the Hadoop frenzy in the 2013-2015 timeframe, and it’s the same message that it’s pushing today. The big difference, according to Zweben, is the maturity level. Splice Machine’s open source technology that essentially turns Hadoop into a distributed ACID-compliant relational database is now ready for primetime. Wells Fargo is arguably its biggest paying customer and production use case, but it has dozens more across financial services, healthcare and other industries.

Feasel’s Law strikes again.

Comments closed

Comparing Hadoop On EC2 To EMR

Published 2018-03-27 by Kevin Feasel

Mark Litwintschik looks at EC2 versus EMR in terms of performance, specifically targeting a solution at less than $3 per hour:

The $3.00 price point was driven by the first method: running a single-node Hadoop installation. I wanted to make sure the dataset used in this benchmark could easily fit into memory.

The price set the limit for the second method: AWS EMR. This is Amazon’s Hadoop Platform offering. It has a huge feature set but the key one is that it lets you setup Hadoop clusters with very little instruction. The $3.00 price limit includes the service fee for EMR.

Note I’ll be running everything in AWS’ Irish region and the prices mentioned are region- and in the case of spot prices, time-specific.

I was a bit surprised at which service won.

Comments closed

Continuous Processing Mode With Spark Structured Streaming

Published 2018-03-26 by Kevin Feasel

Joseph Torres, et al, explain how continuous processing mode works with Apache Spark 2.3’s structured streaming:

Suppose we want to build a real-time pipeline to flag fraudulent credit card transactions. Ideally, we want to identify and deny a fraudulent transaction as soon as the culprit has swiped his/her credit card. However, we don’t want to delay legitimate transactions as that would annoy customers. This leads to a strict upper bound on the end-to-end processing latency of our pipeline. Given that there are other delays in transit, the pipeline must process each transaction within 10-20 ms.

Let’s try to build this pipeline in Structured Streaming. Assume that we have a user-defined function “isPaymentFlagged” that can identify the fraudulent transactions. To minimize the latency, we’ll use a 0 second processing time trigger indicating that Spark should start each micro batch as fast as it can with no delays.

They also explain how this newer model differs from the prior model of collecting events in microbatches.

Comments closed

Joining Multiple Types Of Data With KSQL

Published 2018-03-23 by Kevin Feasel

Robin Moffatt has an example where he enriches streaming CSV data with information stored in MySQL:

This is a continuous query that executes in the background until explicitly terminated by the user. In effect, these are stream processing applications, and all we need to create them is SQL! Here all we’ve done is an enrichment (joining two sets of data), but we could easily add predicates to the data (simply include a WHERE clause), or even aggregations.

You can see which queries are running with the SHOW QUERIES; statement. All queries will pause if the KSQL server stops, and restart automagically when the KSQL server starts again.

The DESCRIBE EXTENDED command can be used to see information about the derived stream such as the one created above. As well as simply the columns involved, we can see information about the underlying topic, and run-time stats such as the number of messages processed and the timestamp of the most recent one.

It’s pretty easy to do; click through to see just how easy.

Comments closed

Kafka Partitioning Strategies

Published 2018-03-21 by Kevin Feasel

Amy Boyle shares some thoughts on Kafka partitioning strategy:

If you have enough load that you need more than a single instance of your application, you need to partition your data. The producer clients decide which topic partition data ends up in, but it’s what the consumer applications will do with that data that drives the decision logic. If possible, the best partitioning strategy to use is random.

However, you may need to partition on an attribute of the data if:

The consumers of the topic need to aggregate by some attribute of the data.
The consumers need some sort of ordering guarantee.
Another resource is a bottleneck and you need to shard data.
You want to concentrate data for the efficiency of storage and/or indexing.

Good advice.

Comments closed

Single-Node Hadoop 3 Installation

Published 2018-03-21 by Kevin Feasel

Mark Litwintschik has a fairly simple guide for installing Hadoop 3 on a single node for testing:

This post is meant to help people explore Hadoop 3 without feeling the need they should be using 50+ machines to do so. I’ll be using a fresh installation of Ubuntu 16.04.2 LTS on a single computer. The machine has an Intel Core i5-7300HQ CPU clocked at 2.50GHz, 8 GB of RAM and a 200 GB mechanical disk drive. I intentionally picked a low end machine to demonstrate not much is needed to try out Hadoop in a learning exercise.

Please do be mindful these instructions are aimed at building a test environment that is cut off from the outside world. Beyond the fact this is a single machine installation for software which is meant to run on multiple machines there would need to be significant content changes to turn these instructions into production installation notes.

It’s a useful guide if you’re not interested in going with one of the platform vendors like Hortonworks or Cloudera.

Comments closed

Cassandra To Kafka Connect

Published 2018-03-19 by Kevin Feasel

Mike Barlotta shows how to feed data into Kafka from Cassandra via Kafka Connect. Part one involves basic setup:

Modeling data in Cassandra must be done around the queries that are needed to access the data (see this article for details). Typically this means that there will be one table for each query and data (in our case about the pack) will be duplicated across numerous tables.

Regardless of the other tables used for the product, the Cassandra Source connector needs a table that will allow us to query for data using a time range. The connector is designed around its ability to generate a CQL query based on configuration. It uses this query to retrieve data from the table that is available within a configurable time range. Once all of this data has been published, Kafka Connect will mark the upper end of the time range as an offset. The connector will then query the table for more data using the next time range starting with the date/time stored in the offset. We will look at how to configure this later. For now we want to focus on the constraints for the table. Since Cassandra doesn’t support joins, the table we are pulling data from must have all of the data that we want to put onto the Kafka topic. Data in other tables will not be available to Kafka Connect.

Part 2 is around tuning the connector:

One of the problems we initially had with the Cassandra Source connector was how much data it tried to process during one polling cycle. In the original versions (0.2.5 and 0.2.6) the connector would retrieve all of the data that was inserted since the last polling cycle. For systems ingesting large amounts of data this can pose a challenge.

Our logs showed that it took 6 hours to retrieve and publish 6.8 million rows of data.

The problem (or one of them) with this slow rate of ingestion was that the table was continuing to have new data inserted into it while the connector was processing the data it had retrieved. With data being added to the table faster than it was being published the connector was getting behind. Worse there was no opportunity for it to ever catch up, until there was a lull in receiving new data.

If you’re using Cassandra, this looks like a rather useful connector.

Comments closed

Use Cases For Apache Kafka

Published 2018-03-19 by Kevin Feasel

Amy Boyle shows a few scenarios where New Relic uses Apache Kafka:

The Events Pipeline team is responsible for plumbing some of New Relic’s core data streams-specifically, event data. These are fine-grained nuggets of monitoring data that record a single event at a particular moment in time. For example, an event could be an error thrown by an application, a page view on a browser, or an e-commerce shopping cart transaction.

In this post, we show how we built our Kafka pipeline so that it stitches together microservices and serves as a changelog and “durable cache,” all with the idea of processing data streams as smoothly and effectively as possible at our scale. In an upcoming post, we’ll share thoughts on how we manage topic partitions in this pipeline.

If you’re wondering if Kafka might be right for you, check out this post for several scenarios which fit.

Comments closed

Event Sourcing On Kafka

Published 2018-03-15 by Kevin Feasel

Adam Warski shows how you can use Apache Kafka as your event sourcing data source:

There’s a number of great introductory articles, so this is going to be a very brief introduction. With event sourcing, instead of storing the “current” state of the entities that are used in our system, we store a stream of events that relate to these entities. Each event is a fact, it describes a state change that occurred to the entity (past tense!). As we all know, facts are indisputable and immutable. For example, suppose we had an application that saved a customer’s details. If we took an event sourcing approach, we would store every change made to that customer’s information as a stream, with the current state derived from a composition of the changes, much like a version control system does. Each individual change record in that stream would be an immutable, indisputable fact.

Having a stream of such events, it’s possible to find out what’s the current state of an entity by folding all events relating to that entity; note, however, that it’s not possible the other way round — when storing the current state only, we discard a lot of valuable historical information.

Event sourcing can peacefully co-exist with more traditional ways of storing state. A system typically handles a number of entity types (e.g. users, orders, products, …), and it’s quite possible that event sourcing is beneficial for only some of them. It’s important to remember that it’s not an all-or-nothing choice, but an additional possibility when it comes to choosing how state is managed in our application.

It’s a helpful article and works hand in hand with a CQRS pattern.

Comments closed

The Basics Of Kafka Security

Published 2018-03-15 by Kevin Feasel

Stephane Maarek has a nice post covering some of the basics of securing an Apache Kafka cluster:

Once your Kafka clients are authenticated, Kafka needs to be able to decide what they can and cannot do. This is where Authorization comes in, controlled by Access Control Lists (ACL). ACL are what you expect them to be: User A can(‘t) do Operation B on Resource C from Host D. Please note that currently with the packaged SimpleAclAuthorizer coming with Kafka, ACL are not implemented to have Groups rules or Regex-based rules. Therefore, each security rule has to be written in full (with the exception of the * wildcard).

ACL are great because they can help you prevent disasters. For example, you may have a topic that needs to be writeable from only a subset of clients or hosts. You want to prevent your average user from writing anything to these topics, hence preventing any data corruption or deserialization errors. ACLs are also great if you have some sensitive data and you need to prove to regulators that only certain applications or users can access that data.

He also has links to additional resources, which is helpful when you want to dig in further.

Comments closed