The broker serves several purposes:
- Know who the producers are and who the consumers are. This way, the producers don’t care who exactly consumes a message and aren’t responsible for the message after they hand it off.
- Buffer for performance. If the consumers are a little slow at the moment but don’t usually get overwhelmed, that’s okay—messages can sit with the broker until the consumer is ready to fetch.
- Let us scale out more easily. Need to add more producers? That’s fine—tell the broker who they are. Need to add consumers? Same thing.
- What about when a consumer goes down? That’s the same as problem #2: hold their messages until they’re ready again.
So brokers add a bit of complexity, but they solve some important problems. The nice part about a broker is that it doesn’t need to know anything about the messages themselves, only who is supposed to receive them.
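Those responsibilities can be sketched with a toy in-memory broker. This is a conceptual illustration only, nothing like Kafka’s actual implementation: producers hand a message off and forget about it, the broker buffers by topic without inspecting payloads, and consumers fetch whenever they’re ready.

```python
from collections import defaultdict, deque

class Broker:
    """Toy in-memory broker: producers hand off messages and move on;
    consumers fetch when ready, so a slow consumer just builds a backlog."""
    def __init__(self):
        self.queues = defaultdict(deque)  # one buffer per topic

    def publish(self, topic, message):
        # The broker never inspects the payload; it only routes by topic.
        self.queues[topic].append(message)

    def fetch(self, topic, max_messages=10):
        # Hand back up to max_messages buffered messages, oldest first.
        q = self.queues[topic]
        return [q.popleft() for _ in range(min(max_messages, len(q)))]

broker = Broker()
for i in range(5):
    broker.publish("clicks", f"event-{i}")      # producer is done here
batch = broker.fetch("clicks", max_messages=3)  # consumer catches up later
```

Note how the producer never learns who (if anyone) consumed its messages, and the consumer’s backlog simply waits in the broker until fetched.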
This is an introduction to the product and part one of an eight-part series.
While the Consumer Group uses the broker APIs, it is more of an application pattern, a set of behaviors embedded in your application. The Kafka brokers are an important part of the puzzle, but they do not provide the Consumer Group behavior directly. A Consumer Group-based application may run on several nodes, and when the nodes start up, they coordinate with each other to split up the work. This is slightly imperfect because the work, in this case, is a set of partitions defined by the Producer. Each Consumer node can read a partition, and the partitions can be divided up to match the number of consumer nodes as needed. If there are more Consumer Group nodes than partitions, the excess nodes remain idle; this might be desirable to handle failover. If there are more partitions than Consumer Group nodes, then some nodes will be reading more than one partition.
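The partition-splitting behavior described above can be sketched as a simple round-robin assignment. This is a simplification: real Kafka rebalancing is coordinated through the brokers and uses configurable assignment strategies, but the idle-node and multiple-partition cases fall out the same way.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partition assignment (simplified sketch; real Kafka
    rebalancing is coordinated via the group protocol, not done locally)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # Deal partitions out like cards, one per consumer in turn.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# More partitions than consumers: some consumers read several partitions.
a = assign_partitions([0, 1, 2, 3, 4], ["c1", "c2"])
# More consumers than partitions: the excess consumers sit idle.
b = assign_partitions([0, 1], ["c1", "c2", "c3"])
```

In the second case, `c3` gets an empty assignment, which is exactly the idle-but-ready-for-failover situation the article describes.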
Read the whole thing. It’s part one of a series.
Let’s put this architecture to the test with a realistic dataset size and workload. Our previous performance blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More”, discussed 4 reasons that LLAP delivers dramatically faster performance versus Hive on Tez. In that benchmark we saw 25+x performance boosts on ad-hoc queries with a dataset that fit entirely into the cluster’s memory.
In most cases, datasets will be far too large to fit in RAM so we need to understand if LLAP can truly tackle the big data challenge or if it’s limited to reporting roles on smaller datasets. To find out, we scaled the dataset up to 10 TB, 4x larger than aggregate cluster RAM, and we ran a number of far more complex queries.
Table 3 below shows how Hive LLAP is capable of running both At Speed and At Scale. The simplest query in the benchmark ran in 2.68 seconds on this 10 TB dataset, while the most complex query, Query 64, performed a total of 37 joins and ran for more than 20 minutes.
Given how much faster memory is than disk, and given Spark’s broad adoption, this makes sense as a strategy for Hive’s continued value.
1) Add more files to the directory, and the PolyBase external table will automagically read them.
2) Do INSERTS and UPDATES from PolyBase back to your files in Hadoop.
(See PolyBase – Insert data into a Hadoop Hue Directory and PolyBase – Insert data into new Hadoop Directory.)
3) It’s cleaner.
This is good advice. Also, if you’re using some other process to load data—for example, a map-reduce job or Spark job—you might have many smaller file chunks based on what the reducers spit out. It’s not a bad idea to cat those file chunks together, but at least if you use a folder for your external data location, your downstream processes will still work as expected regardless of how the data is laid out.
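As a sketch of that cleanup step, here is a hypothetical Python helper that concatenates Hadoop-style part-file chunks into a single file. The function name and the demo files are illustrative only; the part-* naming follows Hadoop’s reducer-output convention.

```python
import tempfile
from pathlib import Path

def concat_part_files(folder, out_path):
    """Concatenate reducer output chunks (part-00000, part-00001, ...)
    into one file. Hypothetical helper, not tied to any framework."""
    parts = sorted(Path(folder).glob("part-*"))  # sort to keep chunk order stable
    with open(out_path, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())

# Demo with two fake reducer outputs in a scratch folder.
folder = Path(tempfile.mkdtemp())
(folder / "part-00000").write_text("a,1\n")
(folder / "part-00001").write_text("b,2\n")
combined = folder / "combined.csv"
concat_part_files(folder, combined)
```

Of course, if the external table points at the folder rather than a single file, this step becomes optional housekeeping rather than a requirement.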
First up, I used GetTwitter to read tweets and filtered on these terms:
strata, stratahadoop, strataconf, NIFI, FutureOfData, ApacheNiFi, Hortonworks, Hadoop, ApacheHive, HBase, ApacheSpark, ApacheTez, MachineLearning, ApachePhoenix, ApacheCalcite, ApacheStorm, ApacheAtlas, ApacheKnox, Apache Ranger, HDFS, Apache Pig, Accumulo, Apache Flume, Sqoop, Apache Falcon
InvokeHttp: I used this to download the first image URL from tweets.
It’s interesting to see this all tie together relatively easily.
Next, we’ll define a DataFrame by loading data from a CSV file, which is stored in HDFS.
facebook_combined.txt contains two columns representing links between network nodes. The first column is called source (src), and the second is the destination (dst) of the link. (Some other systems, such as Gephi, use “source” and “target” instead.) First we define a custom schema, and then we load the DataFrame.
It sounds like Spark graph database engines are early in their lifecycle, but they might already be useful for simple analysis.
Hadoop’s ability to work with Amazon S3 storage goes back to 2006 and the issue HADOOP-574, “FileSystem implementation for Amazon S3”. This filesystem client, “s3://”, implemented an inode-style filesystem atop S3: it could support bigger files than S3 itself could at the time, and some of its operations (directory rename and delete) were fast. The s3 filesystem allowed Hadoop to be run in Amazon’s EMR infrastructure, using S3 as the persistent store of work. This piece of open source code predated Amazon’s release of EMR, “Elastic MapReduce”, by over two years. It’s also notable as the piece of work which gained Tom White, author of “Hadoop: The Definitive Guide”, committer status.
It’s interesting to see how this project has matured over the past decade.
First, it’s interesting to note that the PolyBase engine uses “pdw_user” as its user account. That’s not a blocker here because I have an open-door policy on my Hadoop cluster: no security lockdown because it’s a sandbox with no important information. Second, my IP address on the main machine is 192.168.58.1 and the name node for my Hadoop sandbox is at 192.168.58.129. These logs show that my main machine runs a getfileinfo command against /tmp/ootp/secondbasemen.csv. Then, the PolyBase engine asks permission to open /tmp/ootp/secondbasemen.csv and is granted permission. Then…nothing. It waits for 20-30 seconds and tries again. After four failures, it gives up. This is why it takes about 90 seconds to return an error message: it tries four times.
Aside from this audit log, there was nothing interesting on the Hadoop side. The YARN logs had nothing in them, indicating that whatever request happened never made it that far.
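The timing works out: four attempts at roughly 20-30 seconds each lands in the ~90-second range. Here is a hypothetical sketch of that retry pattern, not PolyBase’s actual code, with a tiny timeout so it runs quickly.

```python
import time

def fetch_with_retries(fetch, attempts=4, timeout=0.05):
    """Each attempt blocks for up to `timeout` seconds before giving up, so a
    dead endpoint costs roughly attempts * timeout before the error surfaces.
    (Hypothetical reconstruction of the observed behavior, not PolyBase code.)"""
    start = time.monotonic()
    for _ in range(attempts):
        try:
            return fetch(timeout)
        except TimeoutError:
            continue  # endpoint never answered; try again
    elapsed = time.monotonic() - start
    raise TimeoutError(f"gave up after {attempts} attempts ({elapsed:.2f}s)")

def dead_endpoint(timeout):
    time.sleep(timeout)   # wait out the full timeout...
    raise TimeoutError    # ...without ever getting an answer
```

Substituting the real numbers (4 attempts × ~25 seconds) reproduces the roughly 90-second wait before the error message appears.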
Here’s hoping there’s a solution in the future.
Overview of Spark Streaming.
Fault-tolerance Semantics & Performance Tuning.
Spark Streaming Integration with Kafka.
Click through for the slide deck. Combine that with the AWS blog post on the same topic and you get a pretty good intro.
Stream processing walkthrough
The entire pattern can be implemented in a few simple steps:
Set up Kafka on AWS.
Spin up an EMR 5.0 cluster with Hadoop, Hive, and Spark.
Create a Kafka topic.
Run the Spark Streaming app to process clickstream events.
Use the Kafka producer app to publish clickstream events into the Kafka topic.
Explore clickstream events data with SparkSQL.
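For a sense of step 5, here is a sketch of what the clickstream event payloads might look like before they are published. The field names and page values are assumptions for illustration; the walkthrough defines its own schema.

```python
import json
import random
import time

def make_click_event(user_id):
    """Hypothetical clickstream payload; the walkthrough's actual schema may differ."""
    return {
        "user_id": user_id,
        "page": random.choice(["/home", "/products", "/cart"]),
        "ts": int(time.time() * 1000),  # event time in epoch milliseconds
    }

# With a Kafka client library, each serialized event would be sent to the
# clickstream topic, e.g.: producer.send("clickstream", event_bytes)
events = [json.dumps(make_click_event(u)) for u in range(3)]
```

The Spark Streaming app in step 4 would then deserialize these JSON strings from the topic and expose them to SparkSQL as in step 6.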
This is a pretty easy-to-follow walkthrough with some good tips at the end.