Press "Enter" to skip to content

Category: Streaming

Scaling Kinesis Streams

Allan MacInnis shows how to scale Amazon Kinesis streams using the UpdateShardCount API call:

You also need to adjust the alarm threshold to accommodate for the new shard capacity automatically. For this example, update the alarm threshold to 80% of your new capacity (or 3200 records per second) by setting a CloudWatch alarm with an action to publish to a SNS topic when the alarm is triggered.

You can then create a Lambda function that subscribes to this SNS topic and executes a call to the new UpdateShardCount API operation while adjusting the CloudWatch alarm threshold. To learn how to configure a Cloudwatch alarm, see Creating Amazon Cloudwatch Alarms. For information about how to invoke a Lambda function from SNS, see Invoking Lambda Functions Using Amazon SNS Notifications.

This is pretty cool.

Comments closed

Stream Computing Platform

Ravi Peri shows how to set up the Stream Computing Platform for .NET (SCP.Net) library and kick off a job:

SCP.Net generates a zip file consisting of the topology DLLs and dependency jars.

It uses Java (if found in the PATH) or .net to generate the zip. Unfortunately, zip files generated with .net are not compatible with Linux clusters.

If you’re interesting in working with a Storm topology while writing .NET code, check this out.

Comments closed

Kafka Enrichment

I have an article on enriching data stored in a Kafka topic:

We’re going a bunch of setup work here, so let’s take it from the top.  First, I declare a consumer group, which I’m calling “Airplane Enricher.”  Kafka uses the concept of consumer groups to allow consumers to work in parallel.  Imagine that we have ten separate servers available to process messages from the Flights topic.  Each flight message is independent, so it doesn’t matter which consumer gets it.  What does matter, though, is that multiple consumers don’t get the same message, as that’s a waste of resources and could lead to duplicate data processing, which would be bad.

The way Kafka works around this is to use consumer groups:  within a consumer group, only one consumer will get a particular message.  That way, I can have my ten servers processing messages “for real” and maybe have another consumer in a different consumer group just reading through the messages getting a count of how many records are in the topic.  Once you treat topics as logs rather than queues, consumer design changes significantly.

This is a fairly lengthy read, but directly business-applicable, so I think it’s well worth it.

Comments closed

What Is Kafka?

I start a new series on Apache Kafka:

The broker serves several purposes:

  1. Know who the producers are and who the consumers are.  This way, the producers don’t care who exactly consumes a message and aren’t responsible for the message after they hand it off.
  2. Buffer for performance.  If the consumers are a little slow at the moment but don’t usually get overwhelmed, that’s okay—messages can sit with the broker until the consumer is ready to fetch.
  3. Let us scale out more easily.  Need to add more producers?  That’s fine—tell the broker who they are.  Need to add consumers?  Same thing.
  4. What about when a consumer goes down?  That’s the same as problem #2:  hold their messages until they’re ready again.

So brokers add a bit of complexity, but they solve some important problems.  The nice part about a broker is that it doesn’t need to know anything about the messages, only who is supposed to receive it.

This is an introduction to the product and part one of an eight-part series.

Comments closed

Kafka Consumer Groups

David Brinegar discusses consumer groups and lag in Apache Kafka:

While the Consumer Group uses the broker APIs, it is more of an application pattern or a set of behaviors embedded into your application.  The Kafka brokers are an important part of the puzzle but do not provide the Consumer Group behavior directly.  A Consumer Group based application may run on several nodes, and when they start up they coordinate with each other in order to split up the work.  This is slightly imperfect because the work, in this case, is a set of partitions defined by the Producer.  Each Consumer node can read a partition and one can split up the partitions to match the number of consumer nodes as needed.  If the number of Consumer Group nodes is more than the number of partitions, the excess nodes remain idle. This might be desirable to handle failover.  If there are more partitions than Consumer Group nodes, then some nodes will be reading more than one partition.

Read the whole thing.  It’s part one of a series.

Comments closed

Sentiment Analysis With Nifi

Timothy Spann ties together a bunch of interesting things with Apache Nifi, including integrations with Twitter, Slack, Tensorflow, and Zeppelin:

First up, I used GetTwitter to read tweets and filtered on these terms:

strata, stratahadoop, strataconf, NIFI, FutureOfData, ApacheNiFi, Hortonworks, Hadoop, ApacheHive, HBase, ApacheSpark, ApacheTez, MachineLearning, ApachePhoenix, ApacheCalcite,ApacheStorm, ApacheAtlas, ApacheKnox, Apache Ranger, HDFS, Apache Pig, Accumulo, Apache Flume, Sqoop, Apache Falcon

Input:

InvokeHttp: I used this to download the first image URL from tweets.

It’s interesting to see this all tie together relatively easily.

Comments closed

Stream Processing With Kafka And Spark

Satendra Kumar has a slide deck looking at combining Spark Streaming with Kafka:

Knoldus organized a Meetup on Friday, 9 September 2016. Topics which were covered in this meetup are:

  1. Overview of Spark Streaming.

  2. Fault-tolerance Semantics & Performance Tuning.

  3. Spark Streaming Integration with  Kafka.

Click through for the slide deck.  Combine that with the AWS blog post on the same topic and you get a pretty good intro.

Comments closed

Clickstream Anomaly Detection

Chris Marshall shows how to perform anomaly detection using AWS Kinesis Analytics:

The RANDOM_CUT_FOREST function greatly simplifies the programming required for anomaly detection.  However, understanding your data domain is paramount when performing data analytics.  The RANDOM_CUT_FOREST function is a tool for data scientists, not a replacement for them.  Knowing whether your data is logarithmic, circadian rhythmic, linear, etc. will provide the insights necessary to select the right parameters for RANDOM_CUT_FOREST.  For more information about parameters, see the RANDOM_CUT_FOREST Function.

Fortunately, the default values work in a wide variety of cases. In this case, use the default values for all but the subSampleSize parameter.  Typically, you would use a larger sample size to increase the pool of random samples used to calculate the anomaly score; for this post, use 12 samples so as to start evaluating the anomaly scores sooner.

Your SQL query outputs one record every ten seconds from the tumbling window so you’ll have enough evaluation values after two minutes to start calculating the anomaly score.  You are also using a cutoff value where records are only output to “DESTINATION_SQL_STREAM” if the anomaly score from the function is greater than 2 using the WHERE clause. To help visualize the cutoff point, here are the data points from a few runs through the pipeline using the sample Python script:

This kind of scenario is pretty cool—you could also do things like detecting service outages in streams (fewer than X events in a window, where X is some very small number relative to your overall data) or changes in advertising campaigns.

Comments closed

Real-Time Power BI Dashboards

Reza Rad builds a real-time dashboard with Stream Analytics and Power BI:

IoT Devices or Applications can pass their data to Azure Event Hub, and Azure Event hub can be used as an input to Azure Stream Analytics (which is a data streaming Azure service). Then Azure stream analytics can pass the data from input based on queries to outputs. If Power BI be used as an output then a dataset in Power BI will be generated that can be used for real-time dashboard.

As a result anytime a new data point from application or IoT device comes through Event hubs, and then Stream Analytics, Power BI dashboard will automatically update with new information.

This is a pretty nice weekend project.

Comments closed

Flink And Kafka Streams

Neha Narkhede and Stephan Ewen compare Apache Flink versus Kafka Streams:

Before Flink, users of stream processing frameworks had to make hard choices and trade off either latency, throughput, or result accuracy. Flink was the first open source framework (and still the only one), that has been demonstrated to deliver (1) throughput in the order oftens of millions of events per second in moderate clusters, (2) sub-second latency that can be as low as few 10s of milliseconds, (3) guaranteed exactly once semantics for application state, as well as exactly once end-to-end delivery with supported sources and sinks (e.g., pipelines from Kafka to Flink to HDFS or Cassandra), and (4) accurate results in the presence of out of order data arrival through its support for event time. Flink is based on a cluster architecture with master and worker nodes. Flink clusters are highly available, and can be deployed standalone or with resource managers such as YARN and Mesos. This architecture is what allows Flink to use a lightweight checkpointing mechanism to guarantee exactly-once results in the case of failures, as well allow easy and correct re-processing via savepoints without sacrificing latency or throughput. Finally, Flink is also a full-fledged batch processing framework, and, in addition to its DataStream and DataSet APIs (for stream and batch processing respectively), offers a variety of higher-level APIs and libraries, such as CEP (for Complex Event Processing), SQL and Table (for structured streams and tables), FlinkML (for Machine Learning), and Gelly (for graph processing). Flink has been proven to run very robustly in production at very large scale by several companies, powering applications that are used every day by end customers.

The upshot is that the two products don’t do exactly the same thing, and there might be room in your organization for the two of them.

Comments closed