Analyzing Real-Time Data

Manjeet Chayel connects Spark Streaming to Amazon Kinesis and shows how to analyze the data in real time:

To use this post to play around with streaming data, you need an AWS account and the AWS CLI configured on your machine. The entire pattern can be implemented in a few simple steps:

  1. Create an Amazon Kinesis stream.

  2. Spin up an EMR cluster with Hadoop, Spark, and Zeppelin applications from advanced options.

  3. Use a simple Java producer to push random IoT event data into the Amazon Kinesis stream.

  4. Connect to the Zeppelin notebook.

  5. Import the Zeppelin notebook from GitHub.

  6. Analyze and visualize the streaming data.
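
As a rough illustration of step 3, here is a minimal Python sketch of what a producer's payload might look like. It is not the Java producer from the original post: the device names, field names, and the `make_event`, `to_put_records_entries`, and `build_batch` helpers are all hypothetical. The sketch only builds entries in the shape the Kinesis PutRecords call expects (a `Data` blob plus a `PartitionKey`) and does not contact AWS.

```python
import json
import random
import time

# Hypothetical device names; the original post does not specify any.
DEVICE_IDS = [f"sensor-{i:03d}" for i in range(5)]


def make_event(rng, now=None):
    """Build one random IoT reading. The field names are illustrative only."""
    return {
        "deviceId": rng.choice(DEVICE_IDS),
        "temperature": round(rng.uniform(15.0, 35.0), 2),
        "timestamp": int(now if now is not None else time.time()),
    }


def to_put_records_entries(events):
    """Shape events as entries for a Kinesis PutRecords request.

    Each entry carries a Data blob and a PartitionKey. Using the device id
    as the key means records from one device share a partition key.
    """
    return [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": e["deviceId"]}
        for e in events
    ]


def build_batch(size=10, seed=None):
    """Generate `size` random events and wrap them as PutRecords entries."""
    rng = random.Random(seed)
    return to_put_records_entries([make_event(rng) for _ in range(size)])


if __name__ == "__main__":
    batch = build_batch(size=3, seed=42)
    for entry in batch:
        print(entry["PartitionKey"], entry["Data"].decode("utf-8"))
    # Assuming boto3 is installed and credentials are configured, the batch
    # could then be sent with something like:
    #   boto3.client("kinesis").put_records(StreamName="my-iot-stream", Records=batch)
    # where "my-iot-stream" is a placeholder for the stream created in step 1.
```

Keeping the payload construction separate from the network call makes it easy to inspect the events locally before pointing the producer at a real stream.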

This is a good way of getting started with streaming data. I’ve grown quite fond of notebooks in the short time that I’ve used them, as they make it very easy for people who know what they’re doing to provide code and information to people who want to know what they’re doing.

Related Posts

Testing Kafka Streams Applications

Yeva Byzek continues her series on testing Kafka-based streaming applications: When you create a stream processing application with Kafka’s Streams API, you create a Topology either using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing though, connecting to a running […]

Auto ML With SQL Server 2019 Big Data Clusters

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem: H2O provides popular open source software for data science and machine learning on big data, including Apache Spark™ integration. It provides two open source Python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, […]
