Storm was originally created by Nathan Marz while he was at Backtype (later acquired by Twitter) working on analytics products based on historical and real-time analysis of the Twitter firehose. Nathan envisioned Storm as a replacement for the real-time component that was based on a cumbersome and brittle system of distributed queues and workers. Storm introduced the concept of the “stream” as a distributed abstraction for data in motion, as well as a fault tolerance and reliability model that was difficult, if not impossible, to achieve with a traditional queues and workers architecture.
Nathan open sourced Storm to GitHub on September 19th, 2011 during his talk at Strange Loop, and it quickly became the most watched JVM project on GitHub. Production deployments soon followed, and the Storm development community rapidly expanded.
Storm is an exciting technology in that it’s a key driver in making Hadoop more than just a batch processing framework.
In this article we used an artificial neural network (ANN) from Spark machine learning library as a classifier to predict emergency department deaths due to heart disease. We discussed a high-level process for feature selection, choosing number of hidden layers of the network and number of computational units. Based on that process, we found a model that achieved very good performance on test data. We observed that Spark MLlib API is simple and easy to use for training the classifier and calculating its performance metrics. In reference to Hastie et. al, we have some final comments.
Articles like this are what got me interested in data analysis to begin with.
There are several ways to keep the data updated: a cron job, a linux daemon running as a service, or a stream tool such as Streamsets.
The easiest way might be to run the task as a cron job with an interval of one to thirty seconds depending on monitoring needs. This may be suitable for a proof of concept or a small test cluster or even a production cluster. The main drawback of using a cron is that the control over the execution is limited to running the script and resources aren’t shared, meaning we are opening and closing a connection to Elasticsearch as well as doing the work to call the rest endpoint for each invocation.
Kibana makes for some pretty dashboards.
Apache Spark is a general purpose cluster computing platform which extends map-reduce to support multiple computation types including but not limited to stream processing and interactive queries. Last week IBM’s Moktar Kandil presented at the Tampa Hadoop and Tampa Data Science Group Joint meetup on the topic of exploring Apache Spark.
Following are some of the slides discussed in the meetup. To play with the ALS Recommendation engine notebook, please register at www.datascientistworkbench.com which is a free notebook for Apache Spark platform for educational purposes.
Check out the links.
What’s the difference in MapR Streams and Kafka Streams?
This one’s easy: Different technologies for different purposes. There’s a difference between messagingtechnologies (Apache Kafka, MapR Streams) versus tools for processing streaming data (such as Apache Flink, Apache Spark Streaming, Apache Apex). Kafka Streams is a soon-to-be-released processing tool for simple transformations of streaming data. The more useful comparison is between its processing capabilities and those of more full-service stream processing technologies such as Spark Streaming or Flink.
Despite the similarity in names, Kafka Streams aims at a different purpose than MapR Streams. The latter was released in January 2016. MapR Streams is a stream messaging system that is integrated into the MapR Converged Platform. Using the Apache Kafka 0.9 API, MapR Streams provides a way to deliver messages from a range of data producer types (for instance IoT sensors, machine logs, clickstream data) to consumers that include but are not limited to real-time or near real-time processing applications.
This also includes an interesting discussion of how the same term, “broker,” can be used in two different products in the same general product space and mean two distinct things.
Assuming an existing Cloudera Enterprise cluster with Impala services and HANA instances are running and that the HANA host has access to Impala daemons, configuring the integration is fairly straightforward
Install the Impala ODBC driver on the HANA host.
Configure the Impala data source.
Create remote source and virtual tables using SAP HANA Studio; then test.
There are a lot of screenshots and configuration files to help guide you through.
Apache Flink is an open source platform for distributed stream and batch data processing. It can run on Windows, Mac OS and Linux OS. In this blog post, let’s discuss how to set up Flink cluster locally. It is similar to Spark in many ways – it has APIs for Graph and Machine learning processing like Apache Spark – but Apache Flink and Apache Spark are not exactly the same.
To set up Flink cluster, you must have java 7.x or higher installed on your system. Since I have Hadoop-2.2.0 installed at my end on CentOS ( Linux ), I have downloaded Flink package which is compatible with Hadoop 2.x. Run below command to download Flink package.
We first receive the order ID and the total amount of the order, and then we receive the line items of the order. The first value is the item ID, the second is the order ID, (which matches the order ID value) and then the cost of the item. In this example, we have two orders. The first one has four items and the second one has only one item.
The idea is to hide all of this from our Spark application, so what it receives on the DStream is a complete order defined on a stream as follows:
Check out this practical application of Spark Streaming.
Business users have been content to perform analytics on data collected in Amazon Redshift to spot trends. But recently, they have been asking AWS whether the latency can be reduced for real-time analysis. At the same time, they want to continue using the analytical tools they’re familiar with.
In this situation, we need a system that lets you capture the data stream in real time and use SQL to analyze it in real time.
In the earlier section, you learned how to build the pipeline to Amazon Redshift with Firehose and Lambda functions. The following illustration shows how to use Apache Spark Streaming on EMR to compute time window statistics from DynamoDB Streams. The computed data can be persisted to Amazon S3 and accessed with SparkSQL using Apache Zeppelin.
There are a lot of technologies at play here and it’s worth a perusal, even though I’m going to keep recommending that you use a relational database like SQL Server for OLTP work in all but the most extreme of circumstances.
We’ll aim to predict the volume of events for the next 10 minutes using a streaming regression model, and compare those results to a traditional batch prediction method. This prediction could then be used to dynamically scale compute resources, or for other business optimization. I will start out by describing how you would do the prediction through traditional batch processing methods using both Apache Impala (incubating) and Apache Spark, and then finish by showing how to more dynamically predict usage by using Spark Streaming.
Of course, the starting point for any prediction is a freshly updated data feed for the historic volume for which I want to forecast future volume. In this case, I discovered that Meetup.com has a very nice data feed that can be used for demonstration purposes. You can read more about the API here, but all you need to know at this point is that it provides a steady stream of RSVP volume that we can use to predict future RSVP volume.
This is pretty dense, but it is a great look at one potential architecture leveraging Spark and several tools in the Hadoop ecosystem.