Kafka Streams: Kafka Streams was introduced as part of the tech preview release of the Confluent Platform a few months ago and is now available through Apache Kafka 0.10.0.0. Kafka Streams is a library that turns Apache Kafka into a full-featured, modern stream processing system. Kafka Streams includes a high-level language for describing common stream operations (such as joining, filtering, and aggregating records), allowing developers to quickly develop powerful streaming applications. Kafka Streams offers a true event-at-a-time processing model, handles out-of-order data, allows stateful and stateless processing, and can easily be deployed on many different systems: Kafka Streams applications can run on YARN, be deployed on Mesos, run in Docker containers, or just be embedded into existing Java applications.
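As a rough illustration of what "event-at-a-time" processing with a stateless step (filtering) and a stateful step (aggregating) means, here is a plain-Python sketch. It is not the Kafka Streams API, which is a Java library; the record shape and the function names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record shape for this sketch: (key, value).
Record = Tuple[str, int]


def filter_records(records: Iterable[Record], min_value: int) -> Iterator[Record]:
    """Stateless step: pass through only records whose value is at least min_value."""
    for key, value in records:
        if value >= min_value:
            yield key, value


def count_by_key(records: Iterable[Record]) -> Iterator[Tuple[str, int]]:
    """Stateful step: after each record, emit the updated running count for its key."""
    counts: Dict[str, int] = defaultdict(int)
    for key, _value in records:
        counts[key] += 1
        yield key, counts[key]


def pipeline(records: Iterable[Record], min_value: int = 10) -> Iterator[Tuple[str, int]]:
    """Chain the two steps; each record flows through both before the next is read."""
    return count_by_key(filter_records(records, min_value))
```

Because both steps are generators, nothing is batched: each incoming record is filtered and counted before the next one is pulled from the input.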
There are some nice improvements in this latest version of Kafka.
Grafana provides a powerful and customizable dashboard builder for visualizing time series data. Ambari installs Grafana v2.6 as a Master Component of AMS and adds a datasource for AMS to Grafana. The dashboard builder is supported through a Metadata API in AMS that allows easy discovery of metrics, applications and hosts which are the key components that formalize an API call to AMS. There has been significant work put into creating templated dashboards for Hadoop ecosystem services tailored towards analyzing issues and performance bottlenecks on the Hadoop cluster. The following is an image of the dashboard builder highlighting the metric name drop down with type ahead and auto complete along with options to apply aggregate functions as needed based on whether the metric is a GAUGE or a COUNTER.
This is the beginning of a good visualization system for Hadoop metrics.
Hadoop 3, as it currently stands (which is subject to change), won’t look significantly different from Hadoop 2, Ajisaka said. Made generally available in the fall of 2013, Hadoop 2 was a very big deal for the open source big data platform, as it introduced the YARN scheduler, which effectively decoupled the MapReduce processing framework from HDFS, and paved the way for other processing frameworks, such as Apache Spark, to process data on Hadoop simultaneously. That has been hugely successful for the entire Hadoop ecosystem.
It appears the list of new features in Hadoop 3 is slightly less ambitious than the Hadoop 2 undertaking. According to Ajisaka’s presentation, in addition to support for erasure coding and bug fixes, Hadoop 3 currently calls for new features like:
- shell script rewrite;
- task-level native optimization;
- the capability to derive heap size or MapReduce memory automatically;
- elimination of old features;
- and support for more than two NameNodes.
The big benefit of erasure coding is that you can potentially cut storage requirements in half, which can help in very large environments. Alex also notes that the first non-beta version of Hadoop 3 is expected to be released by the end of the year.
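To see where "in half" can come from, here is a small worked calculation. The numbers are assumptions, not something the presentation states: it compares HDFS-style 3x replication with a Reed-Solomon layout of 6 data units plus 3 parity units.

```python
def raw_storage_replication(logical_bytes: float, replicas: int = 3) -> float:
    """Raw bytes on disk when every block is stored `replicas` times."""
    return logical_bytes * replicas


def raw_storage_erasure_coded(
    logical_bytes: float, data_units: int = 6, parity_units: int = 3
) -> float:
    """Raw bytes on disk when each group of data units carries extra parity units."""
    return logical_bytes * (data_units + parity_units) / data_units


if __name__ == "__main__":
    logical_tb = 100  # hypothetical data set size in TB
    replicated = raw_storage_replication(logical_tb)
    encoded = raw_storage_erasure_coded(logical_tb)
    print(f"3x replication: {replicated} TB raw")
    print(f"6+3 erasure coding: {encoded} TB raw")
```

Under these assumptions, 100 TB of data needs 300 TB of raw disk with replication and 150 TB with erasure coding. Other coding layouts would give a different ratio.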
An EMR 4.6 cluster running Spark 1.6.1 will still use Python 2.7 as the default interpreter. If you want to change this, you will need to set the environment variable PYSPARK_PYTHON=python34. You can do this when you launch a cluster by using the configurations API and supplying a configuration that sets this variable.
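A minimal sketch of such a configuration, built in Python so it can be dumped to JSON and supplied at cluster launch. It assumes the usual EMR layout of a spark-env classification with a nested export classification; the helper name is hypothetical, so check the result against the EMR documentation before relying on it.

```python
import json
from typing import Any, Dict, List


def build_spark_python_config(interpreter: str = "python34") -> List[Dict[str, Any]]:
    """Return an EMR-style configuration list that exports PYSPARK_PYTHON.

    Assumption: the variable is exported through the "spark-env"
    classification and its nested "export" classification.
    """
    return [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"PYSPARK_PYTHON": interpreter},
                }
            ],
        }
    ]


if __name__ == "__main__":
    # Print the JSON document to save to a file or paste into the launch request.
    print(json.dumps(build_spark_python_config(), indent=2))
```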
I’m more of a SQL and Scala guy, but if you like Python and are on the Python 3 side of the divide, here’s a solution for you.
Storm was originally created by Nathan Marz while he was at Backtype (later acquired by Twitter) working on analytics products based on historical and real-time analysis of the Twitter firehose. Nathan envisioned Storm as a replacement for the real-time component that was based on a cumbersome and brittle system of distributed queues and workers. Storm introduced the concept of the “stream” as a distributed abstraction for data in motion, as well as a fault tolerance and reliability model that was difficult, if not impossible, to achieve with a traditional queues and workers architecture.
Nathan open sourced Storm to GitHub on September 19th, 2011 during his talk at Strange Loop, and it quickly became the most watched JVM project on GitHub. Production deployments soon followed, and the Storm development community rapidly expanded.
Storm is an exciting technology in that it’s a key driver in making Hadoop more than just a batch processing framework.
In this article we used an artificial neural network (ANN) from the Spark machine learning library as a classifier to predict emergency department deaths due to heart disease. We discussed a high-level process for feature selection and for choosing the number of hidden layers of the network and the number of computational units. Based on that process, we found a model that achieved very good performance on test data. We observed that the Spark MLlib API is simple and easy to use for training the classifier and calculating its performance metrics. In reference to Hastie et al., we have some final comments.
Articles like this are what got me interested in data analysis to begin with.
There are several ways to keep the data updated: a cron job, a Linux daemon running as a service, or a streaming tool such as StreamSets.
The easiest way might be to run the task as a cron job with an interval of one to thirty seconds depending on monitoring needs. This may be suitable for a proof of concept, a small test cluster, or even a production cluster. The main drawback of using cron is that control over the execution is limited to running the script and resources aren't shared, meaning we open and close a connection to Elasticsearch, and redo the work of calling the REST endpoint, on every invocation.
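The alternative to paying that connection cost on every run is a long-lived process that opens the connection once and reuses it. The sketch below shows the shape of that loop. MetricsSink is a hypothetical stand-in for an Elasticsearch client and fetch stands in for the REST call; neither is taken from the original article.

```python
import time
from typing import Callable, List


class MetricsSink:
    """Hypothetical stand-in for an Elasticsearch client; counts connections opened."""

    def __init__(self) -> None:
        self.connections_opened = 0
        self.documents: List[dict] = []
        self._connected = False

    def connect(self) -> None:
        self.connections_opened += 1
        self._connected = True

    def close(self) -> None:
        self._connected = False

    def index(self, doc: dict) -> None:
        if not self._connected:
            raise RuntimeError("index() called without an open connection")
        self.documents.append(doc)


def run_poller(
    fetch: Callable[[], dict],
    sink: MetricsSink,
    iterations: int,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Open one connection, then fetch and index a document every interval."""
    sink.connect()
    try:
        for i in range(iterations):
            sink.index(fetch())
            if i < iterations - 1:  # no need to wait after the final poll
                sleep(interval_seconds)
    finally:
        sink.close()  # always release the connection, even if a poll fails
```

A real daemon would loop until it is told to stop rather than for a fixed number of iterations. The fixed count here just keeps the sketch easy to run and check.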
Kibana makes for some pretty dashboards.
Apache Spark is a general-purpose cluster computing platform which extends MapReduce to support multiple computation types, including but not limited to stream processing and interactive queries. Last week IBM’s Moktar Kandil presented at the Tampa Hadoop and Tampa Data Science Group joint meetup on the topic of exploring Apache Spark.
Following are some of the slides discussed in the meetup. To play with the ALS Recommendation engine notebook, please register at www.datascientistworkbench.com, which is a free Apache Spark notebook platform for educational purposes.
Check out the links.
What’s the difference between MapR Streams and Kafka Streams?
This one’s easy: Different technologies for different purposes. There’s a difference between messaging technologies (Apache Kafka, MapR Streams) and tools for processing streaming data (such as Apache Flink, Apache Spark Streaming, Apache Apex). Kafka Streams is a soon-to-be-released processing tool for simple transformations of streaming data. The more useful comparison is between its processing capabilities and those of more full-service stream processing technologies such as Spark Streaming or Flink.
Despite the similarity in names, Kafka Streams aims at a different purpose than MapR Streams. The latter was released in January 2016. MapR Streams is a stream messaging system that is integrated into the MapR Converged Platform. Using the Apache Kafka 0.9 API, MapR Streams provides a way to deliver messages from a range of data producer types (for instance IoT sensors, machine logs, clickstream data) to consumers that include but are not limited to real-time or near real-time processing applications.
This also includes an interesting discussion of how the same term, “broker,” can be used in two different products in the same general product space and mean two distinct things.
Assuming that an existing Cloudera Enterprise cluster with Impala services and the HANA instances are running, and that the HANA host has access to the Impala daemons, configuring the integration is fairly straightforward:
Install the Impala ODBC driver on the HANA host.
Configure the Impala data source.
Create remote source and virtual tables using SAP HANA Studio; then test.
There are a lot of screenshots and configuration files to help guide you through.