The log file viewer added in the Apache Storm 0.9.1 release made accessing Storm’s log files significantly easier, but in some cases still required examination individual log files one-by-one. In Storm 1.0 the UI now includes a powerful search feature that allows you to search a specific topology log file, or across all topology log files in the cluster, even archived files.
When performing a topology-wide search, the UI will search across all supervisor nodes for a match. The search results include a link to the matching log file, as well as host and port information that allow you quickly identify on which machine a specific log event occurred. This feature is particularly helpful when trying to track down when and where a particular error occurred.
The examples Taylor gives are all built around scaling. When you have dozens or hundreds of nodes, one-by-one solutions just don’t work.
Once your HBase table is ready, it’s time to map a table in Phoenix to your data in HBase. You use a JDBC connection to access Phoenix, and there are two drivers included on your cluster under /usr/lib/phoenix/bin. First, the Phoenix client connects directly to HBase processes to execute queries, which requires several ports to be open in your Amazon EC2 Security Group (for ZooKeeper, HBase Master, and RegionServers on your cluster) if your client is off-cluster.
Second, the Phoenix thin client connects to the Phoenix Query Server, which runs on port 8765 on the master node of your EMR cluster. This allows you to use a local client without adjusting your Amazon EC2 Security Groups by creating a SSH tunnel to the master node and using port forwarding for port 8765. The Phoenix Query Server is still a new component, and not all SQL clients can support the Phoenix thin client.
I am not HBase’s biggest fan, but I do think that Phoenix fixes one of the biggest problems HBase has: its being completely foreign to most data professionals. It’s not an accident that as a data platform matures, its development language looks more and more like SQL.
Single-page app: The initial page loads very quickly and asynchronously fetches the list of tables, table statistics, data sample, and partition list. Subsequent navigation clicks will trigger only 1 or 2 calls to the server, instead of reloading all the page resources again. As an added bonus, the browser history now works on all the pages.
These are some nice changes. I still don’t think a web app replaces quality tooling (like Management Studio), but if a web app is what you have, it should at least be nice.
Apache Storm was the original solution to Twitter’s problems. It was created by a marketing intelligence company called BackType, and Twitter bought the company in 2011 and eventually open-sourced Storm, providing it to the Apache Foundation.
There’s no question Storm has a lot of advantages. It’s scalable and fault-tolerant, with a decent ecosystem of “spouts,” or systems for receiving data from established sources. But it was reputedly also hard to work with and hard to get good results from, and despite a recent 1.0 renovation, it’s been challenged by other projects, including Apache Spark and its own revised streaming framework.
It’s good to see this competition in the streaming space.
Kafka Streams: Kafka Streams was introduced as part of thetech preview release of the Confluent Platform few months ago and is now available through Apache Kafka 0.10.0.0. Kafka Streams is a library that turns Apache Kafka into a full featured, modern stream processing system. Kafka Streams includes a high level language for describing common stream operations (such as joining, filtering, and aggregating records), allowing developers to quickly develop powerful streaming applications. Kafka Streams offers a true event-at-a-time processing model, handles out-of-order data, allows stateful and stateless processing and can easily be deployed on many different systems— Kafka Streams applications can run on YARN, be deployed on Mesos, run in Docker containers, or just embedded into existing Java applications.
There are some nice improvements in this latest version of Kafka.
Grafana provides a powerful and customizable dashboard builder for visualizing time series data. Ambari installs Grafana v2.6 as a Master Component of AMS and adds a datasource for AMS to Grafana. The dashboard builder is supported through a Metadata API in AMS that allows easy discovery of metrics, applications and hosts which are the key components that formalize an API call to AMS. There has been significant work put into creating templated dashboards for Hadoop ecosystem services tailored towards analyzing issues and performance bottlenecks on the Hadoop cluster. The following is an image of the dashboard builder highlighting the metric name drop down with type ahead and auto complete along with options to apply aggregate functions as needed based on whether the metric is a GAUGE or a COUNTER.
This is the beginning of a good visualization system for Hadoop metrics.
Hadoop 3, as it currently stands (which is subject to change), won’t look significantly different from Hadoop 2, Ajisaka said. Made generally available in the fall of 2013, Hadoop 2 was a very big deal for the open source big data platform, as it introduced the YARN scheduler, which effectively decoupled the MapReduce processing framework from HDFS, and paved the way for other processing frameworks, such as Apache Spark, to process data on Hadoop simultaneously. That has been hugely successful for the entire Hadoop ecosystem.
It appears the list of new features in Hadoop 3 is slightly less ambitious than the Hadoop 2 undertaking. According to Ajisaka’s presentation, in addition to support for erasure coding and bug fixes, Hadoop 3 currently calls for new features like:
- shell script rewrite;
- task-level native optimization;
- the capability to derive heap size or MapReduce memory automatically;
- eliminating of old features;
- and support for more than two NameNodes.
The big benefit to erasure coding is that you can potentially cut data usage requirements in half, so that can help in very large environments. Alex also notes that the first non-beta version of Hadoop 3 is expected to release by the end of the year.
An EMR 4.6 cluster running Spark 1.6.1 will still use Python 2.7 as the default interpreter. If you want to change this, you will need to set the environment variable: PYSPARK_PYTHON=python34. You can do this when you launch a cluster by using the configurations API and supplying the configuration shown in the snippet below:
I’m more of a SQL and Scala guy, but if you like Python and are on the Python 3 side of the divide, here’s a solution for you.
Storm was originally created by Nathan Marz while he was at Backtype (later acquired by Twitter) working on analytics products based on historical and real-time analysis of the Twitter firehose. Nathan envisioned Storm as a replacement for the real-time component that was based on a cumbersome and brittle system of distributed queues and workers. Storm introduced the concept of the “stream” as a distributed abstraction for data in motion, as well as a fault tolerance and reliability model that was difficult, if not impossible, to achieve with a traditional queues and workers architecture.
Nathan open sourced Storm to GitHub on September 19th, 2011 during his talk at Strange Loop, and it quickly became the most watched JVM project on GitHub. Production deployments soon followed, and the Storm development community rapidly expanded.
Storm is an exciting technology in that it’s a key driver in making Hadoop more than just a batch processing framework.
In this article we used an artificial neural network (ANN) from Spark machine learning library as a classifier to predict emergency department deaths due to heart disease. We discussed a high-level process for feature selection, choosing number of hidden layers of the network and number of computational units. Based on that process, we found a model that achieved very good performance on test data. We observed that Spark MLlib API is simple and easy to use for training the classifier and calculating its performance metrics. In reference to Hastie et. al, we have some final comments.
Articles like this are what got me interested in data analysis to begin with.