Apache Spark is a general purpose cluster computing platform which extends map-reduce to support multiple computation types including but not limited to stream processing and interactive queries. Last week IBM’s Moktar Kandil presented at the Tampa Hadoop and Tampa Data Science Group Joint meetup on the topic of exploring Apache Spark.
Following are some of the slides discussed in the meetup. To play with the ALS Recommendation engine notebook, please register at www.datascientistworkbench.com which is a free notebook for Apache Spark platform for educational purposes.
Check out the links.
What’s the difference in MapR Streams and Kafka Streams?
This one’s easy: Different technologies for different purposes. There’s a difference between messagingtechnologies (Apache Kafka, MapR Streams) versus tools for processing streaming data (such as Apache Flink, Apache Spark Streaming, Apache Apex). Kafka Streams is a soon-to-be-released processing tool for simple transformations of streaming data. The more useful comparison is between its processing capabilities and those of more full-service stream processing technologies such as Spark Streaming or Flink.
Despite the similarity in names, Kafka Streams aims at a different purpose than MapR Streams. The latter was released in January 2016. MapR Streams is a stream messaging system that is integrated into the MapR Converged Platform. Using the Apache Kafka 0.9 API, MapR Streams provides a way to deliver messages from a range of data producer types (for instance IoT sensors, machine logs, clickstream data) to consumers that include but are not limited to real-time or near real-time processing applications.
This also includes an interesting discussion of how the same term, “broker,” can be used in two different products in the same general product space and mean two distinct things.
Assuming an existing Cloudera Enterprise cluster with Impala services and HANA instances are running and that the HANA host has access to Impala daemons, configuring the integration is fairly straightforward
Install the Impala ODBC driver on the HANA host.
Configure the Impala data source.
Create remote source and virtual tables using SAP HANA Studio; then test.
There are a lot of screenshots and configuration files to help guide you through.
Apache Flink is an open source platform for distributed stream and batch data processing. It can run on Windows, Mac OS and Linux OS. In this blog post, let’s discuss how to set up Flink cluster locally. It is similar to Spark in many ways – it has APIs for Graph and Machine learning processing like Apache Spark – but Apache Flink and Apache Spark are not exactly the same.
To set up Flink cluster, you must have java 7.x or higher installed on your system. Since I have Hadoop-2.2.0 installed at my end on CentOS ( Linux ), I have downloaded Flink package which is compatible with Hadoop 2.x. Run below command to download Flink package.
We first receive the order ID and the total amount of the order, and then we receive the line items of the order. The first value is the item ID, the second is the order ID, (which matches the order ID value) and then the cost of the item. In this example, we have two orders. The first one has four items and the second one has only one item.
The idea is to hide all of this from our Spark application, so what it receives on the DStream is a complete order defined on a stream as follows:
Check out this practical application of Spark Streaming.
Business users have been content to perform analytics on data collected in Amazon Redshift to spot trends. But recently, they have been asking AWS whether the latency can be reduced for real-time analysis. At the same time, they want to continue using the analytical tools they’re familiar with.
In this situation, we need a system that lets you capture the data stream in real time and use SQL to analyze it in real time.
In the earlier section, you learned how to build the pipeline to Amazon Redshift with Firehose and Lambda functions. The following illustration shows how to use Apache Spark Streaming on EMR to compute time window statistics from DynamoDB Streams. The computed data can be persisted to Amazon S3 and accessed with SparkSQL using Apache Zeppelin.
There are a lot of technologies at play here and it’s worth a perusal, even though I’m going to keep recommending that you use a relational database like SQL Server for OLTP work in all but the most extreme of circumstances.
We’ll aim to predict the volume of events for the next 10 minutes using a streaming regression model, and compare those results to a traditional batch prediction method. This prediction could then be used to dynamically scale compute resources, or for other business optimization. I will start out by describing how you would do the prediction through traditional batch processing methods using both Apache Impala (incubating) and Apache Spark, and then finish by showing how to more dynamically predict usage by using Spark Streaming.
Of course, the starting point for any prediction is a freshly updated data feed for the historic volume for which I want to forecast future volume. In this case, I discovered that Meetup.com has a very nice data feed that can be used for demonstration purposes. You can read more about the API here, but all you need to know at this point is that it provides a steady stream of RSVP volume that we can use to predict future RSVP volume.
This is pretty dense, but it is a great look at one potential architecture leveraging Spark and several tools in the Hadoop ecosystem.
One thing we are proud of in Spark is creating APIs that are simple, intuitive, and expressive. Spark 2.0 continues this tradition, with focus on two areas: (1) standard SQL support and (2) unifying DataFrame/Dataset API.
On the SQL side, we have significantly expanded the SQL capabilities of Spark, with the introduction of a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all the 99 TPC-DS queries, which require many of the SQL:2003 features. Because SQL has been one of the primary interfaces Spark applications use, this extended SQL capabilities drastically reduce the porting effort of legacy applications over to Spark.
There’s some great stuff coming out of DataBricks. Spark 2.0 looks to be an exciting product.
Grafana is deployed, managed and pre-configured to work with the Ambari Metrics service. We are including a curated set dashboards for core HDP components, giving operators at-a-glance views of the same metrics Hortonworks Support & Engineering review when helping customers troubleshoot complex issues.
Metrics displayed on each dashboard can be filtered by time, component, and contextual information (YARN queues for example) to provide greater flexibility, granularity and context.
Ambari is really shaping up to be a nice framework for managing a Hadoop cluster. I’m excited to see improved monitoring capabilities.
However, the logs can be corrupted. For example, the second line is a blank line, the fourth line reports some network issues and finally the last line shows a sales value of zero (which cannot happen!).
We can use accumulators to analyse the transaction log to find out the number of blank logs (blank lines), number of times the network failed, any product that does not have a category or even number of times zero sales were recorded. The full sample log can be found here.
Accumulators are applicable to any operation which are,
1. Commutative -> f(x, y) = f(y, x), and
2. Associative -> f(f(x, y), z) = f(f(x, z), y) = f(f(y, z), x)
For example, sum and max functions satisfy the above conditions whereas average does not.
Accumulators are an important way of measuring just how messy your semi-structured data is.