Apache Flink is an open source platform for distributed stream and batch data processing. It can run on Windows, Mac OS and Linux OS. In this blog post, let’s discuss how to set up Flink cluster locally. It is similar to Spark in many ways – it has APIs for Graph and Machine learning processing like Apache Spark – but Apache Flink and Apache Spark are not exactly the same.
To set up Flink cluster, you must have java 7.x or higher installed on your system. Since I have Hadoop-2.2.0 installed at my end on CentOS ( Linux ), I have downloaded Flink package which is compatible with Hadoop 2.x. Run below command to download Flink package.
We first receive the order ID and the total amount of the order, and then we receive the line items of the order. The first value is the item ID, the second is the order ID, (which matches the order ID value) and then the cost of the item. In this example, we have two orders. The first one has four items and the second one has only one item.
The idea is to hide all of this from our Spark application, so what it receives on the DStream is a complete order defined on a stream as follows:
Check out this practical application of Spark Streaming.
Business users have been content to perform analytics on data collected in Amazon Redshift to spot trends. But recently, they have been asking AWS whether the latency can be reduced for real-time analysis. At the same time, they want to continue using the analytical tools they’re familiar with.
In this situation, we need a system that lets you capture the data stream in real time and use SQL to analyze it in real time.
In the earlier section, you learned how to build the pipeline to Amazon Redshift with Firehose and Lambda functions. The following illustration shows how to use Apache Spark Streaming on EMR to compute time window statistics from DynamoDB Streams. The computed data can be persisted to Amazon S3 and accessed with SparkSQL using Apache Zeppelin.
There are a lot of technologies at play here and it’s worth a perusal, even though I’m going to keep recommending that you use a relational database like SQL Server for OLTP work in all but the most extreme of circumstances.
We’ll aim to predict the volume of events for the next 10 minutes using a streaming regression model, and compare those results to a traditional batch prediction method. This prediction could then be used to dynamically scale compute resources, or for other business optimization. I will start out by describing how you would do the prediction through traditional batch processing methods using both Apache Impala (incubating) and Apache Spark, and then finish by showing how to more dynamically predict usage by using Spark Streaming.
Of course, the starting point for any prediction is a freshly updated data feed for the historic volume for which I want to forecast future volume. In this case, I discovered that Meetup.com has a very nice data feed that can be used for demonstration purposes. You can read more about the API here, but all you need to know at this point is that it provides a steady stream of RSVP volume that we can use to predict future RSVP volume.
This is pretty dense, but it is a great look at one potential architecture leveraging Spark and several tools in the Hadoop ecosystem.
One thing we are proud of in Spark is creating APIs that are simple, intuitive, and expressive. Spark 2.0 continues this tradition, with focus on two areas: (1) standard SQL support and (2) unifying DataFrame/Dataset API.
On the SQL side, we have significantly expanded the SQL capabilities of Spark, with the introduction of a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all the 99 TPC-DS queries, which require many of the SQL:2003 features. Because SQL has been one of the primary interfaces Spark applications use, this extended SQL capabilities drastically reduce the porting effort of legacy applications over to Spark.
There’s some great stuff coming out of DataBricks. Spark 2.0 looks to be an exciting product.
Grafana is deployed, managed and pre-configured to work with the Ambari Metrics service. We are including a curated set dashboards for core HDP components, giving operators at-a-glance views of the same metrics Hortonworks Support & Engineering review when helping customers troubleshoot complex issues.
Metrics displayed on each dashboard can be filtered by time, component, and contextual information (YARN queues for example) to provide greater flexibility, granularity and context.
Ambari is really shaping up to be a nice framework for managing a Hadoop cluster. I’m excited to see improved monitoring capabilities.
However, the logs can be corrupted. For example, the second line is a blank line, the fourth line reports some network issues and finally the last line shows a sales value of zero (which cannot happen!).
We can use accumulators to analyse the transaction log to find out the number of blank logs (blank lines), number of times the network failed, any product that does not have a category or even number of times zero sales were recorded. The full sample log can be found here.
Accumulators are applicable to any operation which are,
1. Commutative -> f(x, y) = f(y, x), and
2. Associative -> f(f(x, y), z) = f(f(x, z), y) = f(f(y, z), x)
For example, sum and max functions satisfy the above conditions whereas average does not.
Accumulators are an important way of measuring just how messy your semi-structured data is.
Apache Falcon is a framework for managing data life cycle in Hadoop clusters. It establishes relationship between various data and processing elements on a Hadoop environment, and also provides feed management services such as feed retention, replications across clusters, archival etc.
Let us first discuss how to setup Apache Falcon. Run the below given command to download git repository of Falcon:
Command: git clone https://git-wip-us.apache.org/repos/asf/falcon.git falcon
Falcon comes as part of the Hortonworks Data Platform; Cloudera has its own alternative.
This distributed testing infrastructure started out as a Cloudera hackathon project in 2014. Todd Lipcon and I worked on a shared backend for running test tasks on a cluster, with Todd focusing on onboarding the Apache Kudu (incubating) tests, and myself on Apache Hadoop. Our prototype implementation reduced the runtime of the 1,700+ Hadoop unit tests from 8.5 hours to 15 minutes.
Since then, we’ve spent time improving the infrastructure and on-boarding additional projects. Besides Kudu and Hadoop, our distributed testing infrastructure is also being used by our Apache Hive and Apache HBase teams. We can now run all the Hadoop unit tests in less than 10 minutes!
Finally, we’re happy to announce that both our infrastructure and code are public! You can browse the webUI at http://dist-test.cloudera.org and see all the source code (ASLv2 licensed) at the cloudera/dist_test github repository. This infrastructure is already being used at upstream Apache to run the Kudu pre-commit tests.
This is an interesting look at how to scale out unit tests. It’s a bit of a long read (especially with all the videos) but worth your time.
This week, we’re excited that Forrester recognized Microsoft Azure as a leader in their Big Data Hadoop Cloud Solutions. Apache Hadoop as a technology has become popular amongst organizations to unlock insights from data of all size, shape, and speed. Hadoop power solutions to help businesses improve their performance, educators to better connect with the needs of their students, medical professionals to improve the quality of their care, or researchers to accelerate new advancements in science.
As an example, Ultra Tendency uses Hadoop to achieve something not possible before – visualize more than 27 million distinct sensor readings to give Japanese citizens accurate, up-to-date information about the radiation contamination from the Fukushima nuclear plant meltdown. More and more organizations are also deploying Hadoop in the cloud with 47% of Forrester’s respondents to a 2015 survey increasing their cloud deployments either by 5-10% (37%) or more than 10% (10%).1 This makes sense because the cloud allows you to scale elastically on demand to handle the processing of any amount of data.
AWS and IBM also have very good solutions, and Google is trying to get a stronger foothold on the cloud game.