Category: Hadoop

Integrating Custom Data Sources Into Spark

Published 2016-05-16 by Kevin Feasel

Nicolas A Perez builds a custom Spark streaming data source:

We first receive the order ID and the total amount of the order, and then we receive the line items of the order. In each line item, the first value is the item ID, the second is the order ID (which matches the order ID value), and the third is the cost of the item. In this example, we have two orders. The first one has four items and the second one has only one item.

The idea is to hide all of this from our Spark application, so what it receives on the DStream is a complete order defined on a stream as follows:

Check out this practical application of Spark Streaming.
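The excerpt describes the wire format but the sample data did not survive here, so what follows is only a minimal sketch of the grouping idea in plain Python, not the author's Spark receiver. The whitespace delimiter, the `Order` and `OrderItem` names, the `assemble_orders` helper, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class OrderItem:
    item_id: int
    cost: float


@dataclass
class Order:
    order_id: int
    total: float
    items: List[OrderItem] = field(default_factory=list)


def assemble_orders(lines: Iterable[str]) -> Iterator[Order]:
    """Group a flat stream of header and item lines into complete orders.

    Assumed format (hypothetical): a header line is "<order_id> <total>",
    an item line is "<item_id> <order_id> <cost>".
    """
    current: Optional[Order] = None
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            # A new header closes the previous order, if there was one.
            if current is not None:
                yield current
            current = Order(int(parts[0]), float(parts[1]))
        elif len(parts) == 3 and current is not None:
            item_id, order_id, cost = int(parts[0]), int(parts[1]), float(parts[2])
            # Only attach items whose order ID matches the open order.
            if order_id == current.order_id:
                current.items.append(OrderItem(item_id, cost))
    # Emit the last order once the input is exhausted.
    if current is not None:
        yield current


# Made-up sample: one order with four items, then one order with one item.
sample = ["1 5", "1 1 2", "2 1 1", "3 1 1", "4 1 1", "2 1", "1 2 1"]
orders = list(assemble_orders(sample))
```

In a real custom receiver, logic along these lines would sit behind the data source so the application only ever sees complete orders.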


In-Memory OLTP Using Ignite

Published 2016-05-16 by Kevin Feasel

Babu Elumalai explains how to use Apache Ignite to build an in-memory OLTP system on top of Amazon’s DynamoDB:

Business users have been content to perform analytics on data collected in Amazon Redshift to spot trends. But recently, they have been asking AWS whether the latency can be reduced for real-time analysis. At the same time, they want to continue using the analytical tools they’re familiar with.

In this situation, we need a system that lets you capture the data stream in real time and use SQL to analyze it in real time.

In the earlier section, you learned how to build the pipeline to Amazon Redshift with Firehose and Lambda functions. The following illustration shows how to use Apache Spark Streaming on EMR to compute time window statistics from DynamoDB Streams. The computed data can be persisted to Amazon S3 and accessed with SparkSQL using Apache Zeppelin.

There are a lot of technologies at play here and it’s worth a perusal, even though I’m going to keep recommending that you use a relational database like SQL Server for OLTP work in all but the most extreme of circumstances.


Building A Prediction Engine

Published 2016-05-16 by Kevin Feasel

Richard Williamson explains how to build a prediction engine using technologies such as Spark, Kudu, Impala, and Kafka:

We’ll aim to predict the volume of events for the next 10 minutes using a streaming regression model, and compare those results to a traditional batch prediction method. This prediction could then be used to dynamically scale compute resources, or for other business optimization. I will start out by describing how you would do the prediction through traditional batch processing methods using both Apache Impala (incubating) and Apache Spark, and then finish by showing how to more dynamically predict usage by using Spark Streaming.

Of course, the starting point for any prediction is a freshly updated data feed for the historic volume for which I want to forecast future volume. In this case, I discovered that Meetup.com has a very nice data feed that can be used for demonstration purposes. You can read more about the API here, but all you need to know at this point is that it provides a steady stream of RSVP volume that we can use to predict future RSVP volume.

This is pretty dense, but it is a great look at one potential architecture leveraging Spark and several tools in the Hadoop ecosystem.
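To make the "streaming regression" idea a little more concrete, here is a minimal sketch in plain Python of a linear model that is updated one mini-batch at a time with gradient descent. It is not the article's code. The `StreamingLinearModel` class, the single feature, the learning rate, and the synthetic data are all assumptions for illustration. Spark MLlib ships a `StreamingLinearRegressionWithSGD` class built on a similar idea, though this sketch does not use it.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]  # (feature, observed volume) pairs


class StreamingLinearModel:
    """One-feature linear model y = w * x + b, updated per mini-batch."""

    def __init__(self, learning_rate: float = 0.5) -> None:
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.w * x + self.b

    def update(self, batch: Batch) -> None:
        # Average gradient of half the squared error over this batch.
        n = len(batch)
        grad_w = sum((self.predict(x) - y) * x for x, y in batch) / n
        grad_b = sum(self.predict(x) - y for x, y in batch) / n
        self.w -= self.learning_rate * grad_w
        self.b -= self.learning_rate * grad_b


def synthetic_batches(count: int) -> List[Batch]:
    # Hypothetical noiseless data: volume = 3 * x + 2, with x scaled to [0, 1].
    levels = [0.0, 0.25, 0.5, 0.75, 1.0]
    batches = []
    for t in range(count):
        xs = [levels[(t + k) % 5] for k in range(3)]
        batches.append([(x, 3.0 * x + 2.0) for x in xs])
    return batches


model = StreamingLinearModel()
for batch in synthetic_batches(2000):
    # Each arriving batch nudges the weights; no full retrain is needed.
    model.update(batch)
```

The contrast with the batch approach is that the model never sees the full history at once. It only keeps its current weights and adjusts them as each new window of data arrives.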


Spark 2.0 Technical Preview

Published 2016-05-13 by Kevin Feasel

Reynold Xin gives a preview of Apache Spark 2.0:

One thing we are proud of in Spark is creating APIs that are simple, intuitive, and expressive. Spark 2.0 continues this tradition, with a focus on two areas: (1) standard SQL support and (2) unifying the DataFrame/Dataset API.

On the SQL side, we have significantly expanded the SQL capabilities of Spark, with the introduction of a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all 99 TPC-DS queries, which require many of the SQL:2003 features. Because SQL has been one of the primary interfaces Spark applications use, these extended SQL capabilities drastically reduce the effort of porting legacy applications over to Spark.

There’s some great stuff coming out of Databricks. Spark 2.0 looks to be an exciting product.


New Ambari Version

Published 2016-05-12 by Kevin Feasel

Paul Codding announces Ambari 2.2.2:

Grafana is deployed, managed and pre-configured to work with the Ambari Metrics service. We are including a curated set of dashboards for core HDP components, giving operators at-a-glance views of the same metrics Hortonworks Support & Engineering review when helping customers troubleshoot complex issues.

Metrics displayed on each dashboard can be filtered by time, component, and contextual information (YARN queues for example) to provide greater flexibility, granularity and context.

Ambari is really shaping up to be a nice framework for managing a Hadoop cluster. I’m excited to see improved monitoring capabilities.


Spark Accumulators

Published 2016-05-12 by Kevin Feasel

Prithviraj Bose explains accumulators in Spark:

However, the logs can be corrupted. For example, the second line is a blank line, the fourth line reports some network issues and finally the last line shows a sales value of zero (which cannot happen!).

We can use accumulators to analyse the transaction log to find out the number of blank logs (blank lines), the number of times the network failed, any product that does not have a category, or even the number of times zero sales were recorded. The full sample log can be found here.

Accumulators are applicable to any operation that is:
1. Commutative -> f(x, y) = f(y, x), and
2. Associative -> f(f(x, y), z) = f(x, f(y, z))

For example, sum and max functions satisfy the above conditions whereas average does not.

Accumulators are an important way of measuring just how messy your semi-structured data is.
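The point about sum and max versus average can be checked with a small plain-Python sketch. It does not use Spark. It assumes that per-partition partial results may be merged in any order, which is the reason the operation has to be commutative and associative. The partition contents and the `merge_in_every_order` helper are made up for illustration.

```python
from functools import reduce
from itertools import permutations

# Hypothetical sales values split across three partitions.
partitions = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]


def merge_in_every_order(op, partials):
    """Merge the partial results in every possible order; collect the outcomes."""
    return {reduce(op, order) for order in permutations(partials)}


def mean(values):
    return sum(values) / len(values)


# Sum and max give a single answer no matter how the partials are merged.
sum_results = merge_in_every_order(lambda a, b: a + b, [sum(p) for p in partitions])
max_results = merge_in_every_order(max, [max(p) for p in partitions])

# Averaging partial averages depends on merge order and misses the true mean.
avg_results = merge_in_every_order(
    lambda a, b: (a + b) / 2, [mean(p) for p in partitions]
)
true_mean = mean([v for p in partitions for v in p])

# Workaround: accumulate (sum, count) pairs, which merge safely, then divide.
pair_results = merge_in_every_order(
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
    [(sum(p), len(p)) for p in partitions],
)
```

Here `sum_results` and `max_results` each hold one value, `avg_results` holds several values that all differ from `true_mean`, and the single pair in `pair_results` recovers the true mean when its sum is divided by its count.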


Installing Apache Falcon

Published 2016-05-12 by Kevin Feasel

Awanish at Edureka shows how to install Apache Falcon on your Hadoop cluster:

Apache Falcon is a framework for managing the data life cycle in Hadoop clusters. It establishes relationships between various data and processing elements in a Hadoop environment, and also provides feed management services such as feed retention, replication across clusters, and archival.

Let us first discuss how to set up Apache Falcon. Run the command below to download the Falcon Git repository:

Command: git clone https://git-wip-us.apache.org/repos/asf/falcon.git falcon

Falcon comes as part of the Hortonworks Data Platform; Cloudera has its own alternative.


Distributed Unit Testing

Published 2016-05-10 by Kevin Feasel

Cloudera shows off their distributed unit testing framework:

This distributed testing infrastructure started out as a Cloudera hackathon project in 2014. Todd Lipcon and I worked on a shared backend for running test tasks on a cluster, with Todd focusing on onboarding the Apache Kudu (incubating) tests, and myself on Apache Hadoop. Our prototype implementation reduced the runtime of the 1,700+ Hadoop unit tests from 8.5 hours to 15 minutes.

Since then, we’ve spent time improving the infrastructure and on-boarding additional projects. Besides Kudu and Hadoop, our distributed testing infrastructure is also being used by our Apache Hive and Apache HBase teams. We can now run all the Hadoop unit tests in less than 10 minutes!

Finally, we’re happy to announce that both our infrastructure and code are public! You can browse the web UI at http://dist-test.cloudera.org and see all the source code (ASLv2 licensed) at the cloudera/dist_test GitHub repository. This infrastructure is already being used at upstream Apache to run the Kudu pre-commit tests.

This is an interesting look at how to scale out unit tests. It’s a bit of a long read (especially with all the videos) but worth your time.


Microsoft Atop Hadoop Cloud Solutions

Published 2016-05-09 by Kevin Feasel

Forrester has named Microsoft a leader in the Hadoop cloud solutions space:

This week, we’re excited that Forrester recognized Microsoft Azure as a leader in their Big Data Hadoop Cloud Solutions. Apache Hadoop as a technology has become popular amongst organizations to unlock insights from data of all sizes, shapes, and speeds. Hadoop powers solutions that help businesses improve their performance, educators better connect with the needs of their students, medical professionals improve the quality of their care, and researchers accelerate new advancements in science.

As an example, Ultra Tendency uses Hadoop to achieve something not possible before – visualize more than 27 million distinct sensor readings to give Japanese citizens accurate, up-to-date information about the radiation contamination from the Fukushima nuclear plant meltdown. More and more organizations are also deploying Hadoop in the cloud with 47% of Forrester’s respondents to a 2015 survey increasing their cloud deployments either by 5-10% (37%) or more than 10% (10%).¹ This makes sense because the cloud allows you to scale elastically on demand to handle the processing of any amount of data.

AWS and IBM also have very good solutions, and Google is trying to get a stronger foothold in the cloud game.


HBase’s Failure To Catch On

Published 2016-05-09 by Kevin Feasel

Matt Asay has an interesting article on how HBase started as a big thing but has fizzled since:

Ex-Googler (and current Amazon Web Services employee) Tim Bray argues “there is a real cost to this continuous widening of the base of knowledge a developer has to have to remain relevant.” RedMonk analyst Stephen O’Grady takes this a step further: “It could be that we’re approaching the too-much-of-a-good-thing stage. In which case, the logical outcome will be a gradual slowing of fragmentation followed by gradual consolidation.”

In other words, niche data stores that do one thing really well are giving way to more generally applicable databases that can serve a broader range of enterprise needs.

The second part of Keep’s sentence above, however, spells out another reason HBase is struggling: It’s really hard to use.

I have a statement which is 90% serious and 10% joke: a database product is truly mature once it supports SQL. So what’s the answer for HBase? The current attempt at an answer is Phoenix, which is…SQL for HBase.

