Integrating Custom Data Sources Into Spark

Nicolas A Perez builds a custom Spark streaming data source:

We first receive the order ID and the total amount of the order, and then we receive the line items of the order. The first value is the item ID, the second is the order ID, (which matches the order ID value) and then the cost of the item. In this example, we have two orders. The first one has four items and the second one has only one item.

The idea is to hide all of this from our Spark application, so what it receives on the DStream is a complete order defined on a stream as follows:

Check out this practical application of Spark Streaming.

Related Posts

Replicating Data In HDFS Between Clusters

Murali Ramasami and Niru Anisetti have an article showing how to use the Hortonworks Data Lifecycle Manager to set up replication between two Hadoop clusters: Data Lifecycle Manager (DLM) delivers on the promise of location-agnostic, secure replication by encapsulating and copying data seamlessly across physical private storage and public cloud environments. This empowers businesses to […]

Read More

Installing Confluent Platform On Windows

Niels Berglund shows how to install Confluent Platform (the Confluent branded version of Apache Kafka) on a Windows machine using the Windows Subsystem for Linux: WSL is primarily aimed at developers, and it allows you to run Linux environments directly on Windows in a native format and without the overhead of a virtual machine. Let us […]

Read More

Categories

May 2016
MTWTFSS
« Apr Jun »
 1
2345678
9101112131415
16171819202122
23242526272829
3031