Incorporating NiFi Into Brownfield Code

Kevin Feasel

2016-06-07

ETL, Hadoop

Paul Boal discusses how he incorporated Apache NiFi in an existing process:

Typically, data warehousing and ETL tool vendors recommended that we write your own custom components. After all, the target market for ETL tools is a space where the tools are specifically marketed as reducing the need for “error prone and time consuming” manual coding. When I ran across this tutorial on writing your own NiFi processor it occurred to me that NiFi is the exact opposite. It’s both Open Source and designed for extensibility from the ground up. I found it quite reasonable to write a custom NiFi processor that leverages our existing code base.

The existing code is a Java program with separate classes for each device vendor, all with the same interface to abstract the nuances of each vendor from the main data export program. This interface follows a traditional paradigm: login, query, query, query, logout. Given that my input to NiFi above takes in simple username, password, and query criteria arguments, it seems trivial to create a NiFi processor class that adapts the existing code into the NiFi API. Here’s a slightly abbreviated version of the actual code. (In reality, it’s all of 70 lines of code.)

In almost any realistic scenario, you’re not going to have the opportunity to start from scratch.  You will always have legacy components, external dependencies, and existing user bases to satisfy.  I like this article because it moves forward from that starting point.

Related Posts

Getting Started With Zeppelin

Sangeeta Gulia shows us how to get started building notebooks with Apache Zeppelin on top of Spark: There are 3 interpreter modes available in Zeppelin. 1) Shared Mode In Shared mode, a SparkContext and a Scala REPL is being shared among all interpreters in the group. So every Note will be sharing single SparkContext and single […]

Read More

How Per-Second AWS Billing Helps With Data Processing

Prakash Chockalingam explains how AWS per-second billing can make resource allocation easier: Because of the hourly increments in billing, users spend a lot of time playing a giant game of Tetris with their big data workloads — figuring out how to pack jobs to use every minute of the compute hour. Examples: If a job […]

Read More

Categories

June 2016
MTWTFSS
« May Jul »
 12345
6789101112
13141516171819
20212223242526
27282930