Getting Started With Apache Flume

Kevin Feasel



Mark Litwintschik takes us through installation and configuration of Apache Flume:

The following was run on a fresh Ubuntu 16.04.2 LTS installation. The machine I’m using has an Intel Core i5 4670K clocked at 3.4 GHz, 8 GB of RAM and 1 TB of mechanical storage capacity.

First I’ve set up a standalone Hadoop environment following the instructions from my Hadoop 3 installation guide. Below I’ve installed Kafkacat for feeding and reading off of Kafka, libsnappy (as I’ll be using Snappy compression on the Kafka topics), Python, Screen for running applications in the background, and ZooKeeper, which Kafka uses for coordination.
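The package installation described above can be sketched roughly as follows. This is a setup fragment, not a runnable test: package names are the usual Ubuntu 16.04 ones, and the exact list in Mark's guide may differ.

```shell
# Install the supporting tools mentioned above (Ubuntu 16.04).
# kafkacat: command-line producer/consumer for Kafka
# libsnappy-dev: Snappy compression library used by the Kafka topics
# screen: keep long-running processes alive in the background
# zookeeperd: ZooKeeper daemon, used by Kafka for coordination
sudo apt-get update
sudo apt-get install -y kafkacat libsnappy-dev python screen zookeeperd

# Example of feeding and reading a topic with kafkacat
# (assumes a broker on localhost:9092 and a topic named "events"):
echo '{"id": 1}' | kafkacat -P -b localhost:9092 -t events -z snappy
kafkacat -C -b localhost:9092 -t events -o beginning -e
```

The `-z snappy` flag asks the kafkacat producer to compress batches with Snappy, matching the compression choice called out in the excerpt.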

From there, Mark has the configuration scripts and processes to get the entire pipeline built.
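For a flavor of what a Flume configuration in such a pipeline looks like, here is a minimal sketch of a Flume agent that tails a log file into a Kafka topic. The agent name, file path, and topic are illustrative assumptions, not taken from Mark's article; the property keys are the standard ones from the Flume user guide.

```properties
# Minimal Flume agent: tail a file, buffer in memory, publish to Kafka.
# "agent1", /tmp/app.log, and the topic name are hypothetical.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: follow a log file with the exec source
agent1.sources.src1.type     = exec
agent1.sources.src1.command  = tail -F /tmp/app.log
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer between source and sink
agent1.channels.ch1.type                = memory
agent1.channels.ch1.capacity            = 1000
agent1.channels.ch1.transactionCapacity = 100

# Sink: Kafka, with Snappy compression on the producer
agent1.sinks.sink1.type                            = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.sink1.kafka.bootstrap.servers         = localhost:9092
agent1.sinks.sink1.kafka.topic                     = flume_events
agent1.sinks.sink1.kafka.producer.compression.type = snappy
agent1.sinks.sink1.channel                         = ch1
```

An agent with this configuration would be started with `flume-ng agent --name agent1 --conf-file flume.conf`, after which messages could be inspected with kafkacat.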



February 2019