RDBMS To Hive Via Kafka

Kevin Feasel


ETL, Hadoop

Rajesh Nadipalli shows how to use Kafka to read relational database data and feed it to Hive:

Processes that publish messages to a Kafka topic are called “producers.” “Topics” are feeds of messages in categories that Kafka maintains. The transactions from RDBMS will be converted to Kafka topics. For this example, let’s consider a database for a sales team from which transactions are published as Kafka topics. The following steps are required to set up the Kafka producer

I’d call this a non-trivial but still straightforward exercise.  Step 1 from the SQL Server side could be reading from transaction logs (which would be the least-intrusive), but you could also set up something like change tracking and fire off messages when important tables’ records change.

Related Posts

Leveraging Hive In Pyspark

Fisseha Berhane shows how to use Spark to connect Python to Hive: If we are using earlier Spark versions, we have to use HiveContext which is variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support. In this […]

Read More

Stream Reactor Update

Andrew Stevenson announces Stream Reactor 1.0.0 for Kafka Connect 1.0: Stream Reactor is an Apache License, Version 2.0 open source collection of components built on top of Kafka and provides Kafka Connect compatible connectors to move data between Kafka and popular data stores. Stream Reactor provides source connectors to publish data into Kafka and sink connectorsto bring data from Kafka […]

Read More


August 2016
« Jul Sep »