Cassandra To Kafka Connect

Kevin Feasel

2018-03-19

Hadoop

Mike Barlotta shows how to feed data into Kafka from Cassandra via Kafka Connect.  Part one involves basic setup:

Modeling data in Cassandra must be done around the queries that are needed to access the data (see this article for details). Typically this means that there will be one table for each query and data (in our case about the pack) will be duplicated across numerous tables.

Regardless of the other tables used for the product, the Cassandra Source connector needs a table that will allow us to query for data using a time range. The connector is designed around its ability to generate a CQL query based on configuration. It uses this query to retrieve data from the table that is available within a configurable time range. Once all of this data has been published, Kafka Connect will mark the upper end of the time range as an offset. The connector will then query the table for more data using the next time range starting with the date/time stored in the offset. We will look at how to configure this later. For now we want to focus on the constraints for the table. Since Cassandra doesn’t support joins, the table we are pulling data from must have all of the data that we want to put onto the Kafka topic. Data in other tables will not be available to Kafka Connect.

Part 2 is around tuning the connector:

One of the problems we initially had with the Cassandra Source connector was how much data it tried to process during one polling cycle. In the original versions (0.2.5 and 0.2.6) the connector would retrieve all of the data that was inserted since the last polling cycle. For systems ingesting large amounts of data this can pose a challenge.

Our logs showed that it took 6 hours to retrieve and publish 6.8 million rows of data.

The problem (or one of them) with this slow rate of ingestion was that the table was continuing to have new data inserted into it while the connector was processing the data it had retrieved. With data being added to the table faster than it was being published the connector was getting behind. Worse there was no opportunity for it to ever catch up, until there was a lull in receiving new data.

If you’re using Cassandra, this looks like a rather useful connector.

Related Posts

Push-Based Alerting With Kafka Streams

Robin Moffatt shows how to take syslog data and create a notification app using Python and Kafka Streams: Now we can query from it and show the aggregate window timestamp alongside the result: ksql> SELECT ROWTIME, TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss'), \ HOST, INVALID_LOGIN_COUNT \ FROM INVALID_USERS_LOGINS_PER_HOST; 1521644100000 | 2018-03-21 14:55:00 | rpi-03 | 1 1521646620000 | […]

Read More

Spark Architecture: The Spark Streaming Receiver

Oleksii Yermolenko gives us an overview of the Receiver object in Spark Streaming: The key component of Spark streaming application is called Receiver. It is responsible for opening new connections with the sources, listening events from them and aggregating incoming data within the memory. If receiver’s worker node is running out of memory, it starts using disk […]

Read More

Categories

March 2018
MTWTFSS
« Feb Apr »
 1234
567891011
12131415161718
19202122232425
262728293031