Exactly-Once Writes From Kafka To S3

Konstantine Karantasis takes us through writing from a Kafka topic into S3:

When customers were asking for an S3 connector, there were already several Kafka-to-S3 solutions out there at the time, so we had to decide whether to adopt an existing S3 connector, modify the Kafka Connect HDFS connector (as some developers attempted to do) or write a new connector from scratch.

We knew that our users needed three things from the connector:
1. Integration with the Kafka Connect API: Connect’s scaling and fault tolerance capabilities were important to have, and users didn’t want yet another system that they’d need to learn how to use, deploy and monitor.
2. Exactly once: Users didn’t want to waste expensive compute cycles on deduplicating their data. And no one likes missing events.
3. No extra dependencies: Especially dependencies on additional datastores. Kafka clients and the S3 SDK libraries should be all you need to get events from Kafka to S3. Simplicity rules, especially in a distributed systems world where simple is often the key to being reliable.

When we considered the existing connectors, we noticed that none of them delivered the reliability and exactly-once capabilities we wanted. They treat S3 as if it were just another file system, which it is not. For example, S3 lacks file appends, it is eventually consistent, and listing a bucket is often a very slow operation.
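To make those constraints concrete, here is a minimal sketch of one general way to get exactly-once results on a store that has no appends: write whole objects under keys derived only from the topic, partition, and starting offset, so that a replay after a failure overwrites the same objects instead of creating duplicates. This illustrates the idea and is not taken from the connector's code. The key layout, `FLUSH_SIZE`, `object_key`, `write_batches`, and the in-memory dict standing in for a bucket are all hypothetical, and the sketch assumes contiguous offsets and restarts that begin on a batch boundary.

```python
from typing import Dict, List, Tuple

FLUSH_SIZE = 3  # hypothetical: number of records packed into each object


def object_key(topic: str, partition: int, start_offset: int) -> str:
    """Build a key that depends only on topic, partition, and first offset."""
    return (
        f"topics/{topic}/partition={partition}/"
        f"{topic}+{partition}+{start_offset:010d}.json"
    )


def write_batches(
    store: Dict[str, str],
    topic: str,
    partition: int,
    records: List[Tuple[int, str]],
    from_offset: int,
) -> int:
    """Write full batches as whole objects; return the next offset to commit.

    `store` is a plain dict standing in for an S3 bucket (an assumption made
    so the sketch runs without network access). Records are (offset, value).
    """
    pending = [(o, v) for o, v in records if o >= from_offset]
    next_offset = from_offset
    # Only complete batches are written, so batch boundaries never depend on
    # when a flush happened to be triggered.
    for i in range(0, len(pending) - FLUSH_SIZE + 1, FLUSH_SIZE):
        batch = pending[i:i + FLUSH_SIZE]
        key = object_key(topic, partition, batch[0][0])
        # A whole-object put: rewriting the same key with the same bytes is
        # idempotent, which is what makes a replay harmless.
        store[key] = "\n".join(v for _, v in batch)
        next_offset = batch[-1][0] + 1
    return next_offset


# Simulate a crash after writing but before the offset commit is recorded:
# the same records are replayed from the last committed offset (0 here).
bucket: Dict[str, str] = {}
records = [(i, f'{{"event": {i}}}') for i in range(7)]
first_commit = write_batches(bucket, "events", 0, records, from_offset=0)
snapshot = dict(bucket)
replay_commit = write_batches(bucket, "events", 0, records, from_offset=0)
```

Because the replay produces the same keys and the same contents, the bucket ends up identical to the state after the first attempt. The trailing record that does not fill a batch stays unwritten until more data arrives.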

Click through for a dive into what Confluent did and how it works.

