Untangling Kafka APIs

Kevin Feasel

2018-10-30

Hadoop

Stephane Maarek helps us make sense of when to use which Kafka API:

I identify 5 types of workloads in Apache Kafka, and in my opinion each corresponds to a specific API:

  • Kafka Producer API: Applications directly producing data (ex: clickstream, logs, IoT).

  • Kafka Connect Source API: Applications bridging between a datastore we don’t control and Kafka (ex: CDC, Postgres, MongoDB, Twitter, REST API).

  • Kafka Streams API / KSQL: Applications wanting to consume from Kafka and produce back into Kafka, also called stream processing. Use KSQL if you think you can write your real-time job in a SQL-like way; use the Kafka Streams API if you think you’re going to need to write complex logic for your job.

  • Kafka Consumer API: Read a stream and perform real-time actions on it (e.g. send email…)

  • Kafka Connect Sink API: Read a stream and store it into a target store (ex: Kafka to S3, Kafka to HDFS, Kafka to PostgreSQL, Kafka to MongoDB, etc.)

Stephane then goes into detail on each of these.
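
To make the decision rule in the list above concrete, here is a minimal sketch in Python. The `KafkaApi` enum, the `choose_api` function, and its parameters are hypothetical names invented for illustration; they are not part of any Kafka client library, and the branching is just one reading of the five workload types, not Stephane's own code.

```python
from enum import Enum


class KafkaApi(Enum):
    """The API choices named in the list above (labels only, no client code)."""
    PRODUCER = "Kafka Producer API"
    CONNECT_SOURCE = "Kafka Connect Source API"
    STREAMS = "Kafka Streams API"
    KSQL = "KSQL"
    CONSUMER = "Kafka Consumer API"
    CONNECT_SINK = "Kafka Connect Sink API"


def choose_api(reads_from_kafka, writes_to_kafka,
               external_store=False, sql_like=False):
    """Hypothetical helper: map a workload description to a Kafka API.

    reads_from_kafka / writes_to_kafka: which direction the data flows.
    external_store: the other end is a datastore (e.g. Postgres, S3)
        rather than application logic. Assumption made for this sketch.
    sql_like: the stream-processing job can be expressed as SQL-like queries.
    """
    if reads_from_kafka and writes_to_kafka:
        # Kafka in, Kafka out: stream processing.
        return KafkaApi.KSQL if sql_like else KafkaApi.STREAMS
    if writes_to_kafka:
        # Data enters Kafka: bridge from a datastore, or produce directly.
        return KafkaApi.CONNECT_SOURCE if external_store else KafkaApi.PRODUCER
    if reads_from_kafka:
        # Data leaves Kafka: store it in a target, or act on it in real time.
        return KafkaApi.CONNECT_SINK if external_store else KafkaApi.CONSUMER
    raise ValueError("workload neither reads from nor writes to Kafka")
```

For example, `choose_api(reads_from_kafka=True, writes_to_kafka=False, external_store=True)` returns `KafkaApi.CONNECT_SINK`, matching the "Kafka to S3" case in the list.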
