Securing Kafka-To-Spark

Mark Grover explains how to secure communications between Apache Kafka and Apache Spark:

However, to read data from secure Kafka in distributed fashion, we need Hadoop-style delegation tokens in Kafka (KAFKA-1696), support for which doesn’t exist at the time of this writing (Spring 2017).

We considered various ways to solve this problem but ultimately decided that the recommended solution to read data securely from Kafka (at least until Kafka delegation tokens support is introduced) would be for the Spark application to distribute the user’s keytab so it’s accessible to the executors. The executors will then use the user’s keytab shared with them, to authenticate with the Kerberos Key Distribution Center (KDC) and read from Kafka brokers. YARN distributed cache is used for shipping and sharing the keytab to the driver and executors, from the client (that is, the gateway node). The figure below shows an overview of the current solution.

This turns out to be a bit more difficult than I would have anticipated.

Related Posts

Kafka Offset Management With Spark Streaming

Guru Medasana and Jordan Hambleton explain how to perform Kafka offset management when using Spark Streaming: Enabling Spark Streaming’s checkpoint is the simplest method for storing offsets, as it is readily available within Spark’s framework. Streaming checkpoints are purposely designed to save the state of the application, in our case to HDFS, so that it […]

Read More

Updates In Apache Kafka

Yeva Byzek announces that Apache Kafka is shipping soon: We are very excited for the GA for Kafka release which is just days away. This release is bringing many new features as described in the previous Log Compaction blog post. The most notable new feature is Exactly Once Semantics (EOS).  Kafka’s EOS capabilities […]

Read More