Securing Kafka-To-Spark

Mark Grover explains how to secure communications between Apache Kafka and Apache Spark:

However, to read data from secure Kafka in distributed fashion, we need Hadoop-style delegation tokens in Kafka (KAFKA-1696), support for which doesn’t exist at the time of this writing (Spring 2017).

We considered various ways to solve this problem but ultimately decided that the recommended solution to read data securely from Kafka (at least until Kafka delegation tokens support is introduced) would be for the Spark application to distribute the user’s keytab so it’s accessible to the executors. The executors will then use the user’s keytab shared with them, to authenticate with the Kerberos Key Distribution Center (KDC) and read from Kafka brokers. YARN distributed cache is used for shipping and sharing the keytab to the driver and executors, from the client (that is, the gateway node). The figure below shows an overview of the current solution.

This turns out to be a bit more difficult than I would have anticipated.

Related Posts

Avro Schemas In Kafka

Stephane Maarek explains the value of using Apache Avro as a schema structure for your Kafka topics: Avro has support for primitive types ( int, string, long, bytes, etc…), complex types (enum, arrays, unions, optionals), logical types (dates, timestamp-millis, decimal), and data record (name and namespace). All the types you’ll ever need. Avro has support for embedded documentation. Although documentation is optional, in my workflow I […]

Read More

When Spark Meets Hive

Anna Martin and Rosaria Silipo look at combining HiveQL and SparkQL: We set our goal here to investigate the age distribution of Maine residents, men and women, using SQL queries. But the question is… on Apache Hive or on Apache Spark? Well, why not both? We could use SparkSQL to extract men’s age distribution and […]

Read More

Categories