Data Governance On Apache Kafka With Lenses

Kevin Feasel

2018-05-22

Hadoop

Antonios Chalkiopoulos explains how Landoop’s Lenses product helps with data governance:

One of the fundamental requirements of GDPR is the Right to Retrieve Personal Data.

With Lenses SQL the above requirement can be covered via a set of simple but thorough queries into the topics that contain PII data:

SELECT * from topicA WHERE customer.id = "XXX"

Lenses will retrieve and deserialize the data from a binary format (e.g. Avro) into a human-readable format and provide full Control Execution.

Control Execution brings into context the fact that streaming SQL is operating on unbounded streams of events: a query would normally be a never-ending query. In order to bring query termination semantics into Apache Kafka we introduced 4 controls:

  • LIMIT 10000 – Force the query to terminate when 10,000 records are matched

  • max.bytes = 20000000 – Force the query to terminate once 20 MBytes have been retrieved

  • max.time = 60000 – Force the query to terminate after 60 seconds

  • max.zero.polls = 8 – Force the query to terminate after 8 consecutive polls are empty, indicating we have exhausted a topic
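To make the four controls concrete, here is a minimal Python sketch of a bounded read loop over an unbounded source. It is not Lenses code and does not use any Lenses or Kafka API: `run_bounded_query`, `poll`, and `matches` are hypothetical names, and details such as counting bytes before filtering are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


def run_bounded_query(
    poll: Callable[[], List[bytes]],    # hypothetical: returns the next batch, [] if none
    matches: Callable[[bytes], bool],   # hypothetical: stands in for the WHERE clause
    limit: int = 10_000,                # LIMIT
    max_bytes: int = 20_000_000,        # max.bytes
    max_time_ms: int = 60_000,          # max.time
    max_zero_polls: int = 8,            # max.zero.polls
) -> Dict[str, object]:
    """Read batches until one of the four termination controls fires."""
    records: List[bytes] = []
    bytes_read = 0
    zero_polls = 0
    started = time.monotonic()

    while True:
        # max.time: stop once the wall-clock budget is used up.
        if (time.monotonic() - started) * 1000 >= max_time_ms:
            return {"records": records, "reason": "max.time"}

        batch = poll()
        if not batch:
            # max.zero.polls: several empty polls in a row suggest the
            # topic has been exhausted.
            zero_polls += 1
            if zero_polls >= max_zero_polls:
                return {"records": records, "reason": "max.zero.polls"}
            continue
        zero_polls = 0  # only consecutive empty polls count

        for record in batch:
            # Assumption: every retrieved record counts toward max.bytes,
            # whether or not it matches the filter.
            bytes_read += len(record)
            if matches(record):
                records.append(record)
                # LIMIT: stop once enough matching records are collected.
                if len(records) >= limit:
                    return {"records": records, "reason": "LIMIT"}
            if bytes_read >= max_bytes:
                return {"records": records, "reason": "max.bytes"}
```

Whichever control fires first ends the query, so a filter that matches very few records still terminates through the byte, time, or empty-poll limits.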

GDPR implementation is a lot trickier for a system like Kafka, but it’s still possible.
