Kai Waehner has a nice architectural post on using Kafka as the focal point for machine learning training and prediction:
The essence of this architecture is that it uses Kafka as an intermediary between the various data sources from which feature data is collected, the model building environment where the model is fit, and the production application that serves predictions.
Feature data is pulled into Kafka from the various apps and databases that host it. This data is used to build models. The environment for this will vary based on the skills and preferred toolset of the team. The model building could be a data warehouse, a big data environment like Spark or Hadoop, or a simple server running python scripts. The model can be published where the production app that gets the same model parameters can apply it to incoming examples (perhaps using Kafka Streams to help index the feature data for easy usage on demand). The production app can either receive data from Kafka as a pipeline or even be a Kafka Streams application itself.
This is approximately 80% of my interests wrapped up in one post, so of course I’m going to read it…