Indexing S3 Data with CDP Data Hub

Eva Nahari, et al., show how to index and serve S3 data in Cloudera Data Platform:

This blog post will present a simple “hello world” kind of example on how to get data that is stored in S3 indexed and served by an Apache Solr service hosted in a Data Discovery and Exploration (DDE) cluster in CDP. For the curious: DDE is a pre-templated, Solr-optimized cluster deployment option in CDP, recently released in tech preview. We will only cover AWS and S3 environments in this blog. Azure and ADLS deployment options are also available in tech preview, but will be covered in a future blog post.

We will depict the simplest scenario to make it easy to get started. There are, of course, more advanced data pipeline setups and richer schemas possible, but this is a good starting point for a beginner.

Read on for the instructions.