Relational Data In Data Lakes

Shankar Selvam discusses one company’s tool for bringing relational data into a data lake:

The next step in building this pipeline is to configure the sink or destination for the imported data. Hydrator provides capabilities to store data in time-partitioned directories via a built-in CDAP Dataset called Time-partitioned File Set.  Once the data is stored in the fileset, CDAP automatically adds a partition which can be queried using Hive.

In this use case we will configure a Time-partitioned File Set that stores data in Avro format by usingTPFSAvro as the sink.

I like the fact that there’s a UI for this.  Between this tool and NiFi, the Hadoop ecosystem is getting some tools to make data migration easier to understand, and I think that will help adoption.

Related Posts

Flint: Time Series With Spark

Li Jin and Kevin Rasmussen cover the concepts of Flint, a time-series library built on Apache Spark: Time series analysis has two components: time series manipulation and time series modeling. Time series manipulation is the process of manipulating and transforming data into features for training a model. Time series manipulation is used for tasks like data […]

Read More

ElasticMapReduce And RStudio

Tanzir Musabbir demonstrates how to set up Amazon ElasticMapReduce to include an RStudio edge node: RStudio Server provides a browser-based interface for R and a popular tool among data scientists. Data scientist use Apache Spark cluster running on  Amazon EMR to perform distributed training. In a previous blog post, the author showed how you can install RStudio Server on Amazon […]

Read More


June 2016
« May Jul »