Relational Data In Data Lakes

Shankar Selvam discusses one company’s tool for bringing relational data into a data lake:

The next step in building this pipeline is to configure the sink or destination for the imported data. Hydrator provides capabilities to store data in time-partitioned directories via a built-in CDAP Dataset called Time-partitioned File Set.  Once the data is stored in the fileset, CDAP automatically adds a partition which can be queried using Hive.

In this use case we will configure a Time-partitioned File Set that stores data in Avro format by usingTPFSAvro as the sink.

I like the fact that there’s a UI for this.  Between this tool and NiFi, the Hadoop ecosystem is getting some tools to make data migration easier to understand, and I think that will help adoption.

Related Posts

Testing Kafka Streams Applications

Yeva Byzek continues her series on testing Kafka-based streaming applications: When you create a stream processing application with Kafka’s Streams API, you create a Topologyeither using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing though, connecting to a running […]

Read More

Auto ML With SQL Server 2019 Big Data Clusters

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem: H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, […]

Read More

Categories

June 2016
MTWTFSS
« May Jul »
 12345
6789101112
13141516171819
20212223242526
27282930