S3 And HDFS Data Migration

Ilya Yalovyy looks at S3DistCp, which allows you to efficiently migrate data back and forth between HDFS and S3:

Raw files often land in S3 or HDFS in an uncompressed text format. This format is suboptimal both for the cost of storage and for running analytics on that data. S3DistCp can help you efficiently store data and compress files on the fly with the --outputCodec option:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_filtered --dest s3://my-tables/incoming/hourly_table_gz --outputCodec=gz

The current version of S3DistCp supports the codecs gzip, gz, lzo, lzop, and snappy, and the keywords none and keep (the default). These keywords have the following meanings:

  • “none” – Save files uncompressed. If the files are compressed, then S3DistCp decompresses them.

  • “keep” – Don’t change the compression of the files but copy them as-is.
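For example, the same option can reverse the compression: copying with --outputCodec=none decompresses the gzipped files back to plain text. A sketch of such a copy (the destination prefix hourly_table_text here is a placeholder, not from the original post) might look like this:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_gz --dest s3://my-tables/incoming/hourly_table_text --outputCodec=none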

This is an important article if you’ve got a Hadoop cluster running on EC2 nodes.

