S3 And HDFS Data Migration

Ilya Yalovyy looks at S3DistCp, which allows you efficiently to migrate data back and forth between HDFS and S3:

Raw files often land in S3 or HDFS in an uncompressed text format. This format is suboptimal both for the cost of storage and for running analytics on that data. S3DistCp can help you efficiently store data and compress files on the fly with the --outputCodec option:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_filtered --dest s3://my-tables/incoming/hourly_table_gz --outputCodec=gz

The current version of S3DistCp supports the codecs gzip, gz, lzo, lzop, and snappy, and the keywords none and keep (the default). These keywords have the following meaning:

  • none” – Save files uncompressed. If the files are compressed, then S3DistCp decompresses them.

  • keep” – Don’t change the compression of the files but copy them as-is.

This is an important article if you’ve got a Hadoop cluster running on EC2 nodes.

Related Posts

Optimizing Apache Flink

Ivan Mushketyk has a few tips for speeding up programs using Apache Flink: One more way to optimize your Flink application is to provide some information about what your user-defined functions are doing with input data. Since Flink can’t parse and understand code, you can provide crucial information that will help to build a more […]

Read More

Installing The Azure ML Workbench

Leila Etaati walks us through setting up the Azure ML workbench: In Microsoft ignite 2017, Azure ML team announce new on-premises tools for doing machine learning. this tools much more comprehensive as it provides 1- a workspace helps data wrangling 2- Data Visualization 3-Easy to deploy 4-Support Python codes in this post and next posts, I […]

Read More

Categories

June 2017
MTWTFSS
« May Jul »
 1234
567891011
12131415161718
19202122232425
2627282930