S3 And HDFS Data Migration

Ilya Yalovyy looks at S3DistCp, which allows you efficiently to migrate data back and forth between HDFS and S3:

Raw files often land in S3 or HDFS in an uncompressed text format. This format is suboptimal both for the cost of storage and for running analytics on that data. S3DistCp can help you efficiently store data and compress files on the fly with the --outputCodec option:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_filtered --dest s3://my-tables/incoming/hourly_table_gz --outputCodec=gz

The current version of S3DistCp supports the codecs gzip, gz, lzo, lzop, and snappy, and the keywords none and keep (the default). These keywords have the following meaning:

  • none” – Save files uncompressed. If the files are compressed, then S3DistCp decompresses them.

  • keep” – Don’t change the compression of the files but copy them as-is.

This is an important article if you’ve got a Hadoop cluster running on EC2 nodes.

Related Posts

What’s New In Ambari 2.7

Paul Codding and Kat Petre share some of the new features in Ambari 2.7: With this release, we wanted to make Ambari more enjoyable to use every day, simplify finding and using our API, and unblock teams managing very large clusters.  Here is a preview of a few features we’re excited to share with you:Revamped […]

Read More

Tips On Running SQL Server In RDS

Matthew McGiffen shares some tips on running SQL Server in Amazon RDS: Or you can go with Amazon RDS (Relational Database Service).  This is more of a managed service where Amazon looks after some aspects of your database server for you. In return you give up some of the control you would have with your […]

Read More

Categories

June 2017
MTWTFSS
« May Jul »
 1234
567891011
12131415161718
19202122232425
2627282930