Raw files often land in S3 or HDFS as uncompressed text. This format is suboptimal both for storage cost and for running analytics on the data. S3DistCp can help you store data more efficiently by compressing files on the fly with the --outputCodec option.
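As a minimal sketch, the Python snippet below assembles such a command line. The helper name build_s3distcp_command and both paths are hypothetical placeholders chosen for illustration; only the --src, --dest, and --outputCodec options come from S3DistCp itself. The snippet only builds and prints the arguments, and it assumes the resulting command would be run on a cluster node where s3-dist-cp is installed.

```python
import shlex


def build_s3distcp_command(src, dest, output_codec):
    """Return the argument list for an s3-dist-cp copy that recompresses files.

    Hypothetical helper for illustration: it only assembles the arguments
    and does not run anything.
    """
    return [
        "s3-dist-cp",
        "--src", src,
        "--dest", dest,
        f"--outputCodec={output_codec}",
    ]


# Placeholder paths: replace them with your own HDFS directory and S3 bucket.
command = build_s3distcp_command(
    "hdfs:///data/incoming/hourly_table",
    "s3://my-bucket/incoming/hourly_table_gz/",
    "gz",
)

# Print the command in a form that can be pasted into a shell.
print(shlex.join(command))
```

On a cluster node, the same list could be handed to subprocess.run(command, check=True) instead of being printed.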
The current version of S3DistCp supports the codecs gzip, gz, lzo, lzop, and snappy, and the keywords none and keep (the default). These keywords have the following meaning:
“none” – Save files uncompressed. If the files are compressed, then S3DistCp decompresses them.
“keep” – Don’t change the compression of the files but copy them as-is.
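The effect of these values can be modeled in a few lines of Python. This is only an illustration of the semantics described above, not S3DistCp's implementation: the function name resulting_compression is hypothetical, and it assumes that naming a codec produces output in that codec regardless of how the input was compressed, which the text implies but does not spell out.

```python
# Codec names listed in the text; "none" and "keep" are handled separately.
CODECS = {"gzip", "gz", "lzo", "lzop", "snappy"}


def resulting_compression(input_compression, output_codec="keep"):
    """Model which compression a copied file ends up with.

    input_compression is None for an uncompressed file, otherwise a codec name.
    The return value follows the same convention.
    """
    if output_codec == "keep":
        # Copy as-is: the output has whatever compression the input had.
        return input_compression
    if output_codec == "none":
        # Always store uncompressed, decompressing the input if necessary.
        return None
    if output_codec in CODECS:
        # Assumption: the output is written with the requested codec.
        return output_codec
    raise ValueError(f"unsupported --outputCodec value: {output_codec!r}")


# A gzip-compressed input under each kind of setting.
kept = resulting_compression("gzip")            # default "keep"
plain = resulting_compression("gzip", "none")   # decompressed
recoded = resulting_compression(None, "snappy")  # compressed on the fly
```

Here kept stays "gzip", plain becomes None (uncompressed), and recoded becomes "snappy".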