Cheng Xu uses Apache Commons Crypto to secure data when Spark shuffles off to disk:
The basic steps can be described as follows:
- When a Spark job starts, it generates encryption keys and stores them in the current user's credentials, which are shared with all executors.
- When a shuffle happens, the shuffle writer first compresses the plaintext if compression is enabled. Spark then uses a randomly generated initialization vector (IV) and the keys obtained from the credentials to encrypt the plaintext with `CryptoOutputStream` from Commons Crypto.
- `CryptoOutputStream` encrypts the shuffle data and writes it to disk as it arrives. The first 16 bytes of the encrypted output file are reserved for the IV.
- On the read path, the first 16 bytes are read back to initialize the IV, which is provided to `CryptoInputStream` along with the user's credentials. The decrypted data is then handed to Spark's shuffle mechanism for further processing (a sketch of both paths follows below).
Once you have things optimized (Commons Crypto can use OpenSSL's AES-NI-accelerated ciphers via JNI), the performance hit is surprisingly small.
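For reference, this capability is exposed in recent Spark versions as built-in IO encryption. Assuming a version where these properties are available, enabling it might look like the following; check your version's configuration docs for defaults and valid values:

```java
import org.apache.spark.SparkConf;

public class EncryptedShuffleConf {
    public static SparkConf build() {
        // Enable Spark's built-in encryption for shuffle data spilled to local disk
        // (backed by Commons Crypto under the hood).
        return new SparkConf()
                .set("spark.io.encryption.enabled", "true")      // default: false
                .set("spark.io.encryption.keySizeBits", "256");  // default: 128
    }
}
```

The same properties can also go in `spark-defaults.conf`; Spark then generates the per-application key and distributes it through the user's credentials, exactly as described in the steps above.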