Securing Spark Shuffle

Cheng Xu uses Apache Commons Crypto to secure data when Spark shuffles off to disk:

The basic steps can be described as follows:

  1. When a Spark job starts, it will generate encryption keys and store them in the current user’s credentials, which are shared with all executors.

  2. When shuffle happens, the shuffle writer will first compress the plaintext if compression is enabled. Spark will use the randomly generated Initial Vector (IV) and keys obtained from the credentials to encrypt the plaintext by using CryptoOutputStream from Crypto.

  3. CryptoOutputStream will encrypt the shuffle data and write it to the disk as it arrives. The first 16 bytes of the encrypted output file are preserved to store the initial vector.

  4. For the read path, the first 16 bytes are used to initialize the IV, which is provided to CryptoInputStreamalong with the user’s credentials. The decrypted data is then provided to Spark’s shuffle mechanism for further processing.

Once you have things optimized, the performance hit is surprisingly small.

Related Posts

Always Encrypted With Secure Enclaves

Jakub Szymaszek announces secure enclaves support with Always Encrypted in SQL Server 2019: The only operation SQL Server 2016 and 2017 support on encrypted database columns is equality comparison, providing you use deterministic encryption. For anything else, your apps need to download the data to perform the computations outside of the database. Similarly, if you […]

Read More

Summarizing Improvements In Spark 2.4

Anmol Sarna summarizes Apache Spark 2.4 and pushes his meme game at the same time: The next major enhancement was the addition of a lot of new built-in functions, including higher-order functions, to deal with complex data types easier.Spark 2.4 introduced 24 new built-in functions, such as  array_union, array_max/min, etc., and 5 higher-order functions, such as transform, filter, etc.The entire […]

Read More

Categories

August 2016
MTWTFSS
« Jul Sep »
1234567
891011121314
15161718192021
22232425262728
293031