Securing S3 Credentials In Spark Jobs

Jason Pohl shows how to protect credentials for connecting to Amazon Web Services S3 buckets when building Spark jobs:

Since Apache Spark separates compute from storage, every Spark Job requires a set of credentials to connect to disparate data sources. Storing those credentials in the clear can be a security risk if not stringently administered. To mitigate that risk, Databricks makes it easy and secure to connect to S3 with either Access Keys via DBFS or by using IAM Roles. For all other data sources (Kafka, Cassandra, RDBMS, etc.), the sensitive credentials must be managed by some other means.

This blog post will describe how to leverage an IAM Role to map to any set of credentials. It will leverage AWS's Key Management Service (KMS) to encrypt and decrypt the credentials so that your credentials are never in the clear at rest or in flight. When a Databricks Cluster is created using the IAM Role, it will have privileges to both read the encrypted credentials from an S3 bucket and decrypt the ciphertext with a KMS key.
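For the S3 connection itself, here is a minimal sketch of the two styles the excerpt mentions, assuming the s3a connector; the bucket name and credential values are my own placeholders, not anything from Jason's post:

```python
# Minimal sketch: connecting Spark to S3 (placeholder bucket and keys).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-credentials-demo").getOrCreate()

# Option 1: access keys set on the Hadoop configuration. These sit in
# cluster configuration in the clear, which is the risk the post targets.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<ACCESS_KEY_ID>")
hadoop_conf.set("fs.s3a.secret.key", "<SECRET_ACCESS_KEY>")

# Option 2: with an IAM Role attached to the cluster, no keys are set at
# all; the s3a connector picks up the role's credentials automatically.
df = spark.read.csv("s3a://my-example-bucket/input.csv", header=True)
```

And the encrypt/decrypt flow the post describes could be sketched with boto3 roughly as follows; again, the bucket, object key, and KMS key alias are illustrative assumptions of mine:

```python
# Hedged sketch of the KMS-backed credential flow: encrypt a credential
# set, park the ciphertext in S3, and decrypt it from the cluster.
import json
import boto3

kms = boto3.client("kms")
s3 = boto3.client("s3")

KMS_KEY_ID = "alias/credentials-key"   # placeholder KMS key alias
BUCKET = "my-secure-bucket"            # placeholder bucket
OBJECT_KEY = "secrets/cassandra.json"  # placeholder object key

def store_credentials(creds: dict) -> None:
    """Encrypt credentials with KMS and write only the ciphertext to S3.

    KMS encrypt accepts up to 4 KB of plaintext, which is ample for a
    credential set."""
    ciphertext = kms.encrypt(
        KeyId=KMS_KEY_ID,
        Plaintext=json.dumps(creds).encode("utf-8"),
    )["CiphertextBlob"]
    s3.put_object(Bucket=BUCKET, Key=OBJECT_KEY, Body=ciphertext)

def load_credentials() -> dict:
    """Read the ciphertext from S3 and decrypt it with KMS.

    The cluster's IAM Role needs s3:GetObject on the bucket and
    kms:Decrypt on the key, matching the privileges the post describes."""
    blob = s3.get_object(Bucket=BUCKET, Key=OBJECT_KEY)["Body"].read()
    plaintext = kms.decrypt(CiphertextBlob=blob)["Plaintext"]
    return json.loads(plaintext)
```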
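Because only ciphertext ever touches disk or the network, the plaintext credentials exist solely in the memory of a cluster that holds the right IAM Role, which is the whole point of the approach.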

That’s only one data source, but an important one.

