Securing S3 Credentials In Spark Jobs

Jason Pohl shows how to protect credentials for connecting to Amazon Web Services S3 buckets when building Spark jobs:

Since Apache Spark separates compute from storage, every Spark Job requires a set of credentials to connect to disparate data sources. Storing those credentials in the clear can be a security risk if not stringently administered. To mitigate that risk, Databricks makes it easy and secure to connect to S3 with either Access Keys via DBFS or by using IAM Roles. For all other data sources (Kafka, Cassandra, RDBMS, etc.), the sensitive credentials must be managed by some other means.

This blog post will describe how to leverage an IAM Role to map to any set of credentials. It will leverage the AWS’s Key Management Service (KMS) to encrypt and decrypt the credentials so that your credentials are never in the clear at rest or in flight. When a Databricks Cluster is created using the IAM Role, it will have privileges to both read the encrypted credentials from an S3 bucket and decrypt the ciphertext with a KMS key.

That’s only one data source, but an important one.

Related Posts

Faster User-Defined Functions In SparkR

Liang Zhang and Hossein Falaki note a major performance improvement for functions in SparkR using the latest version of the Databricks Runtime: SparkR offers four APIs that run a user-defined function in R to a SparkDataFrame dapply() dapplyCollect() gapply() gapplyCollect() dapply() allows you to run an R function on each partition of the SparkDataFrame and returns […]

Read More

Database Ownership Chaining On Azure SQL Managed Instances

Jovan Popovic shows that you can enable database ownership chaining on Azure SQL Managed Instances: If you have the same owner on several objects in several databases, and you have some stored procedure that access these objects, you don’t need to GRANT access permission to every object that the procedure needs to access. If the […]

Read More


June 2017
« May Jul »