Securely Accessing External Resources From Databricks AWS

Itai Weiss shows how you can securely hit external data sources when using Databricks for AWS:

For security purposes, Databricks Apache Spark clusters are deployed in an isolated VPC dedicated to Databricks within the customer’s account. In order to run their data workloads, there is a need to have secure connectivity between the Databricks Spark Clusters and the above data sources.

It is straightforward for Databricks clusters located within the Databricks VPC to access data from AWS S3 which is not a VPC specific service. However, we need a different solution to access data from sources deployed in other VPCs such as AWS Redshift, RDS databases, streaming data from Kinesis or Kafka. This blog will walk you through some of the options you have available to access data from these sources securely and their cost considerations for deployments on AWS. In order to establish a secure connection to these data sources, we will have to configure the Databricks VPC with either one of the following two available options :

Read on for those two options.

Related Posts

From pandas to Spark with koalas

Achilleus tries out Koalas: Python is widely used programming language when it comes to Data science workloads and Python has way too many different libraries to back this fact. Most of the data scientists are familiar with Python and pandas mostly. But the main issue with Pandas is it works great for small and medium […]

Read More

Quick Hits on Managed Instance Backup / Restore

Jovan Popovic has some pieces of advice for backing up and restoring databases on Azure SQL Managed Instances: Managed Instance takes automatic backups (full backups every week, differential every 12 hours, and log backups every 5-10 min) that you can use to restore a database to some point of time in past within the retention […]

Read More

Categories

March 2019
MTWTFSS
« Feb Apr »
 123
45678910
11121314151617
18192021222324
25262728293031