Category: Spark

Change Data Capture in Delta Lake

Surya Sai Turaga and John O’Dwyer take us through change data capture in Delta Lake:

Change data capture (CDC) is a use case that we see many customers implement in Databricks – you can check out our previous deep dive on the topic here. Typically we see CDC used in an ingestion to analytics architecture called the medallion architecture. The medallion architecture takes raw data landed from source systems and refines the data through bronze, silver and gold tables. CDC and the medallion architecture provide multiple benefits to users since only changed or added data needs to be processed. In addition, the different tables in the architecture allow different personas, such as Data Scientists and BI Analysts, to use the correct up-to-date data for their needs. We are happy to announce the exciting new Change Data Feed (CDF) feature in Delta Lake that makes this architecture simpler to implement and the MERGE operation and log versioning of Delta Lake possible!

Read on to gain an understanding of how it works.

Announcements from Data+AI Summit

Ryan Boyd summarizes Databricks announcements:

The Delta Lake open source project is a key enabler of the lakehouse, as it fixes many of the limitations of data lakes: data quality, performance and governance. The project has come a long way since its initial release, and the Delta Lake 1.0 release was just certified by the community. The release represents a variety of new features, including generated columns, cloud independence with multi-cluster writes, and my favorite: Delta Lake standalone, which reads from Delta tables but doesn’t require Apache Spark™.

We also announced a bunch of new committers to the Delta Lake project, including QP Hou, R.Tyler Croy, Christian Williams, Mykhailo Osypov and Florian Valeye.

Learn more about Delta Lake 1.0 in the keynotes from co-creator and Distinguished Engineer Michael Armbrust.

Read on for a variety of announcements in this vein.

Securing Databricks on AWS

Andrew Weaver, et al., take us through security practices for running Databricks on AWS:

In this article, we will share a list of cloud security features and capabilities that an enterprise data team can use to harden their Databricks environment on AWS as per their risk profile and governance policy. For more information about how Databricks runs on Amazon Web Services (AWS), view the AWS web page and Databricks security on AWS page for more specific details on security and compliance.

Click through for that list.

Azure Synapse Analytics Supports Apache Spark 3.0

Euan Garden has some great news for us:

Starting today, the Apache Spark 3.0 runtime is now available in Azure Synapse. This version builds on top of existing open source and Microsoft specific enhancements to include additional unique improvements listed below. The combination of these enhancements results in a significantly faster processing capability than the open-source Spark 3.0.2 and 2.4.

The public preview announced today starts with the foundation based on the open-source Apache Spark 3.0 branch with subsequent updates leading up to a Generally Available version derived from the latest 3.1 branch.

It still won’t be as fast as Databricks, but it should be a good bit faster than the Spark 2 they were running.

Broadcast Variables in Apache Spark

The Hadoop in Real World team explains the notion of broadcast variables in Apache Spark:

Broadcast variables are variables which are available in all executors executing the Spark application. These variables are already cached and ready to be used by tasks executing as part of the application. Broadcast variables are sent to the executors only once and they are available for all tasks executing in the executors.

Read on to understand when they are useful and, just as importantly, when not to use them. They seem like the type of thing which a newer developer could easily misuse.

reduceByKey and aggregateByKey in Spark

The Hadoop in Real World team compares two functions against RDDs in Spark:

Let’s examine the aggregateByKey below. The first parameter, 0, is the initial value and also indicates the type of the output.

The first _+_ function is the map-side combine and the second _+_ function is the reduce-side combine. Both functions are the same in this case.
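
The excerpt refers to a code sample that is not reproduced here, so the following is a rough plain-Python model of the two-stage combine it describes, not the Spark API itself. The helper names (aggregate_by_key, reduce_by_key) and the sample data are made up for illustration; in Spark the corresponding calls are aggregateByKey and reduceByKey on a pair RDD.

```python
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Plain-Python model of aggregateByKey over a list of partitions."""
    # Stage 1 (map side): within each partition, fold values into a per-key
    # accumulator that starts from zero_value.
    partials = []
    for partition in partitions:
        acc = {}
        for key, value in partition:
            acc[key] = seq_op(acc.get(key, zero_value), value)
        partials.append(acc)

    # Stage 2 (reduce side): merge the per-partition accumulators for each key.
    merged = {}
    for acc in partials:
        for key, partial in acc.items():
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


def reduce_by_key(partitions, func):
    """Model of reduceByKey: one function for both stages, no zero value."""
    merged = {}
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = func(local[key], value) if key in local else value
        for key, partial in local.items():
            merged[key] = func(merged[key], partial) if key in merged else partial
    return merged


def add(a, b):
    return a + b


# Hypothetical data: two partitions of (key, value) pairs.
partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]

# Same function on both sides, as in the excerpt: a per-key sum.
sums = aggregate_by_key(partitions, 0, add, add)
same_sums = reduce_by_key(partitions, add)

# The zero value sets the output type, so it can differ from the value type:
# here each key ends up with a (count, sum) tuple.
count_and_sum = aggregate_by_key(
    partitions,
    (0, 0),
    lambda acc, v: (acc[0] + 1, acc[1] + v),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
```

When both functions are the same and the output type matches the value type, the two helpers give the same answer, which is the situation the excerpt describes. The (count, sum) case is where aggregateByKey earns its extra parameters.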

This is a demo-driven post, so check it out.

Querying Serverless SQL Pools from Spark Notebooks in Scala

Jovan Popovic shows off one integration point between the data services in Azure Synapse Analytics:

Azure Synapse Analytics provides multiple query runtimes that you can use to query in-database or external data. You have the choice to use T-SQL queries using a serverless Synapse SQL pool or notebooks in Apache Spark for Synapse analytics to analyze your data.

You can also connect these runtimes and run the queries from Spark notebooks on a dedicated SQL pool.

In this post, you will see how to create Scala code in a Spark notebook that executes a T-SQL query on a serverless SQL pool.

Read on to see how. You can also query Spark pool and dedicated SQL pool tables from serverless SQL pools.

Geospatial Fraud Detection

Antoine Amend uses Databricks to identify financial fraud in a geographical area:

As part of this real-world solution, we are releasing a new open source geospatial library, GEOSCAN, to detect geospatial behaviors at massive scale, track customer patterns over time and detect anomalous card transactions. Finally, we demonstrate how organizations can surface anomalies from an analytics environment to an online data store (ODS) with tight SLA requirements following a Lambda-like infrastructure underpinned by Delta Lake, Apache Spark and MLflow.

Click through for the article, as well as three notebooks.
