Press "Enter" to skip to content

Category: Spark

Connecting to Azure Databricks from Power BI

Gerhard Brueckl walks us through the Power BI connector to Azure Databricks:

I work a lot with Azure Databricks and a topic that always comes up is reporting on top of the data that is processed with Databricks. Even though notebooks offer some great ways to visualize data for analysts and power users, it is usually not the kind of report the top-management would expect. For those scenarios, you still need to use a proper reporting tool, which usually is Power BI when you are already using Azure and other Microsoft tools.

So, I am very happy that there is finally an official connector in PowerBI to access data from Azure Databricks! Previously you had to use the generic Spark connector (docs) which was rather difficult to configure and did only support authentication using a Databricks Personal Access Token.

Click through to see how it works.

Leave a Comment

From Kafka Into Azure Data Explorer

Anagha Khanolkar walks us through a data movement scenario:

Here is an end-to-end, hands-on lab showcasing the connector in action. You can see an overview of the lab below. In our lab example, we’re going to stream the Chicago crimes public dataset to Kafka on Confluent Cloud on Azure using Spark on Azure Databricks. Then, we will use the Kusto connector to stream the data from Kafka to Azure Data Explorer.

There’s also a lab to try this out, though the estimated spend is a bit high.

Leave a Comment

Finding Skew in a Spark DataFrame

Landon Robinson walks us through skew in Spark DataFrames:

Ignoring issues caused by skew can be worth it sometimes, especially if the skew is not too severe, or isn’t worth the time spent for the performance gained. This is particularly true with one-off or ad-hoc analysis that isn’t likely to be repeated, and simply needs to get done.

However, the rest of the time, we need to find out where the skew is occurring, and take steps to dissolve it and get back to processing our big data. This post will show you one way to help find the source of skew in a Spark DataFrame. It won’t delve into the handful of ways to mitigate it (repartitioning, distributing/clustering, isolation, etc) (but our new book will), but this will certainly help pinpoint where the issue may be.

Click through to learn more.

Leave a Comment

Cloning Delta Lakes

Burak Yavuz and Pranav Anand show us how to clone Delta Lakes:

Clones are replicas of a source table at a given point in time. They have the same metadata as the source table: same schema, constraints, column descriptions, statistics, and partitioning. However, they behave as a separate table with a separate lineage or history. Any changes made to clones only affect the clone and not the source. Any changes that happen to the source during or after the cloning process also do not get reflected in the clone due to Snapshot Isolation. In Databricks Delta Lake we have two types of clones: shallow or deep.

Read on to learn the differences, as well as a few useful scenarios.

Leave a Comment

Azure Synapse Analytics Query Options

James Serra has a breakdown of what can query what in Azure Synapse Analytics:

The public preview version of Azure Synapse Analytics has three compute options and four types of storage that it can access (mentioned in my blog at SQL on-demand in Azure Synapse Analytics). This gives twelve possible combinations of querying data. Not all of these combinations currently are supported and some have a few quirks of which I list below.

Read on for a table which breaks down current functionality as well as expected GA functionality.

Leave a Comment

Renaming Cached DataFrames in Spark

Landon Robinson works around an annoyance:

But DataFrames have not been given the same, clear route to convenient renaming of cached data. It has, however, been attempted and requested by the community:

https://forums.databricks.com/questions/6525/how-to-setname-on-a-dataframe.html
https://issues.apache.org/jira/browse/SPARK-8480

However, with the below approach, you can start naming your DataFrames all you want. It’s very handy.

Read on to see the solution in action.

Leave a Comment

Spark SQL in Delta Lake

Kundan Kumarr walks us through some of the basic SQL operations you can perform with Delta Lake in Apache Spark:

Nowadays Delta lake is a buzz word in the Big Data world, especially among the spark developers because it relegates lots of issues found in the Big Data domain. Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It is evolving day by day and adds cool features in its every release. On 19th June 2020, Delta lake version 0.7.0 was released and this is the first release on Spark 3.x. This release involves important key features that can make the spark developer’s work easy.

One of the interesting key features in this release is the support for metastore-defined tables and SQL DDLs. So now we can define Delta tables in the Hive metastore and use the table name in all SQL operations. We can perform SQL DDLs to create tables, insert into tables, explicitly alter the schema of the tables, and so on. So in this blog, we will learn how we can perform SQL DDLs/DMLS/DQL in Delta Lake 0.7.0.

Click through for the examples.

Leave a Comment

Join Operations in Spark

Swantika Gupta compares hash and merge join operations in Apache Spark:

One of the most frequently used transformations in Apache Spark is Join operation. Joins in Apache Spark allow the developer to combine two or more data frames based on certain (sortable) keys. The syntax for writing a join operation is simple but some times what goes on behind the curtain is lost. Internally, for Joins Apache Spark proposes a couple of Algorithms and then chooses one of them. Not knowing what these internal algorithms are, and which one does spark choose might make a simple Join operation expensive.

While opting for a Join Algorithm, Spark looks at the size of the data frames involved. It considers the Join type and condition specified, and hint (if any) to finally decide upon the algorithm to use. In most of the cases, Sort Merge join and Shuffle Hash join are the two major power horses that drive the Spark SQL joins. But if spark finds the size of one of the data frames less than a certain threshold, Spark puts up Broadcast Join as it’s top contender.

Click through for the comparison, though do note that this comparison doesn’t include nested loop joins, which are possible in Spark as well.

Comments closed

Spark SQL and Delta Lake on Spark 3.0

Denny Lee, et al, walk us through some improvements in Spark 3.0 around Spark SQL’s usage of Delta Lake:

One of most frequent questions through our Delta Lake Tech Talks was when would DML operations such as delete, update, and merge be available in Spark SQL?  Wait no more, these operations are now available in SQL!  Below are example of how you can write delete, update, and merge (insert, update, delete, and deduplication operations using Spark SQL

Read on for the full update.

Comments closed

A Quick Demo: Kafka to Spark Streaming to Cassandra

Kundan Kumarr walks us through a simple data pipeline:

Spark Structured Streaming is a component of Apache Spark framework that enables scalable, high throughput, fault tolerant processing of data streams.
Apache Kafka is a scalable, high performance, low latency platform that allows reading and writing streams of data like a messaging system.
Apache Cassandra is a distributed and wide-column NoSQL data store.

As I’m reading through this, I enjoyed just how straightforward the whole process was.

Comments closed