
Category: Spark

Processing Security Logs in Databricks with Delta Live Tables

Silvio Fiorito ingests some data:

Databricks recently introduced Workflows to enable data engineers, data scientists, and analysts to build reliable data, analytics, and ML workflows on any cloud without needing to manage complex infrastructure. Workflows allows users to build ETL pipelines that are automatically managed, including ingestion and lineage, using Delta Live Tables. The benefits of Workflows and Delta Live Tables easily apply to security data sources, allowing us to scale to any volume or latency required for our operational needs.

In this article we’ll demonstrate some of the key benefits of Delta Live Tables for ingesting and processing security logs, with a few examples of common data sources we’ve seen our customers load into their cyber Lakehouse.

Click through to learn more.
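To give a flavor of the pattern (this sketch is mine, not pulled from the post, and its paths, table names, and filter condition are hypothetical), a minimal Delta Live Tables pipeline in Python that ingests raw security logs with Auto Loader might look like this:

```python
# Minimal Delta Live Tables sketch; this only runs inside a Databricks DLT
# pipeline, where the dlt module and the spark session are provided.
# The landing path, table names, and filter below are placeholders.
import dlt
from pyspark.sql.functions import col, to_timestamp

@dlt.table(comment="Raw security logs ingested incrementally with Auto Loader")
def security_logs_bronze():
    return (
        spark.readStream.format("cloudFiles")        # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/mnt/security/raw/")                  # hypothetical landing path
    )

@dlt.table(comment="Parsed and filtered security events")
@dlt.expect_or_drop("valid_timestamp", "event_time IS NOT NULL")
def security_logs_silver():
    return (
        dlt.read_stream("security_logs_bronze")
        .withColumn("event_time", to_timestamp(col("timestamp")))
        .where(col("action").isin("DENY", "ALERT"))  # hypothetical filter
    )
```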

Comments closed

Data Modeling with Spark–Breaking Data into Multiple Tables

Landon Robinson tokenizes data:

The result of joining the 2 DataFrames – pets and colors – displays the nickname, color, and age of the pets. We went from a normalized dataset – where common & recurring values were substituted for numeric representations – to a slightly more denormalized dataset. Let’s keep going!

This is an interesting example of a useful technique but I strongly disagree with Landon about whether this is normalization. Translating a natural key to a surrogate key is not normalizing the data and translating a surrogate key to a natural key (which is what the example does) is not denormalizing the data. A really simplified explanation of the process is that normalization is ensuring that like things are grouped together, not that we build key-value lookup tables for everything. That’s why Landon’s “denormalized” example is just as normalized as the original: each of those attributes describes a unique thing about the pet identified by its (unique) nickname. This would be different if we included things like owner’s name (which could still be on that table), owner’s age, owner’s height, a list of visits to the vet for each pet, when the veterinarians received their licenses, etc.
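To make the shape of the example concrete, here is a small PySpark sketch of the kind of join under discussion; the table and column names are my guesses rather than anything lifted from Landon's post.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pets-join").getOrCreate()

# A lookup table mapping a surrogate key to the natural color value.
colors = spark.createDataFrame(
    [(1, "brown"), (2, "black"), (3, "white")],
    ["color_id", "color"],
)

# Pets reference colors by the surrogate key.
pets = spark.createDataFrame(
    [("Rex", 1, 4), ("Whiskers", 2, 7), ("Snowball", 3, 2)],
    ["nickname", "color_id", "age"],
)

# Joining swaps the surrogate key back out for the natural value:
# convenient for reporting, but (as noted above) not denormalization.
pets_with_color = (
    pets.join(colors, on="color_id", how="inner")
        .select("nickname", "color", "age")
)
pets_with_color.show()
```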

Comments closed

Monitoring Streaming Queries in PySpark

Hyukjin Kwon, et al, lay out some monitoring advice:

Streaming is one of the most important data processing techniques for ingestion and analysis. It provides users and developers with low latency and real-time data processing capabilities for analytics and triggering actions. However, monitoring streaming data workloads is challenging because the data is continuously processed as it arrives. Because of this always-on nature of stream processing, it is harder to troubleshoot problems during development and production without real-time metrics, alerting and dashboarding.

Read on to see how you can use the Observable API for alerting in PySpark—previously, it had been a Scala-only API.
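For a flavor of what that looks like in Python, here is a rough sketch, assuming Spark 3.4 or later (where StreamingQueryListener is available in PySpark); the metric names, the "error" condition, and the alert action are all made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, lit, when, col
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.appName("observe-alerts").getOrCreate()

class AlertListener(StreamingQueryListener):
    """Fires once per micro-batch; checks the observed metrics attached below."""
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        row = event.progress.observedMetrics.get("batch_metrics")
        if row and row["error_rows"] > 0:
            # Swap the print for a real alerting hook (webhook, pager, etc.)
            print(f"ALERT: {row['error_rows']} of {row['total_rows']} rows look bad")

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(AlertListener())

# observe() attaches named aggregates that surface in the listener;
# the rate source and the "error" condition are placeholders.
stream = (
    spark.readStream.format("rate").load()
    .observe(
        "batch_metrics",
        count(lit(1)).alias("total_rows"),
        count(when(col("value") % 100 == 0, 1)).alias("error_rows"),
    )
)

query = stream.writeStream.format("console").start()
```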

Comments closed

Low-Code Churn Prediction with Synapse Analytics

Gavita Regunath shows off a capability in Azure Synapse Analytics:

We will build a machine learning solution to predict churn using Azure Synapse Analytics and Azure Machine Learning.

Azure Synapse Analytics is Microsoft’s limitless analytics platform that combines enterprise data warehousing and big data analytics. In simple terms, it is a one-stop-shop that allows you to ingest, prepare, and manage data that can then be used for machine learning and business intelligence, all from a single place. It provides a unified platform and encourages collaboration between data and machine learning professionals.

This article will show you how to build an end-to-end solution to train a machine learning model from Azure Synapse Analytics using AutoML functionality within Azure Machine Learning. Using the T-SQL Predict statement, we can then use the trained machine learning model to make predictions against the churn dataset stored in the SQL Pool table. One of the key benefits of working from within Azure Synapse is that all the necessary steps required to train and make predictions with the trained model can be done from a single platform, Azure Synapse.

Click through for the three-step process and a demonstration.
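The scoring step is compact enough to sketch. Below is a hedged example of T-SQL PREDICT against a dedicated SQL pool, wrapped in Python via pyodbc purely to keep these examples in one language (in the article it runs as a Synapse SQL script); the server, model, table, and column names are hypothetical, and the model is assumed to be stored in ONNX format, which dedicated pool PREDICT requires.

```python
# Hypothetical scoring of a churn table in a Synapse dedicated SQL pool
# using T-SQL PREDICT, called from Python via pyodbc.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-workspace.sql.azuresynapse.net;"   # placeholder server
    "Database=your_dedicated_pool;"                 # placeholder database
    "Authentication=ActiveDirectoryInteractive;"
)

predict_sql = """
SELECT d.customer_id, p.churn_prediction
FROM PREDICT(
        MODEL = (SELECT model FROM dbo.models WHERE model_name = 'churn_automl'),
        DATA  = dbo.churn_features AS d,
        RUNTIME = ONNX)
WITH (churn_prediction BIGINT) AS p;
"""

# Each row pairs the customer with the model's churn prediction.
for customer_id, churn_prediction in conn.execute(predict_sql):
    print(customer_id, churn_prediction)
```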

Comments closed

Databricks Workflows

Stacy Kerkela, et al, make an announcement:

Today we are excited to introduce Databricks Workflows, the fully-managed orchestration service that is deeply integrated with the Databricks Lakehouse Platform. Workflows enables data engineers, data scientists and analysts to build reliable data, analytics, and ML workflows on any cloud without needing to manage complex infrastructure. Finally, every user is empowered to deliver timely, accurate, and actionable insights for their business initiatives.

This looks a bit like Synapse pipelines. It’ll be interesting to see how this evolves.

Comments closed

Using the HAVING Clause in Spark

Landon Robinson continues the Spark Starter Guide:

Having is similar to filtering (filter(), where(), or where in a SQL clause), but the use cases differ slightly. While filtering allows you to apply conditions on your non-aggregated columns to limit the result set, Having allows you to apply conditions on aggregate functions/columns instead.

Read on for examples in Spark SQL, both as a SQL query and Scala/Python function calls.
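For a quick taste, here is a short PySpark sketch of the same idea with invented sample data (not taken from the guide): HAVING in Spark SQL next to an aggregate-then-filter in the DataFrame API.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName("having-demo").getOrCreate()

pets = spark.createDataFrame(
    [("Rex", "dog"), ("Fido", "dog"), ("Whiskers", "cat")],
    ["nickname", "species"],
)
pets.createOrReplaceTempView("pets")

# SQL: HAVING filters on the aggregate, after GROUP BY.
spark.sql("""
    SELECT species, COUNT(*) AS n
    FROM pets
    GROUP BY species
    HAVING COUNT(*) > 1
""").show()

# DataFrame API: the same thing is an aggregate followed by a filter/where.
(pets.groupBy("species")
     .agg(count("*").alias("n"))
     .where("n > 1")
     .show())
```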

Comments closed

Comparing Databricks to Synapse Spark Pools

Corrinna Peters makes comparisons:

There are different cases for using both depending on the specific needs and requirements. Synapse and Databricks are similar, but each has its own areas of speciality, or rather areas where it is above the other.

Data Lake – they both allow you to query the data from the data lake: Synapse uses either the SQL on-demand pool or Spark, and Databricks uses the Databricks workspace once you have mounted the data lake. If you are predominantly a SQL user and prefer the code and the BI developer feel, then Synapse would be the correct choice, whereas if you are a Data Scientist and prefer to code in Python or R, then Databricks would feel more at home.

Read on for a nuanced take. My less nuanced take is, Databricks beats the pants off of Synapse Spark pools in terms of performance. Synapse has a much better overall ecosystem, expanding beyond Spark and into T-SQL (in two flavors) and log/event analytics with KQL. If you’re spending 100% of your time in Spark and don’t care about the rest, use Databricks; if Spark is a relatively small part of your warehousing work, use Synapse.

1 Comment

Azure Databricks Security Considerations

Craig Porteous provides some advice on configuring Azure Databricks:

Azure Databricks is an analytics platform and often serves as the central compute component of a data platform, to process ETL/ELT data pipelines and data science workloads. As Databricks is a third-party platform-as-a-service offering, securing it works differently to most other first-party services in Azure; for example, we can’t use private endpoints. (More on these in the Azure Storage post.)

The two main approaches to working with Databricks in our secure platform are VNet Peering or VNet Injection.

Click through to learn the difference between these two, as well as a few other factors to keep in mind as you’re deploying Databricks.

Comments closed

From Confluent Cloud into Azure Synapse Analytics

Jacob Bogie and Dustin Vannoy show how to integrate Kafka in Confluent Cloud with pools in Azure Synapse Analytics:

Just released this fall is the fully managed Synapse Connector. Azure Synapse Analytics provides a platform for data analysts and data scientists to analyze and combine data from multiple sources. Within Confluent Cloud, data can be synced to dedicated SQL pools via the fully managed Synapse sink connector and attached to a Synapse Analytics workspace. Once added to the Synapse Analytics workspace, analysts have the ability to perform advanced analytics and reporting on data in the Confluent pipeline. The ability to access event-level data enables event-level analytics and data exploration.

Click through for two examples, one of loading data into a dedicated SQL pool and one of streaming data into Spark Streaming running on (naturally) a Spark pool.
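For the Spark side of things, here is a hedged sketch of what reading a Confluent Cloud topic with Structured Streaming tends to look like; the bootstrap server, topic, credentials, and output paths are placeholders, and the spark session is assumed to be the one a Synapse notebook provides.

```python
from pyspark.sql.functions import col

api_key = "CONFLUENT_API_KEY"        # placeholder credentials
api_secret = "CONFLUENT_API_SECRET"

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers",
            "pkc-xxxxx.region.provider.confluent.cloud:9092")  # placeholder
    .option("subscribe", "clickstream")                        # hypothetical topic
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{api_key}" password="{api_secret}";',
    )
    .option("startingOffsets", "latest")
    .load()
)

# Kafka keys and values arrive as bytes; cast to string
# (or parse JSON/Avro as appropriate for the topic).
events = raw.select(col("key").cast("string"), col("value").cast("string"))

query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/clickstream")  # placeholder
    .start("/tmp/delta/clickstream")                               # placeholder
)
```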

Comments closed