Press "Enter" to skip to content

Category: Spark

Choosing Compute Options for Databricks

Matt Collins shares some recommendations:

You can use Databricks for a vast range of applications these days. From handling streaming datasets and running Deep Learning models to populating data model fact tables with complex transformations, choosing the correct compute option can seem a lot like a stab in the dark followed by (potentially expensive) trial and error.

You can choose from an incredible range of configurations in the Databricks Compute User Interface.

This variety comprises the Virtual Machine (VM) category, size, availability, and access mode (also referred to as family).

The question of which compute option is right for you could be met with the classic answer "it depends", but this guide aims to inform the decision-making process, both in terms of cost and performance.

Read on for more information.

Tips for Databricks Asset Bundles

Dustin Vannoy has a new video:

This post and video cover some specific examples people have brought up when defining their Databricks Asset Bundles. The video includes a bit of review, but for more of an introduction, please see my first post on Databricks Asset Bundles. The GitHub repository I use will probably be the first to get new examples; however, I hope to continue adding to the examples in these posts, plus additional videos.

Click through to check out those tips.

An Overview of Spark in Microsoft Fabric

Reza Rad gives people a primer on Apache Spark:

Microsoft Fabric runs some workloads under the Spark engine, but what is it really? In this article, I’ll take you through the question of what Spark is, what benefits it has, how it is associated with Fabric, what configurations you have, and other things you need to know about it.

Reza talks a bit about history, interaction with languages, etc. As a quick addition to the languages list, you can use .NET languages like F# and C# with Spark, though it does involve setting up dotnet/spark and there are some open questions about its future. And I’m not even sure you could get it to work with Microsoft Fabric.

Getting the Top N Results in a PySpark Notebook

Gilbert Quevauvilliers only needs the top 1:

How to get the TopN rows using Python in Fabric Notebooks

When working with data, there are sometimes weird and wonderful requirements which must be met in order to get to the desired solution.

In today’s blog post I had a situation where I wanted to get a single row with the highest duration.

Gilbert uses the Spark SQL functionality via its Python (PySpark) function variant. You could also write a Spark SQL query using the LIMIT operator.
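
As a rough sketch of both routes (the data frame and the Duration column here are hypothetical stand-ins, not Gilbert’s actual data):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data standing in for the real dataset.
df = spark.createDataFrame(
    [("refresh A", 120), ("refresh B", 345), ("refresh C", 98)],
    ["Operation", "Duration"],
)

# PySpark function variant: sort descending and keep the top 1 row.
df.orderBy(F.col("Duration").desc()).limit(1).show()

# Spark SQL alternative using the LIMIT operator.
df.createOrReplaceTempView("operations")
spark.sql("SELECT * FROM operations ORDER BY Duration DESC LIMIT 1").show()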

Looping through Data in Microsoft Fabric PySpark Notebooks

Gilbert Quevauvilliers builds a loop:

Continuing with my existing blog series on what I’m learning with notebooks and PySpark.

Today, I’m going to explain to you how I found a way to loop through data in a notebook.

In this example, I’m going to show you how I loop through a range of dates, which can then be used in a subsequent query to extract data by passing each date into a DAX query.

Click through for Gilbert’s example. Here’s an alternative using something called a list comprehension. First, build a function that does what you want to do—that’d be the innards of Gilbert’s Python code, lines 31-54.

def perform_dax_query(row):
    var_Date = row["Date"]
    ...
    display(df_DAX_QueryResult)

Then, call that function for each row:

[perform_dax_query(row) for row in data_collect]

In this particular scenario, I’d personally stick with Gilbert’s composition, but in cases where you’re transforming a list of elements into a new list—for example, if you’re performing some data cleanup for each row in a list and you want the output to be a new list with cleaned-up data—then the list comprehension works really well.
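
To make that concrete, here is a small, generic sketch (not tied to Gilbert’s DAX scenario): a cleanup function applied to each element, with the comprehension collecting the results into a new list.

# Hypothetical values collected from a data frame, e.g. via df.collect().
raw_values = ["  2024-01-01 ", "2024-01-02\n", " 2024-01-03"]

def clean_value(value):
    # Trim stray whitespace from each value.
    return value.strip()

# Unlike the perform_dax_query example, where each call only displays a
# result, this comprehension builds and returns a new list of cleaned values.
cleaned = [clean_value(v) for v in raw_values]
print(cleaned)  # ['2024-01-01', '2024-01-02', '2024-01-03']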

An Introduction to the Native Execution Engine in Microsoft Fabric

Sandeep Pawar gives us the gentle version:

At MS Build, Microsoft announced the availability of the Native Execution Engine in Fabric. The product team will have detailed documentation and technical details, but I will attempt to provide an ELI5 version of what it means and how it works at a 30,000 ft level.

Read on to learn what the Native Execution Engine is and why Microsoft would use it versus running everything in the Java Virtual Machine like other services in the broader Hadoop ecosystem have historically done (at least at the beginning, before they built their own native execution engines!).

Adding the Current Date and Time to a PySpark Data Frame

Gilbert Quevauvilliers wants to know what time it is:

How to add current DateTime to existing PySpark data frame in a Fabric Notebook

In the blog post below, I am going to describe how to add the current Date Time to your existing Spark data frame.

This is really useful when I am inserting data into a Fabric Lakehouse table, and I want to know when the data got inserted.

Read on for the answer.
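
Gilbert’s post has the details; as a minimal sketch of the general idea, Spark’s built-in current_timestamp() function applied to a made-up data frame looks like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data frame standing in for the one you want to load.
df = spark.createDataFrame([(1, "load A"), (2, "load B")], ["Id", "Description"])

# Add an audit column recording when the row was processed.
df_with_ts = df.withColumn("InsertedDateTime", F.current_timestamp())
df_with_ts.show(truncate=False)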

Comparing SQL Server to Databricks

Paul Andrew makes a comparison:

Over the many years I’ve been working in the data/IT industry, Microsoft SQL Server and Azure Databricks have easily become my two favourite data processing tools. When Databricks became a first-class resource in Microsoft Azure, it was a big moment for the evolution of the data platform architectures I’ve designed and built (but architecture isn’t the focus for this blog). That said, rather than considering the tooling and technology as an evolution, I find a lot of people drawing comparisons between the products. This often leads to confusion and friction, as they ultimately offer a lot of different capabilities, with only some common areas where comparisons could be made.

Read on for Paul’s thoughts. Spoilers: I agree with pretty much all of it.

Visualizing a Spark Execution Plan

Gerhard Brueckl builds a very helpful tool:

I recently found myself in a situation where I had to optimize a Spark query. Coming from a SQL world originally, I knew how valuable a visual representation of an execution plan can be when it comes to performance tuning. Soon I realized that there is no easy-to-use tool or snippet which would allow me to do that. There are tools like DataFlint, the ubiquitous Spark monitoring UI, or the Spark explain() function, but they are either hard to use or hard to get up and running, especially as I was looking for something that works in both of my favorite Spark engines, Databricks and Microsoft Fabric.

Read on for Gerhard’s answer, including an example of it in action.
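
For reference, the plain-text starting point Gerhard mentions is the explain() function. This throwaway example (not part of Gerhard’s tool) shows what that raw output looks like:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").count()

# Print the parsed, analyzed, optimized, and physical plans as plain text,
# which is the output a visual representation makes far easier to read.
agg.explain(True)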

Speeding up Databricks Lakehouse Queries with Redis

Drew Furgiuele has the need for speed:

Since compute and storage are now separated, this means that any time you want to work with your data, you need some form of compute engine that is capable of connecting to and reading your data from your storage locations. Compute engines vary, but one of the best is Apache Spark, which gives you a great distributed compute layer suitable for all sorts of workloads, whether they be analytical and ad-hoc queries, dashboard or BI workloads, data engineering related, or even data science or AI/ML use cases. It really can do it all, and it does it very well.

But what about operational use cases? For instance: let's say your Lakehouse is hosting some data that is critical to customer-facing systems that demand low-latency response times, such as real-time user lookups, API interfaces, or event-driven systems. Sometimes the overhead required to take a query, schedule it, and run it can be in the hundreds of milliseconds. For some workloads, that's a lifetime.

Read on to see how you can build a caching layer on top of certain lakehouse operations when some operation needs to be as fast as possible.
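
As a rough illustration of the cache-aside pattern (not Drew’s actual implementation), here is a sketch using the redis-py client; the Redis endpoint, the key naming, and the lakehouse.customers table are all assumptions for the example.

import json

import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed connection details; point this at your own Redis endpoint.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_customer(customer_id: int) -> dict:
    # Cache-aside lookup: try Redis first, fall back to the Lakehouse table.
    key = f"customer:{customer_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: run the (slower) Spark query against the Lakehouse.
    rows = (
        spark.table("lakehouse.customers")  # hypothetical table name
        .where(f"customer_id = {customer_id}")
        .limit(1)
        .collect()
    )
    result = rows[0].asDict() if rows else {}

    # Store the result with a short TTL so hot lookups skip Spark next time.
    cache.setex(key, 300, json.dumps(result, default=str))
    return result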
