Spark – Curated SQL

Accessing Delta Lake via Spark Connect

Published 2025-06-23 by Kevin Feasel

I have just finished an update for the spark connect dotnet lib that contains the DeltaTable implementation so that we can now use .NET to maintain delta tables, over and above what we get out of the box by using DataFrame.Write.Format("delta"), this is an example of how to use the delta api from .NET:

Click through for the example, and you can learn more about Spark Connect from the Spark website and Ed has a .NET client for the task.

Comments closed

Spark Streaming plus Drools

Published 2025-06-11 by Kevin Feasel

Ram Ghadiyaram builds a tool:

Near real-time decision-making systems are critical for modern business applications. Integrating Apache Spark (Streaming) and Drools provides scalability and flexibility, enabling efficient handling of rule-based decision-making at scale. This article showcases their integration through a loan approval system, demonstrating its architecture, implementation, and advantages.

Click through for a bit of sample code.

Comments closed

Apache Spark 3.5 Support in Azure Synapse Analytics

Published 2025-06-03 by Kevin Feasel

Arshad Ali has an announcement:

You can now create Azure Synapse Runtime for Apache Spark 3.5. The essential changes include features which come from upgrading Apache Spark to version 3.5 and Delta Lake 3.2. Please review the official release notes for Apache Spark 3.5 to check the complete list of fixes and features. In addition, review the migration guidelines between Spark 3.4 and 3.5 to assess potential changes to your applications, jobs and notebooks.

Credit where credit is due: I’ve made light of the utter lack of work on Azure Synapse Analytics since Microsoft Fabric’s release. But hey, they did a thing. Granted, the impetus behind this was to “prepare for migrating to Microsoft Fabric Spark.”

Comments closed

What’s New in Apache Spark 4.0

Published 2025-06-03 by Kevin Feasel

Ram Ghadiyaram looks at recent updates to Apache Spark:

Hurray! Apache Spark 4.0, released in 2025, redefines big data processing with innovations that enhance performance, accessibility, and developer productivity. With contributions from over 400 developers across organizations like Databricks, Apple, and NVIDIA, Spark 4.0 resolves thousands of JIRA issues, introducing transformative features: native plotting in PySpark, Python Data Source API, polymorphic User-Defined Table Functions (UDTFs), state store enhancements, SQL scripting, and Spark Connect improvements. This report provides an in-depth exploration of these features, their technical underpinnings, and practical applications through original examples and diagrams.

Click through to see what’s on the list of major features.

Comments closed

Automated Table Statistics on Delta Tables in Microsoft Fabric

Published 2025-05-30 by Kevin Feasel

Santhosh Kumar Ravindran makes an announcement:

We’re thrilled to introduce Automated Table Statistics in Microsoft Fabric Data Engineering — a major upgrade that helps you get blazing-fast query performance with zero manual effort.

Whether you’re running complex joins, large aggregations, or heavy filtering workloads, Fabric’s new automated statistics will help Spark make smarter decisions, saving you time, compute, and money.

Click through to see what’s included, as well as the limitations associated with this. You can still create manual statistics if you’d like, so on the whole, I approve.

Comments closed

Checking Key Vault Access in Microsoft Fabric Spark Notebooks

Published 2025-05-09 by Kevin Feasel

Marc Lelijveld has clearance:

Working with sensitive data in Microsoft Fabric requires careful handling of secrets, especially when collaborating externally. In a recent customer engagement, I needed to validate access to Azure Key Vault from within a Fabric Notebook, without ever exposing the actual secret values. With only read access granted and no need to manage or update secrets, I focused on confirming that the connection was working as expected.

In this blog, I’ll walk you through the approach, including the setup, code snippets, and logic behind this quick but crucial verification step.

Click through for the full story.

Comments closed

When to Use a Python Notebook vs Spark Notebook in Microsoft Fabric

Published 2025-04-23 by Kevin Feasel

Gilbert Quevauvilliers lays out the plan:

This is the first blog post in a series of blog posts where I dive into how to use Python notebooks instead of Spark notebooks. For example, I will show you how to run a SQL query from a Lakehouse table and get it into a data frame. Read and write to a Lakehouse table and more.

NOTE: This is still in preview, but I personally think that this is worth investing time in learning.

The reason I am using the term Python is because the notebook can ONLY use Python and not any of the other languages available in a Spark

Also, in fairness, I’ve heard people working on Microsoft Fabric within the company reference these as ‘Python notebooks,’ so Gilbert is in good company.

Comments closed

Customizing Spark Settings in Microsoft Fabric Workspaces

Published 2025-04-11 by Kevin Feasel

Nikola Ilic doesn’t accept the default:

In this article, I’ll walk you through how to go from out-of-the-box default Spark configurations to a fine-tuned setup that suits your specific workloads and requirements, as well as getting you ready for the DP-700 exam.

Spark is an extremely powerful engine, but like any powerful tool, it runs best when you tune it. So, don’t always settle for default. Get dynamic—and get Spark working the way you need it to.

Click through for the explanation of functionality.

Comments closed

Common Data Model Connector for Synapse Spark 3.4

Published 2025-04-08 by Kevin Feasel

Richard Swinbank deals with a totally-not-deprecated platform:

The underlying problem here appears to be that the Spark connector is simply not supported in v3.4. There’s very little I can find to officially confirm or deny this, but an answer to this question on Microsoft Q&A backs this up. The answer also suggests a few options, including:

downgrade to Spark 3.3 – this isn’t an option because it’s end-of-life

migrate to Fabric – long term this is good idea, but it’s not a quick fix for this problem.

use alternative data access methods, e.g. using the Azure Data Lake Storage Gen2 connector.

In this article, I take a look at option (3).

Click through for Richard’s workaround.

Comments closed

Data Quality Management with Great Expectations and Databricks

Published 2025-03-27 by Kevin Feasel

Sairamakrishna BuchiReddy Karri and Srinivasarao Rayankula show off Great Expectations:

Data quality checks are critical for any production pipeline. While there are many ways to implement them, the Great Expectations library is a popular one.

Great Expectations is a powerful tool for maintaining data quality by defining, managing, and validating expectations for your data. In this article, we will discuss how you can use it to ensure data quality in your data pipelines.

Click through to see how it all works.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Spark