Press "Enter" to skip to content

Category: Spark

Debugging in Databricks

Chen Hirsh enables a debugger:

Do you know that feeling, when you write beautiful code and everything just works perfectly on the first try?

I don’t.

Every time I write code, it doesn’t work at first, and I have to debug it, make changes, test it…

Databricks introduced a debugger you can use on a code cell, and I’ve wanted to try it for quite some time now. Well, I guess the time is now.

I’m having trouble finding the utility of a debugger here. Notebooks are already set up for debugging: you can easily add or remove cells, and the underlying session maintains state between cells.


Updates on the Spark Connect Dotnet Library

Ed Elliott has an update for us:

There have been quite a few changes in the last couple of months, and I just wanted to give a quick update on the current state of the project. In terms of usage, I am starting to hear from people using the library and submitting PRs and requests, so although usage is pretty low (which is expected, given that usage of the Microsoft-supported version wasn’t very high), it is growing, which is interesting.

Read on for thoughts on production readiness, support for Spark 4.0, a couple of other updates, and some future plans.


A Primer on SparkSQL and PySpark

Anurag K covers the basics of PySpark:

In the era of big data, efficient data processing is critical for insights-driven decision-making. PySpark SQL, a part of Apache Spark, enables data engineers and analysts to work with structured data at massive scale. Combining SQL’s simplicity with Spark’s processing power, it opens a gateway to handling vast datasets seamlessly. This comprehensive guide walks you through PySpark SQL, from foundational concepts to advanced querying techniques, with detailed code examples. Let’s dive in and master PySpark SQL for data-driven analytics.

Click through for examples covering a variety of operations you can perform.
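As a quick, generic illustration of the pattern (not taken from Anurag’s post, and with made-up data and view names), the usual PySpark SQL workflow is to register a DataFrame as a temporary view and then query it with plain SQL:

from pyspark.sql import SparkSession

# Build (or reuse) a Spark session.
spark = SparkSession.builder.appName("pyspark-sql-primer").getOrCreate()

# A small sample DataFrame standing in for real data.
orders = spark.createDataFrame(
    [(1, "widgets", 120.0), (2, "gadgets", 75.5), (3, "widgets", 42.0)],
    ["order_id", "product", "amount"],
)

# Expose the DataFrame to SQL under a temporary view name.
orders.createOrReplaceTempView("orders")

# Query it like any table; the result is another DataFrame.
spark.sql("""
    SELECT product, SUM(amount) AS total_amount
    FROM orders
    GROUP BY product
    ORDER BY total_amount DESC
""").show()

The same aggregation could be written with the DataFrame API (groupBy/agg); both compile to the same execution plan, so the choice is mostly about readability.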


Enabling System Tables on Databricks

Chen Hirsh wants to see system tables:

This post is about two things that are dependent on each other. First, it explains how to enable system tables (other than the one enabled by default), and second, how to use these system tables to view data about workflow runs and costs.

Please note that Unity Catalog is required to use these features. And a premium workspace is required for using dashboards.

Click through to learn more about what system tables are and what you can get from them.
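Once the relevant schemas are enabled, system tables are queried like any other Unity Catalog table. As a rough sketch (the table and column names below follow the documented system.billing.usage schema, but verify them in your own workspace; spark is the session a Databricks notebook provides):

# Daily DBU consumption by SKU, read straight from the billing system table.
# Assumes the system.billing schema is enabled and you have SELECT rights on it.
spark.sql("""
    SELECT usage_date, sku_name, SUM(usage_quantity) AS usage_quantity
    FROM system.billing.usage
    GROUP BY usage_date, sku_name
    ORDER BY usage_date DESC
""").show()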


Finding Mutable and Immutable Properties in Microsoft Fabric Spark

Sandeep Pawar wants to make a change:

Spark properties are divided into mutable and immutable configurations based on whether they can be safely modified at runtime after the Spark session is created.

Mutable properties can be changed dynamically using spark.conf.set() without requiring a restart of the Spark application – these typically include performance tuning parameters like shuffle partitions, broadcast thresholds, AQE, etc.

Immutable properties, on the other hand, are global configurations that affect core Spark behavior and cluster setup, and these must be set before/at session initialization as they require a fresh session to take effect.

Read on to see how you can tell which is which.
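A quick way to see the difference for yourself (a generic Spark sketch, not Sandeep’s method; the exact error text varies by Spark version):

# Mutable: shuffle partitions can be changed mid-session and take effect immediately.
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Immutable: core/static configs cannot be modified on a live session.
# In recent Spark versions this raises an AnalysisException along the lines of
# "Cannot modify the value of a Spark config: spark.executor.memory".
try:
    spark.conf.set("spark.executor.memory", "8g")
except Exception as err:
    print(err)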


Power BI Automatic Aggregations and Databricks

Katie Cummiskey, et al., do a bit of caching:

Automatic aggregations streamline the process of improving BI query performance by maintaining an in-memory cache of aggregated data. This means that a substantial portion of report queries can be served directly from this in-memory cache instead of relying on the backend data sources. Power BI automatically builds these aggregations using AI based on your query patterns and then intelligently decides which queries can be served from the in-memory cache and which are routed to the data source through DirectQuery, resulting in faster visualizations and reduced load on the backend systems.

Click through to learn more about automatic aggregations, which SKUs of Power BI / Fabric are eligible, and how you can enable it for data coming from Databricks.


Map and FlatMap in PySpark

Vipul Kumar does a bit of work with resilient distributed datasets:

PySpark, the Python API for Apache Spark, is widely used for big data processing and distributed computing. It enables data engineers and data scientists to efficiently process large datasets using resilient distributed datasets (RDDs) and DataFrames. Two commonly used transformations in PySpark are map() and flatMap(). These functions allow users to perform operations on RDDs and are pivotal in distributed data processing.

In this blog, we’ll explore the key differences between map() and flatMap(), their use cases, and how they can be applied in PySpark.

The DataFrame approach has all but obviated the original Hadoop-style map-reduce way of writing Spark code. Even so, I do think it’s useful to know how it all works.
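For a concrete picture of the difference (a minimal sketch using a made-up RDD, not code from the post):

# map produces exactly one output element per input element...
lines = spark.sparkContext.parallelize(["hello world", "map vs flatMap"])
print(lines.map(lambda line: line.split(" ")).collect())
# [['hello', 'world'], ['map', 'vs', 'flatMap']]  -- nested lists

# ...while flatMap flattens each returned iterable into a single RDD of elements.
print(lines.flatMap(lambda line: line.split(" ")).collect())
# ['hello', 'world', 'map', 'vs', 'flatMap']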


Trying out the Databricks For-Each Task

Chen Hirsh goes in a loop:

Databricks recently added a for-each task to their workflow capability. Workflows are Databricks jobs, like Data Factory pipelines or SQL Server jobs: a pipeline that you can schedule, which includes a number of tasks that together complete some business logic.

Theoretically, the long-awaited for-each task should make running multiple processes easier. For example, one of the things I often do is run a list of notebooks, each processing a different table, with no dependencies between them. At the moment I use parallel notebooks – https://docs.databricks.com/en/notebooks/notebook-workflows.html#run-multiple-notebooks-concurrently

As you will see later, this use case is not supported yet. But before that, let’s see what we can do with the for-each task.

Read on to see what it currently can do, and what it cannot.
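For reference, the parallel-notebooks workaround Chen links to boils down to calling dbutils.notebook.run from a thread pool. This is only a sketch – the notebook paths and arguments below are invented, and dbutils is only available inside a Databricks notebook:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical notebooks, each loading a different table with no interdependencies.
notebooks = ["/Shared/load_customers", "/Shared/load_orders", "/Shared/load_products"]

def run_notebook(path):
    # 3600 is the per-notebook timeout in seconds; the dict is passed as widget parameters.
    return dbutils.notebook.run(path, 3600, {"triggered_by": "parallel_run"})

# Run up to three notebooks at once and collect their exit values.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_notebook, notebooks))

print(results)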


Tracking Microsoft Fabric Notebook Progress

Gilbert Quevauvilliers asks are we there yet? are we there yet?

How to view or track the progress of a Notebook while it is running in Microsoft Fabric

I was recently working with a Notebook in Microsoft Fabric that was started via a Data Pipeline.

The challenge I had was that I had no idea how far the notebook had gone (as there were quite a lot of cells in this particular notebook).

In this blog post I am going to show you how I can use Microsoft Fabric to identify exactly which cell my notebook is currently on.

Click through for the answer. And so help me, if you ask that question one more time, I’m turning this thing around and we’re going back home.
