Spark – Page 4 – Curated SQL

Power BI Automatic Aggregations and Databricks

Published 2024-10-29 by Kevin Feasel

Katie Cummiskey, et al, do a bit of caching:

Automatic aggregations streamline the process of improving BI query performance by maintaining an in-memory cache of aggregated data. This means that a substantial portion of report queries can be served directly from this in-memory cache instead of relying on the backend data sources. Power BI automatically builds these aggregations using AI based on your query patterns and then intelligently decides which queries can be served from the in-memory cache and which are routed to the data source through DirectQuery, resulting in faster visualizations and reduced load on the backend systems.

Click through to learn more about automatic aggregations, which SKUs of Power BI / Fabric are eligible, and how you can enable it for data coming from Databricks.

Comments closed

Map and FlatMap in PySpark

Published 2024-10-25 by Kevin Feasel

Vipul Kumar does a bit of work with resilient distributed datasets:

PySpark, the Python API for Apache Spark, is widely used for big data processing and distributed computing. It enables data engineers and data scientists to efficiently process large datasets using resilient distributed datasets (RDDs) and DataFrames. Two commonly used transformations in PySpark are map() and flatMap(). These functions allow users to perform operations on RDDs and are pivotal in distributed data processing.

In this blog, we’ll explore the key differences between map() and flatMap(), their use cases, and how they can be applied in PySpark.

The DataFrame approach has all but obviated having developers use the original Hadoop-like map-reduce approach to writing code in Spark. Even so, I do think it’s useful to know how it all works.

Comments closed

Support for Variant Data Type in Spark Connect DotNet

Published 2024-10-15 by Kevin Feasel

Ed Elliott has an update for us:

I recently published the latest version of the Spark Connect Dotnet library which includes support for the new Variant data type in Apache Spark 4.0 here. One of the new features of Spark 4.0 is the Variant data type which is a faster way of processing Json data (see here).

Click through to see how it all works.

Comments closed

Trying out the Databricks For-Each Task

Published 2024-10-02 by Kevin Feasel

Chen Hirsh goes in a loop:

Databricks recently added a for-each task to their workflow capability. Workflows are Databricks jobs, like Data factory pipelines, or SQL server jobs, a pipeline that you can schedule, that include a number of tasks that together complete some business logic.

Theoretically, the long-awaited for-each task should make the run of multiple processes easier. for example, one of the things I often do is run a list of notebooks, each processing a different table, with no dependencies between them. At the moment I use parallel notebooks – https://docs.databricks.com/en/notebooks/notebook-workflows.html#run-multiple-notebooks-concurrently

As you will see later, this use case is not supported yet. But before that, let’s see what we can do with the for-each task.

Read on to see what it currently can do, and what it cannot.

Comments closed

Tracking Microsoft Fabric Notebook Progress

Published 2024-09-26 by Kevin Feasel

Gilbert Quevauvilliers asks are we there yet? are we there yet?

How to view or track the progress of Notebook while it is running in Microsoft Fabric

I was recently working with a Notebook in Microsoft Fabric that was started via a Data Pipeline.

The challenge I had was that I had no idea how far the notebook had gone (as there were quite a lot of cells in this particular notebook).

In this blog post I am going to show you how I can use Microsoft Fabric to identify exactly which cell my notebook is currently on.

Click through for the answer. And so help me, if you ask that question one more time, I’m turning this thing around and we’re going back home.

Comments closed

Loading Entra ID Group Membership into JSON via Python Notebook

Published 2024-09-18 by Kevin Feasel

Gilbert Quevauvilliers wants to know who’s in your group:

Using a Service Principal to get all Entra ID Group Members into JSON File using a Python Notebook

Sometimes it is useful to get all Group Members into a JSON file so that this could be used for reporting purposes.

Click through for the instructions.

Comments closed

Cloning Tables in Databricks

Published 2024-09-17 by Kevin Feasel

Chen Hirsh hogs the photocopier:

The simplest use case to explain why table cloning is helpful is this: Let’s say you have a large table, and you want to test some new process on it, but you don’t want to ruin the data for other processes, so you need a clean copy of your table (or multiple tables) to play with. Coping a large table might take time (Databricks does it very fast, but if it’s a big table it still takes time to copy the data) ,and what happens if you then need to change your code? you have to drop the target table, copy the source table again, and so on.

here is where cloning can be your friend.

Read on to learn about three cloning techniques. H/T Madeira Data Solutions blog.

Comments closed

Working with Excel Files in Databricks

Published 2024-09-16 by Kevin Feasel

Chen Hirsh deals with truly big data:

Excel is one of the most common data file formats, and, as data engineers, we are required to read data from it on almost every project. Excel is easy to use, and you can customize it quickly, like adding a column and changing data. But the same things that made it the go-to format for users, make it hard to read by Data platforms. Adding a column might break a pipeline, and changing datatypes, for example, adding text to a column that only held numeric data before, might cause a nasty error downstream.

Working in Databricks, you can read and write Excel files, but you need to pay attention to some pitfalls. So let’s get started, working with Excel files on Databricks!

Click through for a way to do this using PySpark. H/T Madeira Data Solutions blog.

Comments closed

Generating a Multi-Aggregate Pivot in Spark

Published 2024-09-13 by Kevin Feasel

Richard Swinbank troubleshoots an issue:

I’m using a stream watermark to handle late arriving data – basically¹⁾ my watermark enables the stream to accept data arriving up to 10 seconds late …and that’s where the problem shows up.

When I run this streaming query – in Azure Databricks I can do this simply with display(df_pivot) – I receive the error:

AnalysisException: Detected pattern of possible ‘correctness’ issue due to global watermark. The query contains stateful operation which can emit rows older than the current watermark plus allowed late record delay, which are “late rows” in downstream stateful operations and these rows can be discarded. Please refer the programming guide doc for more details. If you understand the possible risk of correctness issue and still need to run the query, you can disable this check by setting the config `spark.sql.streaming.statefulOperator.checkCorrectness.enabled` to false.

Read on to learn more about the scenario, the issue, and the solution.

Comments closed

Querying a Fabric SQL Endpoint via Notebook and T-SQL

Published 2024-09-06 by Kevin Feasel

Sandeep Pawar talks about a Spark connector:

I am not sharing anything new. The spark data warehouse connector has been available for a couple months now. It had some bugs, but it seems to be stable now. This connector allows you to query the lakehouse or warehouse endpoint in the Fabric notebook using spark. You can read the documentation for details but below is a quick pattern that you may find handy.

Despite it not being anything new, it is still interesting to see the use case of writing T-SQL instead of Spark SQL.

Comments closed

Category: Spark