Press "Enter" to skip to content

Category: Spark

Cloning Tables in Databricks

Chen Hirsh hogs the photocopier:

The simplest use case to explain why table cloning is helpful is this: Let’s say you have a large table, and you want to test some new process on it, but you don’t want to ruin the data for other processes, so you need a clean copy of your table (or multiple tables) to play with. Copying a large table might take time (Databricks does it very fast, but if it’s a big table it still takes time to copy the data), and what happens if you then need to change your code? You have to drop the target table, copy the source table again, and so on.

Here is where cloning can be your friend.

Read on to learn about three cloning techniques. H/T Madeira Data Solutions blog.
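
The post works through the Databricks syntax, but as a rough sketch of the idea from a notebook (the table names below are placeholders): a shallow clone copies only the table’s metadata and points at the source’s existing data files, so it is created almost instantly and can be dropped and recreated as you iterate, while a deep clone also copies the data files for a fully independent copy.

# Placeholder table names, purely for illustration.
# SHALLOW CLONE copies metadata only and references the source's data files;
# DEEP CLONE also copies the underlying data for an independent copy.
spark.sql("CREATE OR REPLACE TABLE sandbox.my_table_test SHALLOW CLONE prod.my_table")
spark.sql("CREATE OR REPLACE TABLE sandbox.my_table_copy DEEP CLONE prod.my_table")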


Working with Excel Files in Databricks

Chen Hirsh deals with truly big data:

Excel is one of the most common data file formats, and, as data engineers, we are required to read data from it on almost every project. Excel is easy to use, and you can customize it quickly, like adding a column and changing data. But the same things that made it the go-to format for users make it hard for data platforms to read. Adding a column might break a pipeline, and changing data types, for example, adding text to a column that only held numeric data before, might cause a nasty error downstream.

Working in Databricks, you can read and write Excel files, but you need to pay attention to some pitfalls. So let’s get started, working with Excel files on Databricks!

Click through for a way to do this using PySpark. H/T Madeira Data Solutions blog.
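
Not necessarily the approach Chen takes, but as one rough sketch of reading a workbook in a Databricks notebook, you can read it with pandas and convert the result to a Spark DataFrame. This assumes an Excel engine such as openpyxl is available on the cluster, and the path below is made up.

import pandas as pd

# Hypothetical path to an uploaded workbook (e.g. in a Unity Catalog volume).
excel_path = "/Volumes/my_catalog/my_schema/files/sales.xlsx"

# Read one sheet with pandas (openpyxl handles .xlsx), then hand it to Spark.
pdf = pd.read_excel(excel_path, sheet_name="Sheet1")
df = spark.createDataFrame(pdf)

display(df)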


Generating a Multi-Aggregate Pivot in Spark

Richard Swinbank troubleshoots an issue:

I’m using a stream watermark to handle late arriving data – basically my watermark enables the stream to accept data arriving up to 10 seconds late …and that’s where the problem shows up.

When I run this streaming query – in Azure Databricks I can do this simply with display(df_pivot) – I receive the error:

AnalysisException: Detected pattern of possible ‘correctness’ issue due to global watermark. The query contains stateful operation which can emit rows older than the current watermark plus allowed late record delay, which are “late rows” in downstream stateful operations and these rows can be discarded. Please refer the programming guide doc for more details. If you understand the possible risk of correctness issue and still need to run the query, you can disable this check by setting the config `spark.sql.streaming.statefulOperator.checkCorrectness.enabled` to false.

Read on to learn more about the scenario, the issue, and the solution.
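
For context, the watermark Richard describes looks something like this in PySpark. This is a minimal sketch with a made-up streaming DataFrame and column names, not his actual query; the error above appears when a downstream stateful operation can receive rows older than that watermark.

from pyspark.sql import functions as F

# df_stream is a hypothetical streaming DataFrame with an event_time column.
# The watermark tells Spark to accept events arriving up to 10 seconds late,
# matching the scenario described in the quote above.
df_agg = (
    df_stream
        .withWatermark("event_time", "10 seconds")
        .groupBy(F.window("event_time", "1 minute"), "category")
        .agg(F.count("*").alias("event_count"))
)

The error message itself mentions a config flag that disables the check; read the post before reaching for it.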


Querying a Fabric SQL Endpoint via Notebook and T-SQL

Sandeep Pawar talks about a Spark connector:

I am not sharing anything new. The Spark data warehouse connector has been available for a couple of months now. It had some bugs, but it seems to be stable now. This connector allows you to query the lakehouse or warehouse endpoint in the Fabric notebook using Spark. You can read the documentation for details, but below is a quick pattern that you may find handy.

Despite it not being anything new, it is still interesting to see the use case of writing T-SQL instead of Spark SQL.
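
If I recall the connector correctly, it hangs a synapsesql method off the Spark reader, so the basic read looks roughly like the sketch below. The warehouse, schema, and table names are placeholders; see Sandeep’s post and the official documentation for the exact pattern, including how to run T-SQL against the endpoint.

# Rough sketch of the Spark data warehouse connector's read pattern in a
# Fabric notebook; the three-part name (warehouse.schema.table) is a placeholder.
df = spark.read.synapsesql("MyWarehouse.dbo.DimCustomer")

display(df)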


Choosing Compute Options for Databricks

Matt Collins shares some recommendations:

You can use Databricks for a vast range of applications these days. From handling streaming datasets and running deep learning models to populating data model fact tables with complex transformations, choosing the correct compute option can seem a lot like a stab in the dark followed by (potentially expensive) trial and error.

You can choose from an incredible range of configurations in the Databricks Compute User Interface.

This variety is comprised of the Virtual Machine (VM) category, size, availability and access mode (also referred to as family).

Determining the right compute choice for you could be met with the classic answer “it depends”, but this guide aims to inform the decision-making process, both in terms of cost and performance.

Read on for more information.


Tips for Databricks Asset Bundles

Dustin Vannoy has a new video:

This post and video cover some specific examples people have brought up when defining their Databricks Asset Bundles. The video includes a bit of review, but for more introduction please see my first post on Databricks Asset Bundles. The GitHub repository I use will probably be the first to update with new examples; however, I hope to continue to add to the examples in these posts plus additional videos.

Click through to check out those tips.


An Overview of Spark in Microsoft Fabric

Reza Rad gives people a primer on Apache Spark:

Microsoft Fabric runs some workloads under the Spark engine, but what is it really? In this article, I’ll take you through the question of what Spark is, what benefits it has, how it is associated with Fabric, what configurations you have, and other things you need to know about it.

Reza talks a bit about history, interaction with languages, etc. As a quick addition to the languages list, you can use .NET languages like F# and C# with Spark, though it does involve setting up dotnet/spark and there are some open questions about its future. And I’m not even sure you could get it to work with Microsoft Fabric.


Getting the Top N Results in a PySpark Notebook

Gilbert Quevauvilliers only needs the top 1:

How to get the TopN rows using Python in Fabric Notebooks

When working with data, there are sometimes weird and wonderful requirements which must be met in order to get to the desired solution.

In today’s blog post I had a situation where I wanted to get a single row with the highest duration.

Gilbert uses the Spark SQL version, specifically the Python function variant. You could also use Spark SQL and write a query using the LIMIT operator.
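
As a rough sketch of both routes, assuming a DataFrame df with a Duration column (the column and view names are made up):

from pyspark.sql import functions as F

# DataFrame API: sort descending and keep a single row.
df_top1 = df.orderBy(F.col("Duration").desc()).limit(1)

# Spark SQL equivalent, using the LIMIT operator.
df.createOrReplaceTempView("vw_durations")
df_top1_sql = spark.sql("SELECT * FROM vw_durations ORDER BY Duration DESC LIMIT 1")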


Looping through Data in Microsoft Fabric PySpark Notebooks

Gilbert Quevauvilliers builds a loop:

Continuing with my existing blog series on what I’m learning with notebooks and PySpark.

Today, I’m going to explain to you how I found a way to loop through data in a notebook.

In this example, I’m going to show you how I loop through a range of dates, which can then be used in a subsequent query to extract data by passing through each date into a DAX query.

Click through for Gilbert’s example. Here’s an alternative using something called a list comprehension. First, build a function that does what you want to do—that’d be the innards of Gilbert’s Python code, lines 31-54.

def perform_dax_query(row):
    var_Date = row["Date"]
    ...
    display(df_DAX_QueryResult)

Then, call that function for each row:

[perform_dax_query(row) for row in data_collect]

In this particular scenario, I’d personally stick with Gilbert’s composition, but in cases where you’re transforming a list of elements into a new list—for example, if you’re performing some data cleanup for each row in a list and you want the output to be a new list with cleaned-up data—then the list comprehension works really well.
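
For instance, a small (made-up) cleanup like this returns a new list rather than just producing side effects:

# Transform each collected row into a cleaned value and keep the results.
cleaned_names = [row["Name"].strip().lower() for row in data_collect]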
