Press "Enter" to skip to content

Category: Spark

Choosing Compute Options for Databricks

Matt Collins shares some recommendations:

You can use Databricks for a vast range of applications these days. From handling streaming datasets and running Deep Learning models to populating data model fact tables with complex transformations, choosing the correct compute option can seem a lot like a stab in the dark followed by (potentially expensive) trial and error.

You can choose from an incredible range of configurations in the Databricks Compute User Interface.

This variety comprises the Virtual Machine (VM) category, size, availability, and access mode (also referred to as family).

The question of which compute option is right for you could be met with the classic answer "it depends", but this guide aims to inform the decision-making process, both in terms of cost and performance.

Read on for more information.

Tips for Databricks Asset Bundles

Dustin Vannoy has a new video:

This post and video cover some specific examples people have brought up when defining their Databricks Asset Bundles. The video includes a bit of review, but for more of an introduction, please see my first post on Databricks Asset Bundles. The GitHub repository I use will probably be the first to get new examples; however, I hope to continue adding to the examples in these posts, plus additional videos.

Click through to check out those tips.

An Overview of Spark in Microsoft Fabric

Reza Rad gives people a primer on Apache Spark:

Microsoft Fabric runs some workloads under the Spark engine, but what is it really? In this article, I’ll take you through the question of what Spark is, what benefits it has, how it is associated with Fabric, what configurations you have, and other things you need to know about it.

Reza talks a bit about history, interaction with languages, etc. As a quick addition to the languages list, you can use .NET languages like F# and C# with Spark, though it does involve setting up dotnet/spark and there are some open questions about its future. And I’m not even sure you could get it to work with Microsoft Fabric.

Getting the Top N Results in a PySpark Notebook

Gilbert Quevauvilliers only needs the top 1:

How to get the TopN rows using Python in Fabric Notebooks

When working with data, there are sometimes weird and wonderful requirements which must be met in order to get to the desired solution.

In today’s blog post I had a situation where I wanted to get a single row with the highest duration.

Gilbert uses the Spark SQL functionality via its Python (PySpark) function variant. You could also write a Spark SQL query using the LIMIT operator.
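
As a rough sketch of both routes (the data frame and the Duration column here are hypothetical stand-ins, not Gilbert’s actual data):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data standing in for the real dataset.
df = spark.createDataFrame(
    [("refresh A", 120), ("refresh B", 345), ("refresh C", 98)],
    ["Operation", "Duration"],
)

# PySpark function variant: sort descending and keep the top 1 row.
df.orderBy(F.col("Duration").desc()).limit(1).show()

# Spark SQL alternative using the LIMIT operator.
df.createOrReplaceTempView("operations")
spark.sql("SELECT * FROM operations ORDER BY Duration DESC LIMIT 1").show()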

Looping through Data in Microsoft Fabric PySpark Notebooks

Gilbert Quevauvilliers builds a loop:

Continuing with my existing blog series on what I’m learning with notebooks and PySpark.

Today, I’m going to explain to you how I found a way to loop through data in a notebook.

In this example, I’m going to show you how I loop through a range of dates, which can then be used in a subsequent query to extract data by passing each date into a DAX query.

Click through for Gilbert’s example. Here’s an alternative using something called a list comprehension. First, build a function that does what you want to do—that’d be the innards of Gilbert’s Python code, lines 31-54.

def perform_dax_query(row):
    var_Date = row["Date"]
    ...
    display(df_DAX_QueryResult)

Then, call that function for each row:

[perform_dax_query(row) for row in data_collect]

In this particular scenario, I’d personally stick with Gilbert’s composition, but in cases where you’re transforming a list of elements into a new list—for example, if you’re performing some data cleanup for each row in a list and you want the output to be a new list with cleaned-up data—then the list comprehension works really well.
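
To make that concrete, here is a small, generic sketch (not tied to Gilbert’s DAX scenario): a cleanup function applied to each element, with the comprehension collecting the results into a new list.

# Hypothetical values collected from a data frame, e.g. via df.collect().
raw_values = ["  2024-01-01 ", "2024-01-02\n", " 2024-01-03"]

def clean_value(value):
    # Trim stray whitespace from each value.
    return value.strip()

# Unlike the perform_dax_query example, where each call only displays a
# result, this comprehension builds and returns a new list of cleaned values.
cleaned = [clean_value(v) for v in raw_values]
print(cleaned)  # ['2024-01-01', '2024-01-02', '2024-01-03']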

An Introduction to the Native Execution Engine in Microsoft Fabric

Sandeep Pawar gives us the gentle version:

At MS Build, Microsoft announced the availability of the Native Execution Engine in Fabric. The product team will have detailed documentation and technical details, but I will attempt to provide an ELI5 version of what it means and how it works at a 30,000 ft level.

Read on to learn what the Native Execution Engine is and why Microsoft would use it versus running everything in the Java Virtual Machine like other services in the broader Hadoop ecosystem have historically done (at least at the beginning, before they built their own native execution engines!).

Adding the Current Date and Time to a PySpark Data Frame

Gilbert Quevauvilliers wants to know what time it is:

How to add current DateTime to existing PySpark data frame in a Fabric Notebook

In the blog post below, I am going to describe how to add the current Date Time to your existing Spark data frame.

This is really useful when I am inserting data into a Fabric Lakehouse table, and I want to know when the data got inserted.

Read on for the answer.
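
Gilbert’s post has the details; as a minimal sketch of the general idea, Spark’s built-in current_timestamp() function applied to a made-up data frame looks like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data frame standing in for the one you want to load.
df = spark.createDataFrame([(1, "load A"), (2, "load B")], ["Id", "Description"])

# Add an audit column recording when the row was processed.
df_with_ts = df.withColumn("InsertedDateTime", F.current_timestamp())
df_with_ts.show(truncate=False)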

Comparing SQL Server to Databricks

Paul Andrew makes a comparison:

Over the many years I’ve been working in the data/IT industry, Microsoft SQL Server and Azure Databricks have easily become my two favourite data processing tools. When Databricks became a first-class resource in Microsoft Azure, it was a big moment for the evolution of the data platform architectures I’ve designed and built (but architecture isn’t the focus for this blog). That said, rather than considering the tooling and technology as an evolution, I find a lot of people drawing comparisons between the products. This often leads to confusion and friction, as they ultimately offer a lot of different capabilities, with only some common areas where comparisons could be made.

Read on for Paul’s thoughts. Spoilers: I agree with pretty much all of it.

Visualizing a Spark Execution Plan

Gerhard Brueckl builds a very helpful tool:

I recently found myself in a situation where I had to optimize a Spark query. Coming from a SQL world originally, I knew how valuable a visual representation of an execution plan can be when it comes to performance tuning. Soon I realized that there is no easy-to-use tool or snippet which would allow me to do that. There are tools like DataFlint, the ubiquitous Spark monitoring UI, or the Spark explain() function, but they are either hard to use or hard to get up and running, especially as I was looking for something that works in both of my favorite Spark engines, Databricks and Microsoft Fabric.

Read on for Gerhard’s answer, including an example of it in action.
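
For reference, the plain-text starting point Gerhard mentions is the explain() function. This throwaway example (not part of Gerhard’s tool) shows what that raw output looks like:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").count()

# Print the parsed, analyzed, optimized, and physical plans as plain text,
# which is the output a visual representation makes far easier to read.
agg.explain(True)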

Speeding up Databricks Lakehouse Queries with Redis

Drew Furgiuele has the need for speed:

Since compute and storage are now separated, this means that any time you want to work with your data, you need some form of compute engine that is capable of connecting to and reading your data from your storage locations. Compute engines vary, but one of the best is Apache Spark, which gives you a great distributed compute layer suitable for all sorts of workloads, whether they be analytical and ad-hoc queries, dashboard or BI workloads, data engineering related, or even data science or AI/ML use cases. It really can do it all, and it does it very well.

But what about operational use cases? For instance: let's say your Lakehouse is hosting some data that is critical to customer-facing systems that demand low-latency response times, such as real-time user lookups, API interfaces, or event-driven systems. Sometimes the overhead required to take a query, schedule it, and run it can be in the hundreds of milliseconds. For some workloads, that's a lifetime.

Read on to see how you can build a caching layer on top of certain lakehouse operations when some operation needs to be as fast as possible.
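
As a rough illustration of the cache-aside pattern (not Drew’s actual implementation), here is a sketch using the redis-py client; the Redis endpoint, the key naming, and the lakehouse.customers table are all assumptions for the example.

import json

import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed connection details; point this at your own Redis endpoint.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_customer(customer_id: int) -> dict:
    # Cache-aside lookup: try Redis first, fall back to the Lakehouse table.
    key = f"customer:{customer_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: run the (slower) Spark query against the Lakehouse.
    rows = (
        spark.table("lakehouse.customers")  # hypothetical table name
        .where(f"customer_id = {customer_id}")
        .limit(1)
        .collect()
    )
    result = rows[0].asDict() if rows else {}

    # Store the result with a short TTL so hot lookups skip Spark next time.
    cache.setex(key, 300, json.dumps(result, default=str))
    return result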
