
Category: Spark

An Overview of Spark in Microsoft Fabric

Reza Rad gives people a primer on Apache Spark:

Microsoft Fabric runs some workloads under the Spark engine, but what is it really? In this article, I’ll take you through the question of what Spark is, what benefits it has, how it is associated with Fabric, what configurations you have, and other things you need to know about it.

Reza talks a bit about history, interaction with languages, etc. As a quick addition to the languages list, you can use .NET languages like F# and C# with Spark, though it does involve setting up dotnet/spark and there are some open questions about its future. And I’m not even sure you could get it to work with Microsoft Fabric.


Getting the Top N Results in a PySpark Notebook

Gilbert Quevauvilliers only needs the top 1:

How to get the TopN rows using Python in Fabric Notebooks

When working with data there are sometimes weird and wonderful requirements which must be created in order to get to the desired solution.

In today’s blog post I had a situation where I wanted to get a single row with the highest duration.

Gilbert uses the Spark SQL functions, specifically their Python variants. You could also write a Spark SQL query directly and use the LIMIT operator.
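As a quick sketch of the two flavors (the data frame df, the Duration column, and the view name here are my own illustrative choices, not Gilbert’s):

from pyspark.sql import functions as F

# DataFrame / Python function approach: sort descending and keep the first row
df_top1 = df.orderBy(F.col("Duration").desc()).limit(1)
df_top1.show()

# Spark SQL approach: register a view and use the LIMIT operator
df.createOrReplaceTempView("vw_durations")
df_top1_sql = spark.sql("SELECT * FROM vw_durations ORDER BY Duration DESC LIMIT 1")
df_top1_sql.show()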


Looping through Data in Microsoft Fabric PySpark Notebooks

Gilbert Quevauvilliers builds a loop:

Continuing with my existing blog series on what I’m learning with notebooks and PySpark.

Today, I’m going to explain to you how I found a way to loop through data in a notebook.

In this example, I’m going to show you how I loop through a range of dates, which can then be used in a subsequent query to extract data by passing through each date into a DAX query.

Click through for Gilbert’s example. Here’s an alternative using something called a list comprehension. First, build a function that does what you want to do—that’d be the innards of Gilbert’s Python code, lines 31-54.

def perform_dax_query(row):
    # Pull the date value out of the current row
    var_Date = row["Date"]
    ...  # the remainder of Gilbert's query logic (his lines 31-54) goes here
    display(df_DAX_QueryResult)

Then, call that function for each row:

[perform_dax_query(row) for row in data_collect]

In this particular scenario, I’d personally stick with Gilbert’s composition, but in cases where you’re transforming a list of elements into a new list—for example, if you’re performing some data cleanup for each row in a list and you want the output to be a new list with cleaned-up data—then the list comprehension works really well.
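To make that concrete, here’s a contrived cleanup example (the column names and cleanup rules are made up) where the comprehension takes data_collect in and hands back a new, tidied list:

# Hypothetical row cleanup: trim whitespace and standardize casing and types
def clean_row(row):
    return {
        "Code": row["Code"].strip().upper(),
        "Duration": int(row["Duration"])
    }

# One list in, a new list out
cleaned_rows = [clean_row(row) for row in data_collect]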


An Introduction to the Native Execution Engine in Microsoft Fabric

Sandeep Pawar gives us the gentle version:

At MS Build, Microsoft announced availability of Native Execution Engine in Fabric. The product team will have detailed documentation and technical details but I will attempt to provide an ELI5 version of what it means and how it works at a 30000 ft level.

Read on to learn what the Native Execution Engine is and why Microsoft would use it versus running everything in the Java Virtual Machine like other services in the broader Hadoop ecosystem have historically done (at least at the beginning, before they built their own native execution engines!).


Adding the Current Date and Time to a PySpark Data Frame

Gilbert Quevauvilliers wants to know what time it is:

How to add current DateTime to existing PySpark data frame in a Fabric Notebook

In the blog post below, I am going to describe how to add the current Date Time to your existing Spark data frame.

This is really useful when I am inserting data into a Fabric Lakehouse table, and I want to know when the data got inserted.

Read on for the answer.
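One common way to do this (a minimal sketch, not necessarily Gilbert’s exact code; df is an existing data frame and the column name insert_datetime is my own choice) is PySpark’s built-in current_timestamp() function:

from pyspark.sql import functions as F

# Stamp every row with the time the data frame was processed
df_with_load_time = df.withColumn("insert_datetime", F.current_timestamp())
df_with_load_time.show(5, truncate=False)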


Comparing SQL Server to Databricks

Paul Andrew makes a comparison:

Over the many years I’ve been working in the data/IT industry, Microsoft SQL Server and Azure Databricks have easily become my two favourite data processing tools. When Databricks became a first-class resource in Microsoft Azure, it was a big moment for the evolution of the data platform architectures I’ve designed and built (but architecture isn’t the focus for this blog). That said, rather than considering the tooling and technology as an evolution, I find a lot of people drawing comparisons between the products. This often leads to confusion and friction, as they ultimately offer a lot of different capabilities, with only some common areas where comparisons could be made.

Read on for Paul’s thoughts. Spoilers: I agree with pretty much all of it.


Visualizing a Spark Execution Plan

Gerhard Brueckl builds a very helpful tool:

I recently found myself in a situation where I had to optimize a Spark query. Coming from a SQL world originally, I knew how valuable a visual representation of an execution plan can be when it comes to performance tuning. Soon I realized that there is no easy-to-use tool or snippet which would allow me to do that. There are tools like DataFlint, the ubiquitous Spark monitoring UI, or the Spark explain() function, but they are either hard to use or hard to get up and running, especially as I was looking for something that works in both of my two favorite Spark engines: Databricks and Microsoft Fabric.

Read on for Gerhard’s answer, including an example of it in action.
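For comparison, the text-only baseline Gerhard mentions is the built-in explain() function, which prints the plan but leaves you to parse it by eye (the data frames below are illustrative):

# Spark's own plan output; "formatted" is one of several explain modes
df_result = df_orders.join(df_customers, "customer_id").groupBy("region").count()
df_result.explain("formatted")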


Speeding up Databricks Lakehouse Queries with Redis

Drew Furgiuele has the need for speed:

Since compute and storage are now separated, this means that any time you want to work with your data, you need some form of compute engine that is capable of connecting to and reading your data from your storage locations. Compute engines vary, but one of the best is Apache Spark, which gives you a great distributed compute layer suitable for all sorts of workloads, whether they be analytical and ad-hoc queries, dashboard or BI workloads, data engineering related, or even data science or AI/ML use cases. It really can do it all, and it does it very well.

But what about operational use cases? For instance: let’s say your Lakehouse is hosting some data that is critical to customer-facing systems that demand low-latency response times, such as real-time user lookups, API interfaces, or event-driven systems. Sometimes the overhead required to take a query, schedule it, and run it can be in the hundreds of milliseconds. For some workloads, that’s a lifetime.

Read on to see how you can build a caching layer on top of certain lakehouse operations when some operation needs to be as fast as possible.
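The general cache-aside pattern looks something like the sketch below. This is my own minimal example, not Drew’s implementation; the key names, table, and expiration are assumptions, and it uses the redis-py client against a standalone Redis instance:

import json
import redis

# Hypothetical Redis connection; in practice this would point at your cache cluster
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_customer(customer_id: int) -> dict:
    cache_key = f"customer:{customer_id}"
    cached = cache.get(cache_key)
    if cached is not None:
        # Cache hit: sub-millisecond lookup, no Spark involved
        return json.loads(cached)

    # Cache miss: fall back to the lakehouse table (authoritative but slower)
    rows = (spark.table("lakehouse.customers")
                 .filter(f"customer_id = {customer_id}")
                 .limit(1)
                 .collect())
    result = rows[0].asDict() if rows else {}

    # Write through to Redis with a five-minute expiration
    cache.set(cache_key, json.dumps(result, default=str), ex=300)
    return result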


Date Calculation Bug in Power Query ODBC Code

Meagan Longoria files a report:

I was working on an imported Power BI semantic model, adding some fiscal year calculations to my date table. The date table was sourced from a view in Databricks Unity Catalog. I didn’t have access to add more fields to the view, so I was adding the fields in Power Query first, with plans to request they be added to the view in the future. I got some unexpected results, which turned into a bug being logged for the ODBC code for Power Query.

If you are only analyzing data in the last 20 years, you won’t see this bug. But if you are doing long-term analysis including years before 2000, you might just run into it.

Read on to see the bug, how you can replicate it, and three workarounds you can use to avoid it.


Testing with Databricks

Anh Nguyen Viet shares some thoughts on testing in Databricks:

With diverse support and a focus on workspace uniformity, Databricks can bring many benefits to the testing process, such as the following:

  • Centralized: Databricks provides an integrated environment for many teams (including the testing team), allowing them to work in a focused and productive way. Integrating tools and services in a single platform reduces fragmentation and increases efficiency during testing.
  • Consistency: Databricks offers integrated tools and services, giving testers a uniform and efficient working environment and letting them work consistently across the entire testing process.
  • Enhanced Productivity and Cost Reduction: With the flexibility and efficiency in data processing that Databricks supports, testers can save time and effort, increasing productivity and reducing project costs. Using these utilities properly helps automate the testing process and delivers better results.

Read on for a few tips around building tests using Databricks.
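As a small, hedged illustration of the kind of test you might write in a Databricks or Fabric notebook (the transformation and the use of pyspark.testing, available in Spark 3.5 and later, are my own choices, not from Anh’s post):

from pyspark.sql import functions as F
from pyspark.testing import assertDataFrameEqual  # ships with Spark 3.5+

# The transformation under test: build a full name from two columns
def add_full_name(df):
    return df.withColumn("FullName", F.concat_ws(" ", "FirstName", "LastName"))

df_input = spark.createDataFrame([("Ada", "Lovelace")], ["FirstName", "LastName"])
df_expected = spark.createDataFrame(
    [("Ada", "Lovelace", "Ada Lovelace")],
    ["FirstName", "LastName", "FullName"]
)

# Fails loudly if the schemas or rows differ
assertDataFrameEqual(add_full_name(df_input), df_expected)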
