Spark – Page 7 – Curated SQL

An Introduction to the Native Execution Engine in Microsoft Fabric

Published 2024-05-28 by Kevin Feasel

Sandeep Pawar gives us the gentle version:

At MS Build, Microsoft announced availability of Native Execution Engine in Fabric. The product team will have detailed documentation and technical details but I will attempt to provide an ELI5 version of what it means and how it works at a 30000 ft level.

Read on to learn what the Native Execution Engine is and why Microsoft would use it versus running everything in the Java Virtual Machine like other services in the broader Hadoop ecosystem have historically done (at least at the beginning, before they built their own native execution engines!).

Comments closed

Adding the Current Date and Time to a PySpark Data Frame

Published 2024-05-09 by Kevin Feasel

Gilbert Quevauvilliers wants to know what time it is:

How to add current DateTime to existing PySpark data frame in a Fabric Notebook

In the blog post below, I am going to describe how to add the current Date Time to your existing Spark data frame.

This is really useful when I am inserting data into a Fabric Lakehouse table, and I want to know when the data got inserted.

Read on for the answer.

Comments closed

Comparing SQL Server to Databricks

Published 2024-05-03 by Kevin Feasel

Paul Andrew makes a comparison:

Microsoft SQL Server and Azure Databricks over the many years I’ve been working in the data/IT industry have easily become my two favourite data processing tools. When Databricks became a first-class resource in Microsoft Azure, it was a big moment for the evolution of the data platform architectures I’ve designed and built (but architecture isn’t the focus for this blog). That said, rather than considering the tooling and technology as an evolution, I find a lot of people drawing comparisons between the products. This often leads to confusion and friction, as they are ultimately offering a lot of different capabilities, with only some common areas where comparisons could be made.

Read on for Paul’s thoughts. Spoilers: I agree with pretty much all of it.

Comments closed

Visualizing a Spark Execution Plan

Published 2024-05-01 by Kevin Feasel

Gerhard Brueckl builds a very helpful tool:

I recently found myself in a situation where I had to optimize a Spark query. Coming from a SQL world originally I knew how valuable a visual representation of an execution plan can be when it comes to performance tuning. Soon I realized that there is no easy-to-use tool or snippet which would allow me to do that. Though, there are tools like DataFlint, the ubiquitous Spark monitoring UI or the Spark explain() function but they are either hard to use or hard to get up running especially as I was looking for something that works in both of my two favorite Spark engines being Databricks and Microsoft Fabric.

Read on for Gerhard’s answer, including an example of it in action.

Comments closed

Speeding up Databricks Lakehouse Queries with Redis

Published 2024-04-11 by Kevin Feasel

Drew Furgiuele has the need for speed:

Since compute and storage are now separated, this means that any time you want to work with your data, you need some form of compute engine that is capable of connecting to and reading your data from your storage locations. Compute engines vary, but one of the best is Apache Spark, which gives you a great distributed compute layer suitable for all sorts of workloads, whether they be analytical and ad-hoc queries, dashboard or BI workloads, data engineering related, or even data science or AI/ML use cases. It really can do it all, and it does it very well.

But what about use operational use cases? For instance: let’s say your Lakehouse is hosting some data that is critical to customer-facing systems that demand low-latency response times, such as real-time users lookups, API interfaces, or event-driven systems, sometimes the overhead required to take a query, schedule it, and run it can be in the hundreds of milliseconds. For some workloads, that’s a lifetime.

Read on to see how you can build a caching layer on top of certain lakehouse operations when some operation needs to be as fast as possible.

Comments closed

Date Calculation Bug in Power Query ODBC Code

Published 2024-04-03 by Kevin Feasel

Meagan Longoria files a report:

I was working on an imported Power BI semantic model, adding some fiscal year calculations to my date table. The date table was sourced from a view in Databricks Unity Catalog. I didn’t have access to add more fields to the view, so I was adding the fields in Power Query first, with plans to request they be added to the view in the future. I got some unexpected results, which turned into a bug being logged for the ODBC code for Power Query.

If you are only analyzing data in the last 20 years, you won’t see this bug. But if you are doing long-term analysis including years before 2000, you might just run into it.

Read on to see the bug, how you can replicate it, and three workarounds you can use to avoid it.

Comments closed

Testing with Databricks

Published 2024-03-26 by Kevin Feasel

Anh Nguyen Viet shares some thoughts on testing in Databricks:

With diverse support and a focus on workspace uniformity, Databricks can bring many benefits to the testing process, such as the following:

Centralized: Databricks provides an integrated environment for many teams (including testing team also), allowing them to work focused and productive. Integrating tools and services in a single platform reduces fragmentation and increases efficiency during testing.

Consistency: Databricks offers integrated tools and services, allowing testers to work consistently across the entire testing process as a uniform and efficient working environment.

Enhanced Productivity and Cost Reduction: With the flexibility and efficiency in data processing supported by DataBricks, testers can save time and effort, thereby increasing work productivity and reducing project costs. Utilizing utilities properly helps automate the testing process and delivers better results.

Read on for a few tips around building tests using Databricks.

Comments closed

Changing the Timeout of a Spark Session in Microsoft Fabric

Published 2024-03-20 by Kevin Feasel

Koen Verbeeck doesn’t have time to wait:

You might know the feeling: you’re writing code in a Notebook in Microsoft Fabric and suddenly you have to leave your workstation for a while. Someone ran the doorbell (you’re working from home and you get some parcels delivered), or you took a coffee break with some colleagues. When you return to your notebook, the Spark session has timed out and when you run a cell, you have to wait for the damn thing to restart again. The agony, waiting for 2-3 minutes for the session to start, and only after that the actual code can start running.

Read on to see how you can set the timeout to a custom value, assuming you’re okay with paying for the Spark cluster to sit around until it times out.

Comments closed

Retrieving Spark Session Config Variables from Microsoft Fabric

Published 2024-03-14 by Kevin Feasel

Koen Verbeeck gets some settings:

I was trying some stuff out in a notebook on top of a Microsoft Fabric Lakehouse. I was wondering what some of the default values are of the configuration variables, and if there’s an easy way to retrieve them all. Luckily there is. In the code, I’m using Scala because it has a nice GetAll() function.

Click through for an example of how to use this. And bonus points for using Scala instead of Python here.

Comments closed

Using Apache Spark in Microsoft Fabric

Published 2024-03-01 by Kevin Feasel

Ginger Grant gives us an overview of where we can use Apache Spark in Microsoft Fabric:

If you have used Spark in Azure Synapse, prepare to be pleasantly surprised with the compute experience in Microsoft Fabric as Spark compute starts a lot faster because the underlying technology has changed. The Data Engineering and Data Science Fabric experiences include a managed Spark compute, which like previous Spark compute charges you when it is in use. The difference is the nodes are reserved for you, rather than allocated when you start the compute which results in compute starting in 30 seconds or less versus the 4 minutes of waiting it takes for Azure Synapse compute to start. If you have different capacity needs that a default managed Spark compute will not provide, you can always create a custom pool. Custom pools are created in a specific workspace, so you will need Administrator permissions on the workspace to create them. You can choose to make the new pool your default pool as well, so it will be what starts in the workspace.

Read on for more of Ginger’s thoughts on the matter, including how you can use Copilot in Microsoft Fabric (if you pay for it) to help generate Spark code.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28

Category: Spark