Press "Enter" to skip to content

Category: Python

Custom Libraries in Microsoft Fabric Data Engineering

Gerhard Brueckl isn’t content with the defaults:

When working with Spark or data engineering in general in Microsoft Fabric, you will sooner or later come to the point where you need to reuse some of the code that you have already written in another notebook. Best practice is to put these code pieces into a central place from where it can be referenced and reused. This way you can make sure all notebooks always use the very same code and it is also easy to develop, update and test the common functions.

As Gerhard mentions, having common notebooks with utilities is fine for when you’re getting started with development, but being able to centralize functions in proper libraries can make that code a lot more useful, not just in the context of the single notebook.

I believe that this does allow for arbitrary code execution, so someone with sufficient permissions to create a notebook and import code from arbitrary locations would be able to execute that code. I think there are ways of limiting this risk (such as not allowing your Fabric hosts to connect to any remote servers other than ones you explicitly allow), but it’s something I’d have to puzzle through.

Comments closed

Vector Search from Scratch

Kanwai Mehreen does a bit of searching:

In this article, I’ll walk you through every step from generating vector representations to searching using cosine similarity, and we’ll even visualize what’s happening behind the scenes. By the end, you’ll not only understand how vector search works but also have a working implementation you can build on. So, let’s get started.

It’s kind of funny how simple this is, but it is. A lot of the complexity is around data quality operations, as well as optimizing the search process.

Comments closed

Debugging Fabric UDFs in Visual Studio Code

Sunitha Muthukrishna takes us through a debugging exercise:

Debugging your code is important to identify issues and mitigate them when you’re working with user data functions in Microsoft Fabric. You want to make sure everything works as it should and that’s where local debugging lets you catch problems in your code without messing with the live environment. In this blog post, I will walk you through the steps to make local debugging easier and faster.

Click through to see what you’ll need, as well as the process to debug a function locally.

Comments closed

Advanced Imputation Techniques via scikit-learn

Ivan Palomares Carrascosa isn’t just using the median:

Missing values appear more often than not in many real-world datasets. There can be instances with missing values in one or several of their attributes for various reasons, such as human error, corrupted data, or incomplete data collection processes, e.g. from surveys with optional fields. While there exist basic strategies to deal with instances or attributes containing missing values, — like removing rows or columns entirely, or imputing missing values with a default value (typically the mean or median of the attribute) — these strategies are sometimes not sufficient.

This article presents some advanced strategies to handle missing data, namely, imputation techniques made possible through a combined use of Pandas and Scikit-learn libraries in Python.

Click through for three such techniques, including an example of how to use the technique and under which circumstances to avoid that technique.

Comments closed

Writing a Python Data Frame to a Lakehouse Table

Gilbert Quevauvilliers continues a series on Python notebooks and DuckDB:

In this blog post I am going to explain how to loop through a data frame to query data and write once to a Lakehouse table.

The example I will use is to loop through a list of dates which I get from my date table, then query an API, append to an existing data frame and finally write once to a Lakehouse table.

Click through for the code, as well as a sample notebook you can use.

Comments closed

Parameterization and Mocking in Python Tests

Aida Gjoka and Russ Hyde show off some capabilities in the pytest library:

Writing tests is one of the best ways to keep your code reliable and reproducible. This post builds on our previous blog about Python testing with pytest Part 1, and explores some of the more advanced features it offers. From parametrised fixtures to mocking and other useful pytest plugins, we will show how to make your tests more reproducible, easier to manage and demonstrate how writing simple tests can save you time in the long run.

Click through to learn more. I’m a huge fan of parameterization in pytest—it’s really easy to do. Mocks are a bit harder to pull off in practice, though quite useful.

Comments closed

Checking Key Vault Access in Microsoft Fabric Spark Notebooks

Marc Lelijveld has clearance:

Working with sensitive data in Microsoft Fabric requires careful handling of secrets, especially when collaborating externally. In a recent customer engagement, I needed to validate access to Azure Key Vault from within a Fabric Notebook, without ever exposing the actual secret values. With only read access granted and no need to manage or update secrets, I focused on confirming that the connection was working as expected.

In this blog, I’ll walk you through the approach, including the setup, code snippets, and logic behind this quick but crucial verification step.

Click through for the full story.

Comments closed