Press "Enter" to skip to content

Category: Python

Debugging Fabric UDFs in Visual Studio Code

Sunitha Muthukrishna takes us through a debugging exercise:

Debugging your code is important to identify issues and mitigate them when you’re working with user data functions in Microsoft Fabric. You want to make sure everything works as it should and that’s where local debugging lets you catch problems in your code without messing with the live environment. In this blog post, I will walk you through the steps to make local debugging easier and faster.

Click through to see what you’ll need, as well as the process to debug a function locally.

Comments closed

Advanced Imputation Techniques via scikit-learn

Ivan Palomares Carrascosa isn’t just using the median:

Missing values appear more often than not in many real-world datasets. There can be instances with missing values in one or several of their attributes for various reasons, such as human error, corrupted data, or incomplete data collection processes, e.g. from surveys with optional fields. While there exist basic strategies to deal with instances or attributes containing missing values, — like removing rows or columns entirely, or imputing missing values with a default value (typically the mean or median of the attribute) — these strategies are sometimes not sufficient.

This article presents some advanced strategies to handle missing data, namely, imputation techniques made possible through a combined use of Pandas and Scikit-learn libraries in Python.

Click through for three such techniques, including an example of how to use the technique and under which circumstances to avoid that technique.

Comments closed

Writing a Python Data Frame to a Lakehouse Table

Gilbert Quevauvilliers continues a series on Python notebooks and DuckDB:

In this blog post I am going to explain how to loop through a data frame to query data and write once to a Lakehouse table.

The example I will use is to loop through a list of dates which I get from my date table, then query an API, append to an existing data frame and finally write once to a Lakehouse table.

Click through for the code, as well as a sample notebook you can use.

Comments closed

Parameterization and Mocking in Python Tests

Aida Gjoka and Russ Hyde show off some capabilities in the pytest library:

Writing tests is one of the best ways to keep your code reliable and reproducible. This post builds on our previous blog about Python testing with pytest Part 1, and explores some of the more advanced features it offers. From parametrised fixtures to mocking and other useful pytest plugins, we will show how to make your tests more reproducible, easier to manage and demonstrate how writing simple tests can save you time in the long run.

Click through to learn more. I’m a huge fan of parameterization in pytest—it’s really easy to do. Mocks are a bit harder to pull off in practice, though quite useful.

Comments closed

Checking Key Vault Access in Microsoft Fabric Spark Notebooks

Marc Lelijveld has clearance:

Working with sensitive data in Microsoft Fabric requires careful handling of secrets, especially when collaborating externally. In a recent customer engagement, I needed to validate access to Azure Key Vault from within a Fabric Notebook, without ever exposing the actual secret values. With only read access granted and no need to manage or update secrets, I focused on confirming that the connection was working as expected.

In this blog, I’ll walk you through the approach, including the setup, code snippets, and logic behind this quick but crucial verification step.

Click through for the full story.

Comments closed

Caching Database Calls in Python with Redis

Levi Masonde stands up a Redis instance:

Databases play a vital role in software applications—they need to keep updated data or state, which is served by the database acting as the source of truth for the application. How the database performs affects how the entire application performs. Besides obvious factors that affect the performance of a database (hardware, database type, networking infrastructure), there are techniques designed to help you improve the performance of your database and ultimately, your applications. One way to do this is to add caching to your database. But which cache technique works best for which application requirements? This article sheds light on one strategy to implement a cache system.

Click through for one pattern of interaction between cache and database. My preference with the cache-aside pattern is to hide the two data platforms from the calling application as much as possible. In a classic object-oriented language like C#, the actual database + cache calls would be in a separate project and would expose methods on classes that were database-agnostic. With Python, I’d use different .py files in the same project unless I wanted to build a wheel file and deploy it to multiple projects, but the concept would still be the same. I’m not the biggest fan of the way that Levi did it, forcing API developers to have knowledge of both the cache and the database, as that increases the risk of a future developer messing something up.

Comments closed

Writing to Microsoft FabricDelta Tables in Python via DuckDB

Gilbert Quevauvilliers does a bit of writing:

When I was exploring how to easily write to Delta Tables with a Python notebook, it took me a considerable amount of time to find out how to do this.

This is my learnings below, and from my point of view it makes it easy to write to a Lakehouse table, like what is done with a PySpark notebook.

Click through for one very important note, as well as the process.

Comments closed