
Category: Python

Calculating the Size of Dataflow Gen2 Staging Lakehouses

Sandeep Pawar busts out the calculator:

My friend Alex Powers (PM, Fabric CAT) wrote a blog post about cleaning up the staging lakehouses generated by Dataflow Gen2. Before reading this blog, go ahead and read his blog first on the mechanics of it and the whys. Note that these are system-generated lakehouses, so at some time in the future they will be automatically purged, but until then the users will be paying the storage cost of these lakehouses. If you want to read more about how Dataflow Gen2 works and whether you should stage or not, read this and this blog.

Read on for a Python script using the SemPy library.
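Sandeep's script uses SemPy to enumerate workspaces and their items, which I won't reproduce here. As a rough illustration of just the aggregation step, the sketch below assumes you have already collected an item listing into a pandas DataFrame with hypothetical `workspace`, `name`, `type`, and `size_bytes` columns (the system lakehouses are named `DataflowsStagingLakehouse`):

```python
import pandas as pd

def staging_lakehouse_sizes(items: pd.DataFrame) -> pd.DataFrame:
    """Filter a Fabric item listing down to Dataflow Gen2 staging
    lakehouses and report their total size per workspace.

    `items` is assumed to have columns: workspace, name, type, size_bytes.
    """
    staging = items[
        (items["type"] == "Lakehouse")
        & items["name"].str.startswith("DataflowsStaging")
    ]
    return (
        staging.groupby("workspace", as_index=False)["size_bytes"]
        .sum()
        .assign(size_gb=lambda d: d["size_bytes"] / 1024**3)
    )
```

The column names are my own assumption for illustration; the real work of pulling sizes out of OneLake is in Sandeep's post.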


Polymorphism in Python

Rajendra Gupta talks object-orientation:

Polymorphism is a popular term in object-oriented programming (OOP) languages. An object can take multiple forms in different ways in polymorphism. For example, a woman takes different roles in her daily life, such as wife, professional, athlete, mother, and daughter, as the diagram below depicts:

Polymorphism isn’t a particularly difficult topic to understand, though because of the way that different languages implement the idea in subtly different ways, it’s good to know what you’re able to do in your language of choice.
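As a quick taste of how Python does it, both method overriding and duck typing give you polymorphic behavior. This minimal sketch (the class names are my own, not from Rajendra's post) shows a single function working against three classes that merely share a method name:

```python
class Employee:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is an employee"


class Athlete(Employee):
    # Method overriding: the subclass replaces the parent's behavior.
    def describe(self):
        return f"{self.name} is an athlete"


class Volunteer:
    # Duck typing: no shared base class, just the same method name.
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a volunteer"


def introduce(person):
    # Polymorphic call: works for any object exposing describe().
    return person.describe()
```

Calling `introduce(Athlete("Maya"))` returns "Maya is an athlete", while `introduce(Volunteer("Sam"))` returns "Sam is a volunteer" — same function, different behavior per object.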


Transferring Linear Model Coefficients

Nina Zumel performs a swap:

A quick glance through the scikit-learn documentation on linear models, or the CRAN task view on Mixed, Multilevel, and Hierarchical Models in R, reveals a number of different procedures for fitting models with linear structure. Each of these procedures meets different needs and constraints, and some of them can be computationally intensive. But in the end, they all have the same underlying structure: the outcome is modelled as a linear combination of input features.

But the existence of so many different algorithms, and their associated software, can obscure the fact that just because two models were fit differently, they don’t have to be run differently. The fitting implementation and the deployment implementation can be distinct. In this note, we’ll talk about transferring the coefficients of a linear model to a fresh model, without a full retraining.

I had a similar problem about 18 months ago, though much easier than the one Nina describes, as I did have access to the original data and simply needed to build a linear regression in Python that matched exactly the one they developed in R. Turns out that’s not as easy to do as you might think: the different languages have different default assumptions that make the results similar but not the same, and piecing all of this together took a bit of sleuthing.
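The core idea is easy to sketch in plain NumPy: however the coefficients were fit, prediction is just a dot product plus an intercept, so the numbers can be carried into a "deployment" environment without any retraining. A minimal sketch on synthetic data (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training" environment: fit a linear model by ordinary least squares.
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.01, size=200)

A = np.column_stack([X, np.ones(len(X))])   # append an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
weights, intercept = coef[:-1], coef[-1]

# "Deployment" environment: no refit, just the transferred numbers.
def predict(X_new, weights, intercept):
    return X_new @ weights + intercept

preds = predict(X, weights, intercept)
```

The intercept column is exactly the kind of cross-language default I tripped over: R's `lm` adds one automatically, while in raw linear algebra you must append it yourself.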


An Introduction to Streamlit

I have started a new video series:

In this video, I talk about Streamlit, a great Python library for building data applications quickly. We discuss what data applications are, get an idea of how Streamlit compares to other code-first data visualization techniques, and start building a demo application. I also toss in a lengthy sidebar on Python virtual environments because of how important they are.

Streamlit certainly has its foibles—many of which I’ll cover in the series—but I like it a lot as a simple way of building data applications.
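If you would rather skim code than watch, a minimal Streamlit app looks something like this (my own toy example, not the demo from the video). Note that it must be launched with `streamlit run`, not plain `python`:

```python
# app.py -- run with: streamlit run app.py
import numpy as np
import pandas as pd
import streamlit as st

st.title("Demo Data Application")

# Streamlit reruns the whole script on every widget interaction --
# that rerun model is the source of several of its foibles.
n = st.slider("Number of points", min_value=10, max_value=500, value=100)

df = pd.DataFrame({
    "x": np.arange(n),
    "y": np.random.default_rng(42).normal(size=n).cumsum(),
})

st.line_chart(df, x="x", y="y")
st.dataframe(df.head())
```

Installing Streamlit into a virtual environment first (`python -m venv .venv`, then `pip install streamlit`) is exactly why that sidebar on virtual environments is in the video.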


Automate the Power BI Incremental Refresh Policy via Semantic Link Labs

Gilbert Quevauvilliers needs to get rid of some data fast:

The scenario here is that quite often there is a requirement to keep data only from a specific start date, or to keep data for the last N years (starting from the first day in January).

Currently, in Power BI, this is not possible using the default incremental refresh settings. Typically, you must keep more data than is required.

It is best illustrated by using a working example.

Check out that scenario and how you can use the Semantic Link Labs Python library to resolve it.


Parquet Files in Pandas

Chris LaGreca works with Parquet files:

Apache Parquet has become one of the de facto standards in modern data architecture. This open source, columnar data format serves as the backbone of many high-powered analytics and machine learning pipelines, supported by many of the world's most sophisticated platforms and services. AWS, Azure, and Google Cloud all offer built-in support for Parquet, while big data tools like Hadoop, Spark, Hive, and Databricks natively support Parquet, allowing seamless data processing and analytics. Parquet is also foundational in data lakehouse formats like Delta Lake, Iceberg, and Hudi, where its features are further enhanced.

Parquet is efficient and has broad industry support. In this post, I will showcase a few simple techniques to demonstrate working with Parquet and leveraging its special features using Pandas.

Pandas does make this rather easy, as Chris shows.


Parallel Download in Oracle Object Storage

Brendan Tierney continues a series on Oracle Object Storage:

In previous posts, I’ve given example Python code (and functions) for processing files into and out of OCI Object and Bucket Storage. One of these previous posts includes code and a demonstration of uploading files to an OCI Bucket using the multiprocessing package in Python.

Building upon these previous examples, the code below will download a Bucket using parallel processing. Like my last example, this code is based on the example code I gave in an earlier post on functions within a Jupyter Notebook.

Click through for the code.
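Brendan's code handles the OCI specifics (clients, namespaces, buckets), so the sketch below shows only the parallel fan-out pattern, with a stubbed download function. `download_object` is a placeholder you would replace with a real OCI `get_object` call:

```python
from concurrent.futures import ThreadPoolExecutor

def download_object(object_name):
    """Placeholder for the real OCI call, roughly:
    object_storage.get_object(namespace, bucket_name, object_name),
    followed by writing response.data to a local file."""
    return f"downloaded {object_name}"

def download_bucket(object_names, max_workers=8):
    # Threads suit I/O-bound downloads; Brendan's post uses the
    # multiprocessing package instead, and the fan-out shape is the same.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_object, object_names))
```

`pool.map` preserves the input order of the object names, which keeps the results easy to match back to the bucket listing.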


Tips for Choosing a Classifier

I’ve wrapped up yet another series:

In this video, I wrap up the series on classification and provide some quick-and-dirty tips on when to use each of the classification algorithms we have discussed.

This was a series I really enjoyed. I’ve had a talk on the topic for a few years, but getting the opportunity to dig in deeper and spend a few hours on the topic was nice. It also helped me fill in some gaps in my understanding and fix a few long-standing bugs in my demo code, so it’s got that going for it as well.


Suspend and Resume Microsoft Fabric Capacity

Olivier Van Steenlandt saves some cash:

With only a limited budget for exploring and testing new tools, I had to figure out how to use my budget efficiently. Therefore, before making any decisions, I looked at the Microsoft Fabric pricing and possibilities.

If you want to take a look at the Microsoft Fabric pricing models, you can find an overview via the following link: Microsoft Fabric – Pricing | Microsoft Azure

To avoid any surprises and to be as cost-effective as possible, I created an easy Python script that I can use to pause and start my Microsoft Fabric capacity, or, better said, suspend and resume it.

I highly recommend this for any organization that does not need 24/7 uptime for Fabric capacity. If you run your system 12 hours a day instead of 24, it takes your F64 capacity from $8k a month to $4k.
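Olivier's script drives the Azure management REST API. As a sketch of the shape of that call, the helper below only assembles the suspend/resume endpoint; the `api-version` value is an assumption (check the current Azure REST documentation), and the actual POST with a bearer token is left out:

```python
API_VERSION = "2023-11-01"  # assumed; verify against current Azure REST docs

def capacity_action_url(subscription_id, resource_group, capacity_name, action):
    """Build the ARM URL for suspending or resuming a Fabric capacity.

    action is 'suspend' or 'resume'; POST the URL with a bearer token.
    """
    if action not in ("suspend", "resume"):
        raise ValueError("action must be 'suspend' or 'resume'")
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Fabric/capacities"
        f"/{capacity_name}/{action}"
        f"?api-version={API_VERSION}"
    )
```

Wiring a pair of these calls into a scheduler is how you get that 12-hours-a-day saving without anyone remembering to click pause.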
