Press "Enter" to skip to content

Category: Python

Kernel Methods in Python

Matthew Mayo does a bit of kernel work:

Kernel methods are a powerful class of machine learning algorithms that allow us to perform complex, non-linear transformations of data without explicitly computing the transformed feature space. These methods are particularly useful when dealing with high-dimensional data or when the relationship between features is non-linear.

Kernel methods rely on the concept of a kernel function, which computes the dot product of two vectors in a transformed feature space without explicitly performing the transformation. This is known as the kernel trick. The kernel trick allows us to work in high-dimensional spaces efficiently, making it possible to solve complex problems that would be computationally infeasible otherwise.
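
As a quick illustration of the kernel trick in practice (a minimal scikit-learn sketch, not taken from the article), an RBF-kernel SVM separates concentric circles that no linear boundary could:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# RBF kernel k(x, z) = exp(-gamma * ||x - z||^2) computes an inner product
# in an implicit infinite-dimensional feature space -- the kernel trick
clf = SVC(kernel="rbf", gamma=2.0).fit(X, y)
print(clf.score(X, y))  # near-perfect accuracy, no explicit transformation
```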

Read on for the pros and cons of kernel methods and a pair of techniques that use them.

Near Real-Time Data Plotting in Python

Hristo Hristov wants to know where the International Space Station is:

Gathering data on events as they occur in real time is a powerful and popular technique in scientific and industrial computing. If we can query an online REST API representing the position of the International Space Station (ISS), how can we visualize these data in real time? How do you plot the data points as soon as they arrive and observe changes in the station’s position immediately? Let’s look at using Python for a real-time plot of data.
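
As a rough sketch of the idea (the article’s actual solution may differ), you can poll a public ISS-position endpoint such as Open Notify and redraw a Matplotlib figure on each response:

```python
import requests
import matplotlib.pyplot as plt

plt.ion()  # interactive mode so the figure updates without blocking
fig, ax = plt.subplots()
lons, lats = [], []

for _ in range(60):  # poll for roughly five minutes
    resp = requests.get("http://api.open-notify.org/iss-now.json", timeout=10)
    pos = resp.json()["iss_position"]
    lons.append(float(pos["longitude"]))
    lats.append(float(pos["latitude"]))
    ax.clear()
    ax.plot(lons, lats, "o-", markersize=3)
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title("ISS ground track (near real-time)")
    plt.pause(5)  # redraw, then wait 5 seconds between requests
```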

Click through for the solution and plenty of explanation along the way.

Data Masking in Azure Databricks

Rayis Imayev hides some information:

One way to protect sensitive information from end users in a database is through dynamic masking. In this process, the actual data is not altered; however, when the data is exposed or queried, the results are returned with modified values, or the actual values are replaced with special characters or notes indicating that the requested data is hidden for protection purposes.

In this blog, we will discuss a different approach to protecting data, where personally identifiable information (PII, a term you will frequently encounter when reading about data protection and data governance) is actually changed or updated in the database or persistent storage. This ensures that even if someone gains access to the data, nothing will be compromised. This is usually needed when refreshing a production database or dataset containing PII data elements to a lower environment. Your QA team will appreciate having a realistic data volume that resembles the production environment, but with masked data.
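
For a flavor of what persistent masking can look like in Databricks (a simplified sketch with hypothetical table and column names, not Rayis’s exact process):

```python
from pyspark.sql import functions as F

# `spark` is the ambient SparkSession in a Databricks notebook
df = spark.table("prod.customers")

masked = (
    df
    # One-way hash preserves joinability without exposing the email
    .withColumn("email", F.sha2(F.col("email"), 256))
    # Keep only the last four digits of the phone number
    .withColumn("phone", F.regexp_replace("phone", r"\d(?=\d{4})", "*"))
)

# Write the masked copy to the lower (QA) environment
masked.write.mode("overwrite").saveAsTable("qa.customers")
```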

Rayis goes into depth on the process. I could also recommend checking out the article on row filters and column masks for more information.

Custom SCD2 with PySpark

Abhishek Trehan creates a type-2 slowly changing dimension:

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. Implementing it is considered one of the most critical ETL tasks in tracking the history of dimension records.

The purpose of an SCD Type 2 (SCD2) is to preserve the history of changes. If a customer changes their address, for example, or any other attribute, an SCD2 allows analysts to link facts back to the customer and their attributes in the state they were in at the time of the fact event.
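
A condensed sketch of the SCD2 pattern with Delta Lake in PySpark (hypothetical table and column names; the article builds this out properly):

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

dim = DeltaTable.forName(spark, "dim_customer")
current = spark.table("dim_customer").where("is_current = true")
updates = spark.table("staging_customer")

# Rows that are new or whose tracked attribute (here, address) changed
changed = (
    updates.alias("u")
    .join(current.alias("d"),
          F.col("u.customer_id") == F.col("d.customer_id"), "left")
    .where("d.customer_id IS NULL OR d.address <> u.address")
    .select("u.*")
)

# 1) Expire the superseded current rows
(dim.alias("d")
 .merge(changed.alias("c"),
        "d.customer_id = c.customer_id AND d.is_current = true")
 .whenMatchedUpdate(set={"is_current": "false", "end_date": "current_date()"})
 .execute())

# 2) Append the new versions as current rows
(changed
 .withColumn("start_date", F.current_date())
 .withColumn("end_date", F.lit(None).cast("date"))
 .withColumn("is_current", F.lit(True))
 .write.mode("append").saveAsTable("dim_customer"))
```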

Read on for an implementation in Python.

The Importance of Virtual Environments in Python

Jack Wallen proselytizes for virtual environments:

When developing with Python, chances are pretty good that you’ll need to install various libraries, dependencies and apps to get your project started. The good news is that (in most cases) those installations are pretty straightforward (thanks to pip and other tools).

Problems can arise, however, if you simply install all of those project requirements on your system. It’s like installing any given application, hoping it won’t cause problems with other applications, your OS or your data. In most cases, it’s safe, but there’s always that one instance where things can quickly go awry.
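
The usual workflow is a couple of shell commands (python -m venv .venv, then activate), but the standard library exposes the same machinery programmatically:

```python
import venv

# Equivalent to running `python -m venv .venv` from a shell
builder = venv.EnvBuilder(with_pip=True)
builder.create(".venv")

# Afterwards, activate from a shell so installs stay isolated:
#   source .venv/bin/activate    (Linux/macOS)
#   .venv\Scripts\activate       (Windows)
# Any `pip install` now touches only this environment.
```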

Read on to see how virtual environments can alleviate many of these pains. It took a while for me to understand exactly why virtual environments are so important, but this is definitely something I recommend doing if you work with Python in any capacity.

Finding Capacity-Level Fabric Settings with Semantic Link Labs

Sandeep Pawar lists some Microsoft Fabric properties:

Just before the holidays last year, Michael Kovalsky released version 0.8.10 of Semantic Link Labs with a bunch of new helpful functions, among them list_server_properties(), which lists properties of an Analysis Services instance. As you know, in Fabric, the workspace acts as a server which is tied to a capacity. You define these server properties in the Capacity Settings. As far as I am aware, there wasn’t an API to get these capacity settings for audit/monitoring/debugging purposes. With this new function, you can programmatically get the Semantic Model (i.e., Power BI workload) settings.
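
Pending the full example in the post, calling the function from a Fabric notebook looks roughly like this (a sketch; exact parameters may vary by version):

```python
# In a Fabric notebook; assumes semantic-link-labs >= 0.8.10
%pip install semantic-link-labs

import sempy_labs as labs

# Returns a DataFrame of Analysis Services server properties for the
# capacity behind the current workspace (signature may differ slightly)
properties = labs.list_server_properties()
display(properties)
```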

Click through for an example.

Prompt Flow in Azure AI

Tomaz Kastrun continues a series on Azure AI. First up is an introduction to Prompt Flow:

Prompt flow in Azure AI Foundry is a development tool for designing flows (streamlines) for the complete end-to-end development cycle of an LLM-based AI application. You can create, iterate, test, orchestrate, debug, and monitor your flows.

After that, we get a demonstration of Prompt Flow in Python:

Prompty gives you the ability to create an end-to-end solution, like RAG, where you can chat with an LLM over an article or document, or where you can ask it to classify the input data (a list of URLs, …).

Prompty is a markdown file, structured in YAML, that encapsulates a series of metadata fields pivotal for defining the model’s configuration and the inputs. After this front matter comes the prompt template, articulated in the Jinja format.
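
An illustrative skeleton of a .prompty file (field names vary by provider and model; this only shows the YAML front matter plus Jinja template structure the quote describes):

```yaml
---
name: classify_urls
description: Classify a list of URLs by topic
model:
  api: chat
  configuration:
    type: azure_openai
    azure_deployment: gpt-4o   # hypothetical deployment name
inputs:
  urls:
    type: string
---
system:
You are an assistant that classifies URLs into topics.

user:
Classify the following URLs: {{urls}}
```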

Switching between Python and PySpark Notebooks in Fabric

Sandeep Pawar wants to save some money:

File this under a test I have been wanting to do for some time. If I am exploring some data in a Fabric notebook using PySpark, can I switch between Python and PySpark engines with minimal code changes in an interactive session? The goal is to use the Python notebook for some exploration or use existing PySpark/SparkSQL or develop the logic in a low compute environment (to save CUs) and scale it in a distributed Spark environment. Understandably, there will be limitations with this approach given the difference in environments, configs etc., but can it be done?
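
One low-friction pattern (a sketch of the general idea, not necessarily Sandeep’s approach) is the pandas API on Spark, which lets the same dataframe code run on either engine:

```python
# Flip this flag depending on whether you're in a Python or PySpark notebook
USE_SPARK = False

if USE_SPARK:
    import pyspark.pandas as pd  # distributed pandas-like API on Spark
else:
    import pandas as pd          # single-node pandas, cheaper on CUs

# Hypothetical lakehouse path; the dataframe code below is engine-agnostic
df = pd.read_parquet("/lakehouse/default/Files/sales.parquet")
print(df.groupby("region")["amount"].sum())
```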

Read on for the answer, as well as plenty of notes around it.

Scanning Fabric Workspaces via Semantic Link Labs

Sandeep Pawar takes us through the Scanner API:

It’s finally here! Thanks to Michael Kovalsky, one of the most requested and anticipated APIs is now available in Semantic Link Labs (v0.8.10): the Scanner API. The Scanner API in the Fabric Admin REST APIs allows Fabric administrators to retrieve detailed metadata about their organization’s Fabric items, supporting governance and compliance efforts. It provides information such as item names, descriptions, creation dates, lineage, connection strings, etc. It’s not new; we have been using it in Power BI for a long time, but in the Fabric world it’s even more important given the number of items and configurations.
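
Under the hood this is the admin workspace-scanning flow (getInfo, then poll scanStatus, then fetch scanResult). A rough REST-level sketch, assuming sempy’s PowerBIRestClient and a placeholder workspace ID; Semantic Link Labs wraps all of this for you:

```python
import time
import sempy.fabric as fabric

client = fabric.PowerBIRestClient()  # assumes a Fabric notebook session

# Kick off a scan for one workspace (placeholder ID)
scan = client.post(
    "v1.0/myorg/admin/workspaces/getInfo?lineage=True&datasourceDetails=True",
    json={"workspaces": ["<workspace-id>"]},
).json()

# Poll until the scan finishes
status_url = f"v1.0/myorg/admin/workspaces/scanStatus/{scan['id']}"
while client.get(status_url).json()["status"] != "Succeeded":
    time.sleep(5)

# Fetch the metadata payload
result = client.get(f"v1.0/myorg/admin/workspaces/scanResult/{scan['id']}").json()
print(result["workspaces"][0].keys())
```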

Read on to see what’s available and how this works.

Building and Deploying a Streamlit Data App

Ivan Palomares Carrascosa deploys an app:

This article will navigate you through the deployment of a simple machine learning (ML) model for regression using Streamlit. This platform streamlines and simplifies deploying artifacts like ML systems as web services.

I’ll leave aside my aside that linear regression isn’t machine learning. Click through to see how you can build a simple application in approximately 60 lines of code. This example shows off some of the simplicity in Streamlit’s design.
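
For reference, a complete Streamlit regression app really can fit in a couple dozen lines (a self-contained sketch with synthetic data, not the article’s exact app):

```python
# app.py -- run with: streamlit run app.py
import numpy as np
import streamlit as st
from sklearn.linear_model import LinearRegression

st.title("Simple Regression Demo")

# Synthetic training data so the example is self-contained
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.5 * X.ravel() + rng.normal(0, 2, size=200)
model = LinearRegression().fit(X, y)

# Interactive input and prediction
x_new = st.slider("Feature value", 0.0, 10.0, 5.0)
st.write(f"Predicted target: {model.predict([[x_new]])[0]:.2f}")
```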
