
Category: Machine Learning

Poisson Hidden Markov Models in SAS

Ji Shen shows how to model count time series in SAS:

The HMM procedure in SAS Viya supports hidden Markov models (HMMs) and other models embedded with HMM. PROC HMM supports finite HMM, Poisson HMM, Gaussian HMM, Gaussian mixture HMM, the regime-switching regression model, and the regime-switching autoregression model. This post introduces Poisson HMM, the latest addition to PROC HMM in the SAS Viya 2023.03 release.

Count time series is ill-suited for most traditional time series analysis techniques, which assume that the time series values are continuously distributed. This can present unique challenges for organizations that need to model and forecast them. As a popular discrete probability distribution to handle the count time series, the Poisson distribution or the mixed Poisson distribution might not always be suitable. This is because both assume that the events occur independently of each other and at a constant rate. In time series data, however, the occurrence of an event at one point in time might be related to the occurrence of an event at another point in time, and the rates at which events occur might vary over time.

HMM is a valuable tool that can handle overdispersion and serial dependence in the data. This makes it an effective solution for modeling and forecasting count time series. We will explain how the Poisson HMM can handle count time series by modeling different states by using distinct Poisson distributions while considering the probability of transitioning between them.

Read on for an overview of Hidden Markov Models (in general and the Poisson variation in particular) and some of the challenges you can run into when fitting these models.
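
To make the overdispersion point concrete, here is a minimal Python sketch (not PROC HMM, and not code from the linked post). It simulates counts from a two-state Poisson HMM, checks that the sample variance exceeds the sample mean, and compares the HMM log-likelihood from the forward algorithm against a single Poisson fit. The transition matrix, the rates, and every function name are assumptions made up for illustration.

```python
import math
import random


def poisson_logpmf(k, lam):
    """Log of the Poisson probability mass function."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def poisson_draw(lam, rng):
    """Draw one Poisson count (Knuth's method; fine for small rates)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_poisson_hmm(n, init, trans, rates, rng):
    """Draw n counts from a Poisson HMM; returns (states, counts)."""
    states, counts = [], []
    state = rng.choices(range(len(init)), weights=init)[0]
    for _ in range(n):
        states.append(state)
        counts.append(poisson_draw(rates[state], rng))
        state = rng.choices(range(len(init)), weights=trans[state])[0]
    return states, counts


def forward_loglik(counts, init, trans, rates):
    """Log-likelihood of the counts via the scaled forward algorithm."""
    m = len(init)
    alpha = [init[j] * math.exp(poisson_logpmf(counts[0], rates[j]))
             for j in range(m)]
    loglik = 0.0
    for t in range(len(counts)):
        if t > 0:
            # Propagate through the transition matrix, then weight by
            # each state's Poisson probability of the observed count.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(m))
                * math.exp(poisson_logpmf(counts[t], rates[j]))
                for j in range(m)
            ]
        scale = sum(alpha)  # rescale each step to avoid underflow
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik


# Illustrative (assumed) parameters: a quiet state and a busy state.
rng = random.Random(0)
init = [0.5, 0.5]
trans = [[0.95, 0.05], [0.10, 0.90]]
rates = [2.0, 12.0]

states, counts = simulate_poisson_hmm(500, init, trans, rates, rng)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)

hmm_ll = forward_loglik(counts, init, trans, rates)
# A one-state "HMM" is just an ordinary Poisson model with a constant rate.
single_ll = forward_loglik(counts, [1.0], [[1.0]], [mean])

print(f"mean={mean:.2f} variance={var:.2f}")
print(f"log-likelihood: HMM={hmm_ll:.1f} single Poisson={single_ll:.1f}")
```

A plain Poisson model forces the variance to equal the mean, so it cannot describe data like this. The state-switching model accounts for both the extra variance and the runs of high and low counts.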


Paper Review: Moving Fast with Broken Data

Adnan Masood reviews a paper:

I recently came across an insightful research paper titled “Moving Fast With Broken Data” by Shreya Shankar, Labib Fawaz, Karl Gyllstrom, and Aditya G. Parameswaran from UC Berkeley and Meta. The paper addresses the significant issue of data corruption in machine learning (ML) pipelines, which often leads to decreased model accuracy. The authors present an automatic data validation system implemented at Meta that aims to solve this problem.

Sounds like I have some beach reading.

Ed. Note: He’s kidding, right?

Ed. 2 Note: About going to the beach maybe.

Ed. & Ed. 2 Note: HAHAHAHAHAH.

Yeah, I hired Statler and Waldorf as my editors. Worst Best decision of my life.


Building a Model with shiny and tidyAML

Steven Sanderson has a series on using the tidyAML Model Builder. Part 1 builds a simple model:

The first reactive expression, data, reads in the data file uploaded by the user or selects a built-in dataset, depending on which option the user chooses. If the user uploads a file, the read.csv() function is used to read the data file into a data frame. If the user selects a built-in dataset, the get() function is used to retrieve the data frame associated with that dataset. In both cases, the column names of the data frame are used to update the choices in the predictor_col select input, so that the user can select which column to use as the predictor variable.

Part 2 builds on it by adding new regression algorithms:

Yesterday I spoke about building tidymodels models using my package {tidyAML} and {shiny}. I have made an update to it, and will continue to make updates to it this week.

I have added all of the supported engines for regression problems only, NOT classification yet; that will be tomorrow’s work. I will then add a drop down for users to pick which backend function they want to use from {parsnip}, like linear_reg().


Hybrid ML and Rules-Based Fraud Detection

Ayodeji Ogunlami mixes approaches:

In developing this hybrid system, sets of rules are required as well as a machine learning model. I would be making use of a vehicle insurance dataset from Kaggle in this demonstration.

The dataset can be downloaded from this link:

The ML model would be built using a random forest classifier on Azure Databricks using PySpark.

This seems to be the most sensible approach, especially given how rare actual fraud incidents are and what that imbalance does to classification algorithms.
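
As a rough illustration of the hybrid idea, here is a minimal Python sketch in which a claim is flagged if any hand-written rule fires or if a model score crosses a threshold. Everything in it is an assumption: the field names, the two rules, the 0.5 threshold, and the OR-style combination are invented for the example, and `stand_in_model_score` is only a placeholder for the random forest's predicted fraud probability, not the post's actual model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Claim = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    fires: Callable[[Claim], bool]


# Hypothetical business rules; a real system would take these from
# domain experts rather than from this sketch.
RULES: List[Rule] = [
    Rule("claim_exceeds_vehicle_value",
         lambda c: c["claim_amount"] > c["vehicle_value"]),
    Rule("claim_soon_after_policy_start",
         lambda c: c["days_since_policy_start"] < 7),
]


def stand_in_model_score(claim: Claim) -> float:
    """Placeholder for a trained classifier's fraud probability."""
    ratio = claim["claim_amount"] / claim["vehicle_value"]
    return max(0.0, min(1.0, 0.5 * ratio + 0.15 * claim["past_claims"]))


def assess(claim: Claim,
           score_fn: Callable[[Claim], float] = stand_in_model_score,
           threshold: float = 0.5) -> dict:
    """Flag a claim if any rule fires or the model score is high."""
    fired = [rule.name for rule in RULES if rule.fires(claim)]
    score = score_fn(claim)
    return {
        "flagged": bool(fired) or score >= threshold,
        "rules_fired": fired,
        "model_score": score,
    }


routine = {"claim_amount": 1000, "vehicle_value": 20000,
           "days_since_policy_start": 400, "past_claims": 0}
over_value = {"claim_amount": 25000, "vehicle_value": 20000,
              "days_since_policy_start": 400, "past_claims": 0}
repeat_claimant = {"claim_amount": 15000, "vehicle_value": 20000,
                   "days_since_policy_start": 100, "past_claims": 3}

for label, claim in [("routine", routine), ("over_value", over_value),
                     ("repeat_claimant", repeat_claimant)]:
    print(label, assess(claim))
```

Returning the rules that fired alongside the score keeps each decision explainable: the rules catch known patterns outright, while the model picks up cases no rule anticipated.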


Improving the Robustness of ML Model Deployment

Alexander Billington shares a few tools and tips:

Machine learning (ML) model deployment is a critical part of the MLOps lifecycle, and it can be a challenging process. In the previous blog, we explored how Azure Functions can simplify the deployment process. However, there are many other factors to consider when deploying ML models to production environments. In this blog, we’ll delve deeper into some of the essential hints and tips for more robust model deployments. We’ll look at topics such as proper model versioning and packaging, data validation, and performative code optimisations. By implementing these practices, data scientists and ML engineers can ensure their models are deployed efficiently, accurately, and with minimal downtime.

MLflow is definitely a good recommendation, as is Pydantic (which is on my to-learn list…one of these days).
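
For a feel of what data validation at the model boundary looks like, here is a minimal sketch using only the Python standard library. Pydantic expresses the same checks declaratively, with far less code. The payload shape, the field names, and the expected feature count are hypothetical and are not taken from the linked post.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

N_FEATURES = 3  # assumed input width of the hypothetical model


@dataclass(frozen=True)
class PredictionRequest:
    """Hypothetical payload for a model-scoring endpoint."""
    model_version: str
    features: Tuple[float, ...]

    def __post_init__(self):
        if not self.model_version:
            raise ValueError("model_version must be non-empty")
        if len(self.features) != N_FEATURES:
            raise ValueError(
                f"expected {N_FEATURES} features, got {len(self.features)}")
        for x in self.features:
            # Reject non-numbers, booleans, NaN, and infinities.
            if (not isinstance(x, (int, float)) or isinstance(x, bool)
                    or not math.isfinite(x)):
                raise ValueError(f"invalid feature value: {x!r}")


def parse_request(payload: dict) -> PredictionRequest:
    """Validate a raw dict before it gets anywhere near the model."""
    missing = {"model_version", "features"} - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return PredictionRequest(str(payload["model_version"]),
                             tuple(payload["features"]))


def validation_error(payload: dict) -> Optional[str]:
    """Return the validation message, or None if the payload is valid."""
    try:
        parse_request(payload)
    except ValueError as exc:
        return str(exc)
    return None


print(validation_error({"model_version": "1.2.0", "features": [0.1, 2, 3.5]}))
print(validation_error({"model_version": "1.2.0", "features": [0.1, 2]}))
print(validation_error({"features": [0.1, float("nan"), 3.5]}))
```

Rejecting malformed input at the edge means bad requests fail with a clear message instead of producing a silently wrong prediction.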


Model Deployment using Azure Functions

Alexander Billington needs to get that new model out:

Deploying machine learning (ML) models into production can be challenging, as it requires careful consideration of various factors such as scalability, reliability, and maintainability. While developing an ML model is an exciting process, deploying it into production can be a daunting task. The challenges faced in productionising data science projects can range from infrastructure to version control, model monitoring to integration with other systems. This blog will take a look at how Azure Functions can simplify the deployment process, getting models into production quickly and robustly to maximise their value.

I like this approach: most of the time, the MLOps model Microsoft recommends has you running Azure DevOps pipelines or GitHub Actions on a schedule or when new training data lands in a specific folder. If you have some non-standard trigger for an action, this is a good way to get going.


Visualizing PyTorch Models

Adrian Tam describes a model:

PyTorch is a deep learning library. You can build very sophisticated deep learning models with PyTorch. However, there are times you want to have a graphical representation of your model architecture. In this post, you will learn:

  • How to save your PyTorch model in an exchange format
  • How to use Netron to create a graphical representation

Click through for the article, which is mostly about training the PyTorch model. Visualizing it turns out to be pretty easy with the right tool.


Synapse and Azure ML Pipelines

Santosh Thomas integrates two Azure products:

As more customers standardize on the Synapse data platform, enabling machine learning workflows through Azure Machine Learning (Azure ML) becomes particularly interesting. This is especially true as more customers look to bring their data engineering and data science practices together and mature capabilities on both sides.

The goal of this blog post is to highlight how Synapse and Azure ML can work well together to deliver key insights. This is motivated by a scenario where a customer modernized their data platform on Azure Synapse but was looking to improve their data science practices through Azure ML. The focus of this blog is to expose existing functionality, and it is not a “hardened” solution with security or other cloud best practice implementations. The workflow steps also assume some level of comfort with Python and working with the Azure Python SDKs.

There was a time when Microsoft wanted us to remain in Synapse for machine learning tasks, but that time is gone: the emphasis now is on doing machine learning in Azure ML, regardless of where the data lives…unless there’s a Spark job involved, in which case things get all weird again.


Azure ML Overview

Sanil Mhatre gets us started with Azure Machine Learning:

The five-part series is designed to jump-start any IT professional’s journey in the fascinating world of Data Science with Azure Machine Learning (Azure ML). Readers don’t need prior knowledge of Data Science, Machine Learning, Statistics, or Azure to begin this adventure.

All you will need is an Azure subscription and I will show you how to get a free one that you can use to explore some of Azure’s features before I show you how to set up the Azure ML environment.

Part 1 is available now, with the other parts coming up soon. Even so, Part 1 is a big article on its own.


Dealing with Imbalanced Class Data for Image Classification

Alexander Billington needs more beta carotene:

Image classification is a standard computer vision task and involves training a model to assign a label to a given image, such as a model to classify images of different root vegetables. A big problem with classification is bias, where a model favours a particular image class above the others. A common cause of this is dataset imbalance, which is often hard to spot because a model trained on an imbalanced dataset can still have good accuracy. For example, if there are 1000 images in the test dataset, 950 potatoes and 50 carrots, and the model predicted all 1000 images to be potatoes, it would still have 95% accuracy. This is also an example of why more metrics than accuracy should be considered… but let’s leave that discussion for another day.

Click through for several techniques you can use to balance out classes, with a focus on image classification. Undersampling is almost always a no-go for me, though I am much fonder of the other techniques.
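
The potato-and-carrot arithmetic from the excerpt is easy to check. The short Python sketch below rebuilds that exact scenario and computes accuracy alongside per-class recall and balanced accuracy. The helper functions are written here for illustration rather than taken from the linked post.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, cls):
    """Fraction of true members of cls that the model identified."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in relevant) / len(relevant)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls; every class counts equally."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, c) for c in classes) / len(classes)


# The scenario from the excerpt: 950 potatoes, 50 carrots, and a model
# that predicts "potato" for every single image.
y_true = ["potato"] * 950 + ["carrot"] * 50
y_pred = ["potato"] * 1000

print(accuracy(y_true, y_pred))           # 0.95
print(recall(y_true, y_pred, "potato"))   # 1.0
print(recall(y_true, y_pred, "carrot"))   # 0.0
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Accuracy looks healthy at 95%, but carrot recall is zero and balanced accuracy is no better than a coin flip, which is why an imbalanced test set calls for more than one metric.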
