Press "Enter" to skip to content

Category: Machine Learning

The Dual Perils of Overfitting and Data Leakage

John Mount shares notes on a theme:

One of the bigger risks of iterative statistical or machine learning fitting procedures is over-fit or the dreaded data leak.

Over-fit is when a model performs better on training data than on future data. Some degree of over-fit is expected. A data leak is when the model learns things about the evaluation set that it would not know about the future data the model will be applied to. This can drive models that look great on training and (supposedly) held-out data, but don’t work in practice.

Click through for the rest of the story, and be sure to check out the comments for a notebook digging further into one of the topics.
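To make the leakage idea concrete, here is a minimal sketch (mine, not John’s) of one common culprit: fitting a scaler on the full dataset before the train/test split, versus keeping every fitted step inside a pipeline trained only on the training fold.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data; the specific dataset doesn't matter for the point.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Leaky: the scaler is fit on ALL rows, so test-set statistics leak into training.
scaler = StandardScaler().fit(X)
leaky_model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
print("leaky score:", leaky_model.score(scaler.transform(X_test), y_test))

# Safer: keep every fitted step in a pipeline trained only on the training fold.
clean_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_model.fit(X_train, y_train)
print("clean score:", clean_model.score(X_test, y_test))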


Model Diagnostics for Statistics vs Machine Learning

Christian Lorentzen talks diagnostics:

In this post, we show how different use cases require different model diagnostics. In short, we compare (statistical) inference and prediction.

As an example, we use a simple linear model for the Munich rent index dataset, which was kindly provided by the authors of Regression – Models, Methods and Applications, 2nd ed. (2021). This dataset contains monthly rents in EUR (rent) for about 3000 apartments in Munich, Germany, from 1999.

Read on to learn more about this dataset and how the mindset differs if you’re thinking about inference versus prediction.
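As a rough illustration of the two mindsets, here is a sketch on synthetic rent-like data, since the Munich rent index file isn’t bundled here; the column names and coefficients are invented.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rent data: rent as a linear function of area plus noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({"area": rng.uniform(20, 120, 3000)})
df["rent"] = 150 + 7.5 * df["area"] + rng.normal(0, 80, len(df))

# Inference mindset: fit on everything, then inspect coefficients, standard errors, p-values.
print(smf.ols("rent ~ area", data=df).fit().summary())

# Prediction mindset: hold out data and judge the model by out-of-sample error instead.
train, test = train_test_split(df, random_state=0)
fit = smf.ols("rent ~ area", data=train).fit()
mae = np.mean(np.abs(test["rent"] - fit.predict(test)))
print(f"holdout MAE: {mae:.1f} EUR")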


Goodbye, Azure ML SDK v1

I have a new video:

In this video, I cover some news from Microsoft around the deprecation of the Azure Machine Learning SDK v1. We’ll take a look at the upgrade guide and see what it will take to perform this upgrade.

Microsoft will still support the SDK v1 until September of 2026, so we have a year to get code sorted out. The CLI v1, however, will go away sooner, so be sure you’re keeping up on that.
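For a flavor of the kind of change involved, here is a sketch of just the workspace-connection step in SDK v1 versus v2; the subscription, resource group, and workspace names are placeholders, and the full upgrade covers much more than this.

# SDK v1 (deprecated; supported until September 2026)
from azureml.core import Workspace

ws = Workspace.get(
    name="my-workspace",
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
)

# SDK v2 replacement
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="my-workspace",
)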


Fine-Tuning a DistilBERT Model for Question Answering

Muhammad Asad Iqbal Khan builds upon a simple model:

The transformers library provides a clean and well-documented interface for many popular transformer models. Not only does it make the source code easier to read and understand, it also provides a standardized way to interact with the model. You have seen in the previous post how to use a model such as DistilBERT for natural language processing tasks. In this post, you will learn how to fine-tune the model for your own purpose. This expands the use of the model from inference to training. Specifically, you will learn:

  • How to prepare the dataset for training
  • How to train a model using a helper library

DistilBERT is a major simplification of BERT, but it comes with the advantage that it’s very easy to train on modest hardware and performance is in the same realm of acceptability as the full BERT model. Switching from DistilBERT to BERT isn’t as easy as just swapping out model classes, though it’s pretty close.
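For a sense of what the training side looks like with the Trainer helper, here is a minimal sketch; the two-example dataset is invented purely to keep the code self-contained, whereas the post works through a real question answering dataset.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForQuestionAnswering,
                          TrainingArguments, Trainer)

checkpoint = "distilbert-base-uncased"   # swap in "bert-base-uncased" to compare with full BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

# Tiny invented dataset, just so the sketch runs end to end.
examples = [
    {"question": "Where is the Eiffel Tower?",
     "context": "The Eiffel Tower is in Paris.", "answer": "Paris"},
    {"question": "What color is the sky?",
     "context": "On a clear day the sky is blue.", "answer": "blue"},
]

def encode(ex):
    enc = tokenizer(ex["question"], ex["context"], truncation=True,
                    padding="max_length", max_length=64,
                    return_offsets_mapping=True)
    # Map the answer's character span in the context to token start/end positions.
    start_char = ex["context"].index(ex["answer"])
    end_char = start_char + len(ex["answer"])
    sequence_ids = enc.sequence_ids()
    start_tok = end_tok = 0
    for i, (s, e) in enumerate(enc["offset_mapping"]):
        if sequence_ids[i] != 1:   # only look at context tokens
            continue
        if s <= start_char < e:
            start_tok = i
        if s < end_char <= e:
            end_tok = i
    enc["start_positions"] = start_tok
    enc["end_positions"] = end_tok
    enc.pop("offset_mapping")
    return enc

dataset = Dataset.from_list(examples).map(
    encode, remove_columns=["question", "context", "answer"])

args = TrainingArguments(output_dir="distilbert-qa", num_train_epochs=1,
                         per_device_train_batch_size=2, report_to="none")
Trainer(model=model, args=args, train_dataset=dataset).train()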


Serving Databricks Models via API Management Endpoints

Drew Furgiuele makes available a model:

When it comes to generative AI projects I’d argue that the hardest and most tedious part has moved into a new area: hosting and serving your models. Whether you’re working with CPU intensive models, or models that require GPU horsepower, sourcing the hardware, building out deployment pipelines, configuring monitoring, and then securing everything is real, serious work that requires everyone to lean in to get it right.

And then, there’s the real question of how you’re going to use those models: will you be setting up automation and doing batch processing using your models and infrastructure? Or do you want to get really serious and offer up real-time inference? If the latter, you can add one more thing to solve for: managing the front-end APIs you will have to build to support that use case.

Click through to see how you can use an API management tool (like Azure API Management) to help with these tasks.
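For reference, here is a minimal sketch of what the real-time inference call might look like against a Databricks model serving endpoint; the workspace URL, endpoint name, token, and feature names are placeholders, and in Drew’s setup the request would go through the API Management front end rather than straight to the workspace.

import requests

workspace_url = "https://<workspace>.azuredatabricks.net"   # placeholder
endpoint_name = "my-model"                                   # hypothetical endpoint name
token = "<databricks-or-apim-token>"                         # placeholder credential

# Databricks serving endpoints accept an MLflow-style JSON payload on /invocations.
response = requests.post(
    f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations",
    headers={"Authorization": f"Bearer {token}",
             "Content-Type": "application/json"},
    json={"dataframe_records": [{"feature_1": 1.5, "feature_2": "a"}]},  # hypothetical features
    timeout=30,
)
print(response.json())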


Local Text Summarization via DistilBart

Muhammad Asad Iqbal Khan summarizes a document:

Text summarization represents a sophisticated evolution of text generation, requiring a deep understanding of content and context. With encoder-decoder transformer models like DistilBart, you can now create summaries that capture the essence of longer text while maintaining coherence and relevance.

In this tutorial, you’ll discover how to implement text summarization using DistilBart. You’ll learn through practical, executable examples, and by the end of this guide, you’ll understand both the theoretical foundations and hands-on implementation details. After completing this tutorial, you will know:

Click through for the article.
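As a quick taste, here is a short sketch along the same lines, using the public sshleifer/distilbart-cnn-12-6 checkpoint through the transformers pipeline API; the input paragraph is just an example.

from transformers import pipeline

# Load a distilled BART model fine-tuned for summarization.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

document = (
    "Text summarization condenses a longer document into a short passage that "
    "keeps the main points. Encoder-decoder transformer models such as DistilBart "
    "read the full input with the encoder and generate the summary token by token "
    "with the decoder, which lets them produce fluent, abstractive summaries "
    "rather than simply extracting sentences."
)

print(summarizer(document, max_length=60, min_length=20, do_sample=False)[0]["summary_text"])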


Kernel Methods in Python

Matthew Mayo does a bit of kernel work:

Kernel methods are a powerful class of machine learning algorithms that allow us to perform complex, non-linear transformations of data without explicitly computing the transformed feature space. These methods are particularly useful when dealing with high-dimensional data or when the relationship between features is non-linear.

Kernel methods rely on the concept of a kernel function, which computes the dot product of two vectors in a transformed feature space without explicitly performing the transformation. This is known as the kernel trick. The kernel trick allows us to work in high-dimensional spaces efficiently, making it possible to solve complex problems that would be computationally infeasible otherwise.

Read on for the pros and cons of kernel methods and a pair of techniques that use them.
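As a tiny numeric illustration of the kernel trick (mine, not from the article): the degree-2 polynomial kernel on two-dimensional vectors equals an ordinary dot product in an explicit three-dimensional feature space, which we never actually have to construct.

import numpy as np

def phi(x):
    # Explicit (and in general expensive) feature map for the degree-2 polynomial kernel.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z):
    # Same quantity computed directly in the original 2-D space: k(x, z) = (x . z)^2.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))   # 16.0, via the explicit feature map
print(poly_kernel(x, z))        # 16.0, via the kernel trick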
