Press "Enter" to skip to content

Category: Machine Learning

Using Filter Based Feature Selection in Text Analytics

Dinesh Asanka takes us through a text analytics technique in Azure Machine Learning:

There are two parameters to define in the Feature Hashing control. Hashing bitsize defines the maximum number of vectors: a hashing bitsize of 10 means 1,024 vectors (2^10), which is more than enough even for large text files. Next, we need to choose the N-grams value, which is 2 here, as 2 is the optimal number for most situations. A detailed description of N-grams is given in the link in the reference section.

After the vectors are generated, we no longer need the other text columns. Apart from the vectors, we need only the dependent attribute, the category column in this example. Therefore, we can remove the unnecessary attributes with the Select Columns in Dataset control. However, this control will list all 1,024 vectors even though not all of them were produced in the previous Feature Hashing step, so you need to select only the attributes the Feature Hashing control actually generated. In the above example, only 93 vectors were generated.
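For a sense of what feature hashing does outside of Azure ML, here is a minimal sketch using scikit-learn's HashingVectorizer; the 2^10 feature count and the N-gram setting mirror the parameters above, while the sample documents are invented for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Invented sample documents, standing in for the text column
docs = [
    "the stock market rallied today",
    "the team won the championship game",
]

# Hashing bitsize of 10 -> 2^10 = 1,024 feature slots;
# ngram_range=(1, 2) hashes unigrams and bigrams (N-grams up to 2)
vectorizer = HashingVectorizer(n_features=2**10, ngram_range=(1, 2))
X = vectorizer.transform(docs)

print(X.shape)  # (2, 1024): every row has 1,024 slots, but few are non-zero
```

As with the Azure ML control, all 1,024 columns exist in the output, but only the slots some N-gram actually hashed into carry values, which is why the example above ends up with just 93 populated vectors.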

Click through to learn more.


Monitoring Power Virtual Agent Chatbots

Devin Knight has a video for us:

Power Virtual Agents empowers subject matter experts to build intelligent conversational bots using a guided, no-code graphical interface. In this video, you will learn how to monitor how successful your chatbots are at answering your users' questions. Using the monitoring capability, you will uncover areas of your chatbot that can be improved.

If I were familiar enough with Latin, I’d try a play on “Quis custodiet ipsos custodes?” with this.


Declarative MLOps with Ludwig

Jacqueline Cardoso announces a new version of Ludwig:

Ludwig abstracts away the complexity of combining all of these disparate systems through its declarative approach to structuring machine learning pipelines. Instead of writing code for your model, training loop, preprocessing, postprocessing, evaluation, and hyperparameter optimization, you only need to declare the schema of your data as a simple YAML configuration:
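The configuration example itself is in the original post; as a rough illustration, here is a minimal sketch of the same kind of declaration through Ludwig's Python API, with hypothetical column names (text_content and label) rather than the post's actual schema.

```python
from ludwig.api import LudwigModel

# Declarative pipeline: describe the data schema, not the training code.
# The column names here are hypothetical.
config = {
    "input_features": [
        {"name": "text_content", "type": "text"},
    ],
    "output_features": [
        {"name": "label", "type": "category"},
    ],
}

model = LudwigModel(config)

# Ludwig derives preprocessing, the model, the training loop, and
# evaluation from the declaration alone
train_stats, _, output_dir = model.train(dataset="training_data.csv")
```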

I’ve long been a fan of declarative approaches to problem-solving, so I’m going to need to dig into this a bit.


Using Lobe for Training ML Models

Chris Webb reviews a free tool from Microsoft:

The most impressive thing about it is not what it does but how it does it: a lot of tools claim to make machine learning easy for non-technical users, but Lobe really is easy to use. My AI/ML knowledge is very basic, but I got up and running with it extremely quickly.

To test it out I downloaded lots of pictures of English churches and trained a model to detect whether the church had a tower or a spire. After I labelled the pictures appropriately:

Click through for Chris’s findings. Looks like the only thing it does today is image classification, but more functionality is forthcoming.


AI versus ML versus Deep Learning

Holger von Jouanne-Diedrich asks the expert:

This is our 101st blog post here on Learning Machines and we have prepared something very special for you!

Oftentimes the different concepts of data science, namely artificial intelligence (AI), machine learning (ML), and deep learning (DL), are confused… so we asked the most advanced AI in the world, OpenAI GPT-3, to write a guest post for us to provide some clarification on their definitions and how they are related.

We are most delighted to present this very impressive (and only slightly redacted) essay to you – enjoy!

The machine has learned about itself. This is where I’m glad I only believe weak AI is possible…


Plotting XGBoost Trees with R

Andrew Treadway shows off a method to visualize the results of training an XGBoost model:

In this post, we’re going to cover how to plot XGBoost trees in R. XGBoost is a very popular machine learning algorithm, which is frequently used in Kaggle competitions and has many practical use cases.

Let’s start by loading the packages we’ll need. Note that plotting XGBoost trees requires the DiagrammeR package to be installed, so even if you have xgboost installed already, you’ll need to make sure you have DiagrammeR also.
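The post works in R; for comparison, Python's xgboost package exposes the same capability through plot_tree, which likewise leans on an extra dependency (graphviz and matplotlib instead of DiagrammeR). A minimal sketch on a toy dataset:

```python
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

# Toy dataset, standing in for the data used in the post
X, y = load_breast_cancer(return_X_y=True)

model = xgb.XGBClassifier(n_estimators=10, max_depth=3)
model.fit(X, y)

# Render the first tree in the ensemble; this needs the graphviz
# system package in addition to the Python libraries above
xgb.plot_tree(model, num_trees=0)
plt.show()
```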

Click through for the process. H/T R-Bloggers.


Predicting Insurance Prices with ML.NET

Chandra Kudumula shows off ML.NET:

There are three ways to begin with ML.NET:

– API Model: You can start with ML.NET through a framework API and write code in C# or F#.
– GUI Model: Use the ML.NET Model Builder in Visual Studio.
– CLI Model: For cross-platform development on Mac and Linux, use the ML.NET CLI.

Let’s get started with the API Model for predicting the insurance premium using the ML.NET framework.

I’m using Microsoft (MS) Visual Studio 2019 and creating a Console Application. Be sure that you have the latest version of VS and that the .NET 5 SDK is installed.
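The demo itself is C# against the ML.NET API; purely as a sketch of the analogous load-train-predict flow, here is the same shape of workflow in Python with scikit-learn, using made-up insurance columns rather than the post's schema.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical insurance data; the columns are illustrative only
df = pd.DataFrame({
    "age":     [25, 40, 33, 58, 47, 29],
    "bmi":     [22.1, 30.5, 27.3, 33.0, 25.8, 24.4],
    "smoker":  [0, 1, 0, 1, 0, 0],
    "premium": [2100.0, 6800.0, 3500.0, 9100.0, 3900.0, 2600.0],
})

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="premium"), df["premium"],
    test_size=0.33, random_state=42,
)

# Train a regressor and predict premiums, mirroring the
# Fit/Predict steps of an ML.NET pipeline
model = GradientBoostingRegressor().fit(X_train, y_train)
print(model.predict(X_test))
```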

Click through for the demo in Visual Studio using C#.


Hyperparameter Tuning in Azure Machine Learning

Dinesh Asanka takes us through hyperparameter tuning with Azure Machine Learning’s designer:

In the above experiment, both the previous model and the Tune Model Hyperparameters (TMH) version are included so that we can compare the two models. The Tune Model Hyperparameters control is inserted between the Split Data and Score Model controls as shown. The TMH control has three inputs: the first takes the relevant technique, in this scenario Two-Class Logistic Regression; the second takes the training data set; and the last takes the evaluation data set, for which the test data set can be used.

The Tune Model Hyperparameters control provides the best combination of hyperparameters, and its output is connected to the Score Model control. After the test data stream is connected to the Score Model control, the model's output is connected to the second input of the Evaluate Model control so that the previous model and the tuned model can be compared.

I’m not sure if there’s something handled internally in the Tune Model Hyperparameters component, but based on the pipeline images, I’d actually want two separate Split components so that I ended up with something more like 50-20-30 for training, hyperparameter testing, and validation. The first two pipelines appear to be 70-30-0 instead, and so can give you a false sense of confidence in model quality.
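As a rough Python sketch of that 50-20-30 idea (illustrative only, not how the designer wires things internally): two successive splits yield separate sets for training, hyperparameter testing, and final validation, so the quality estimate comes from data the tuning never saw.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First split: hold 30% back for final validation
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.30, random_state=42)

# Second split: 2/7 of the remaining 70% (20% of the total) goes to
# hyperparameter testing, leaving 50% of the data for training
X_train, X_tune, y_train, y_tune = train_test_split(X_rest, y_rest, test_size=2 / 7, random_state=42)

# Pick the regularization strength that scores best on the tuning set
best_c, best_score = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:
    score = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train).score(X_tune, y_tune)
    if score > best_score:
        best_c, best_score = c, score

# Only the untouched validation set gives an honest quality estimate
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
print(f"C={best_c}, validation accuracy={final_model.score(X_val, y_val):.3f}")
```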


Drift Monitoring with Azure Machine Learning

I take a look at dataset drift monitoring in Azure Machine Learning:

One of the things I like to say about machine learning models is, “shift happens.” By that, I mean that models lose effectiveness over time due to changes in underlying circumstances. Relationships between variables that used to hold no longer do, and so our model quality degrades. This means that we sometimes need to retrain models.

But there’s a cost to retraining models—that work can be computationally expensive and time-consuming. This concern is particularly salient if you’re in a cloud, as you pay directly for everything there. This means that we don’t want to retrain models unless we need to. But when do we know if we should retrain the model? We can watch for model degradation, but there’s another method: drift detection in your datasets.
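Azure ML's dataset monitors handle the measurement for you; as a bare-bones illustration of the underlying idea, here is a hypothetical per-feature drift check using a two-sample Kolmogorov-Smirnov test (one common approach, not necessarily the metric Azure ML applies).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Baseline: synthetic stand-in for the data the model trained on
baseline = rng.normal(loc=0.0, scale=1.0, size=1000)

# Target: recent production data whose distribution has shifted
recent = rng.normal(loc=0.5, scale=1.2, size=1000)

# Two-sample KS test: a small p-value suggests the distributions differ
stat, p_value = ks_2samp(baseline, recent)
if p_value < 0.01:
    print(f"Drift detected (KS statistic = {stat:.3f}); consider retraining")
else:
    print("No significant drift; skip the retrain and save the compute")
```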

Read on for a demonstration of how the product works and a couple of things to keep in mind.
