Category: Machine Learning

Topic Modeling with Python

Sanil Mhatre takes us through topic modeling:

Topic modeling is a powerful Natural Language Processing technique for finding relationships among data in text documents. It falls under the category of unsupervised learning and works by representing a text document as a collection of topics (set of keywords) that best represent the prevalent contents of that document. This article will focus on a probabilistic modeling approach called Latent Dirichlet Allocation (LDA), by walking readers through topic modeling using the team health demo dataset. Demonstrations will use Python and a Jupyter notebook running on Anaconda. Please follow instructions from the “Initial setup” section of the previous article to install Anaconda and set up a Jupyter notebook.

The second article of this series, Text Mining and Sentiment Analysis: Power BI Visualizations, introduced readers to the Word Cloud, a common technique to represent the frequency of keywords in a body of text. Word Cloud is an image composed of keywords found within a body of text, where the size of each word indicates its frequency in that body of text. This technique is limited in its ability to discover underlying topics and themes in the text, because it only relies on the frequency of keywords to determine their popularity. Topic modeling overcomes these limitations and uncovers deeper insights from text data using statistical modeling for discovering the topics (collection of words) that occur in text documents.
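
To make the contrast with plain keyword counting concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, written from scratch in standard-library Python. It is not the code from the linked article (which may well rely on a library); the function names `lda_gibbs` and `top_words`, the hyperparameter defaults, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    doc_topic = [[0] * num_topics for _ in docs]            # tokens per topic, per document
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                          # total tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for word in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1

                # Weight each topic by (topic share in doc) * (word share in topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


def top_words(topic_word, topic, n=3):
    """Return the n most frequent words currently assigned to a topic."""
    counts = topic_word[topic]
    return sorted(counts, key=counts.get, reverse=True)[:n]


# Toy, made-up documents (already tokenized and lower-cased).
docs = [
    ["coach", "team", "match", "goal", "team"],
    ["team", "goal", "season", "match"],
    ["doctor", "health", "clinic", "patient"],
    ["patient", "health", "doctor", "nurse"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k in range(2):
    print(k, top_words(topic_word, k))
```

Unlike a word cloud, which looks only at overall frequency, the sampler groups words by how they co-occur within documents. Results depend on the random seed and the hyperparameters, so treat the printed topics as illustrative only.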

Read on for an informative article with plenty of code.

Comments closed

An Overview of Automatic Text Summarization

Kevin Jacobs looks at the state of the art with respect to automatic text summarization:

Automatic text summarization comes in two flavours: extractive summarization and abstractive summarization. Extractive summarization models take exact phrases from the reference documents and use them as a summary. One of the very first research papers on (extractive) text summarization is the work of Luhn [1]. TextRank [2] (based on the concepts used by the PageRank algorithm) is another widely used extractive summarization model.
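
As a rough illustration of the extractive idea, the sketch below scores each sentence by the average frequency of its non-stopword terms and returns the top-scoring sentences verbatim. It is only in the spirit of frequency-based methods such as Luhn's, not a reimplementation of Luhn's algorithm or of TextRank; the `summarize` function, the tiny stopword list, and the sample text are assumptions.

```python
import re
from collections import Counter

# A deliberately tiny, assumed stopword list; real systems use far larger ones.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "it"}


def tokenize(sentence):
    """Lower-case a sentence and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences, unchanged and in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(word for s in sentences for word in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return [sentences[i] for i in chosen]


text = "Cats sleep a lot. Cats and dogs are pets. The weather is nice."
print(summarize(text, num_sentences=2))
```

Because the output is made of sentences copied from the input, this is extractive by construction; an abstractive model would instead generate new wording.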

In the era of deep learning, abstractive summarization became a reality. With abstractive summarization, a model generates a text instead of using literal phrases of the reference documents. One of the more recent works on abstractive summarization is PEGASUS [3] (a demo is available at HuggingFace).

Click through for a couple of contemporary examples as well as a few pain points you can experience when using the current set of libraries and algorithms.

A Conceptual Discussion of Active Learning

Kevin Jacobs teaches us to learn:

Active Learning is a method in which data is annotated in a smart way. With data annotation, you would normally get to see a randomly selected item which you need to label. This, however, can lead to a lot of repetition of similar items which you have to label. This is a waste of time. A better way would be to use Active Learning. For Active Learning, a batch of random items is selected first. Then, a lightweight classifier is used for evaluating the previously annotated data.

Basically, run your prediction mechanism, find the things about which the mechanism is least certain, and figure those out. Doing this reduces ambiguity and quickly leads to a better model.
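
The "least certain" step can be sketched as least-confidence sampling: rank unlabeled items by the probability of their most likely class and send the lowest-ranked ones to the annotator. This is one common uncertainty measure, not necessarily the one used in the linked post; the function name, item ids, and probabilities below are hypothetical.

```python
def select_most_uncertain(predictions, batch_size):
    """Pick the items whose top predicted class has the lowest probability.

    predictions maps an item id to its list of class probabilities,
    as produced by whatever lightweight classifier is in use.
    """
    ranked = sorted(predictions, key=lambda item: max(predictions[item]))
    return ranked[:batch_size]


# Hypothetical classifier output for four unlabeled items (two classes).
unlabeled_predictions = {
    "item_a": [0.95, 0.05],
    "item_b": [0.55, 0.45],
    "item_c": [0.70, 0.30],
    "item_d": [0.50, 0.50],
}

to_label_next = select_most_uncertain(unlabeled_predictions, batch_size=2)
print(to_label_next)  # ['item_d', 'item_b']
```

After the annotator labels the selected items, the classifier is retrained and the loop repeats, so labeling effort goes where the model is most confused rather than to near-duplicates it already handles well.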

Building a Recommender in Spark

Avinash Sooriyarachchi makes a recommendation:

There has been an exponential increase in the volume and variety of data at our disposal to build recommenders and notable advances in compute and algorithms to utilize in the process. Particularly, the means to store, process and learn from image data has dramatically increased in the past several years. This allows retailers to go beyond simple collaborative filtering algorithms and utilize more complex methods, such as image classification and deep convolutional neural networks, that can take into account the visual similarity of items as an input for making recommendations. This is especially important given online shopping is a largely visual experience and many consumer goods are judged on aesthetics.

In this article, we’ll change the script and show the end-to-end process for training and deploying an image-based similarity model that can serve as the foundation for a recommender system. Furthermore, we’ll show how the underlying distributed compute available in Databricks can help scale the training process and how foundational components of the Lakehouse, Delta Lake and MLflow, can make this process simple and reproducible.
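
Once a model has turned each product image into an embedding vector, the recommendation step can be as simple as a nearest-neighbor lookup. The sketch below assumes cosine similarity and uses made-up three-dimensional embeddings; the linked article's actual model, similarity measure, and Databricks tooling are not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(item_id, embeddings, top_n=3):
    """Return the top_n other items ranked by similarity to item_id."""
    query = embeddings[item_id]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != item_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical image embeddings; real ones would come from a trained network
# and have hundreds of dimensions.
embeddings = {
    "red_dress": [1.0, 0.0, 0.1],
    "crimson_dress": [0.9, 0.1, 0.1],
    "blue_jeans": [0.0, 1.0, 0.0],
}

for item, similarity in most_similar("red_dress", embeddings, top_n=2):
    print(item, round(similarity, 3))
```

A brute-force scan like this is fine for a handful of items; at catalog scale an approximate nearest-neighbor index would normally take its place.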

Click through for the process.

Scoring Azure ML Models in Azure Synapse Analytics

Alex Aleksandrov shows off the PREDICT operator:

We can use Synapse for many activities. We can use it not only for ingesting, querying, storing and visualising data, but for developing machine learning models as well. Of course, one can say that doing data science is another functionality of this platform and this is definitely true. However, in this article, I would like to show you that instead of using Python, one can use T-SQL for doing predictions.

Click through to see how.

AutoML with pycaret

Brendan Tierney looks at the pycaret library:

In this post we will have a look at using the AutoML feature in the PyCaret Python library. AutoML is a popular topic and allows Data Scientists and Machine Learning people to develop potentially optimized models based on their data, all requiring the minimum of input from the Data Scientist. As with all AutoML solutions, care is needed on the eventual use of these models. With various ML and AI legal requirements around the world, it might not be possible to use the output from AutoML in production. Instead, it gives the Data Scientists guidance on creating an optimized model, which can then be deployed in production. This facilitates requirements around model explainability, transparency, human oversight, fairness, risk mitigation and human in the loop.

Read on for a tutorial as well as additional resources.

Form Recognizer Updates

Vinod Kurpad shares some news:

Form Recognizer continues to improve product capabilities with improved models, support for additional document types and containerized solutions that run in the cloud or on premises either connected or fully disconnected for scenarios where containers need to run in an isolated environment. Recent updates to pricing include commitment tiers for customers who have a predictable volume of documents. Starting February 15th, the pricing for Invoices and General Document API will drop to $10 per 1000 pages, an 80% reduction, making it possible for customers to use invoices and the general document APIs for high volume scenarios to significantly lower cost while providing additional value.
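
The arithmetic behind the quoted figures can be checked quickly. The sketch below assumes the 80% reduction is measured against the previous per-1000-page price and ignores commitment tiers; the monthly page volume is a made-up number for illustration.

```python
def implied_previous_price(new_price, reduction_percent):
    """Price before a reduction, given the price after it."""
    return new_price * 100 / (100 - reduction_percent)


new_price_per_1000 = 10   # dollars per 1000 pages, from the announcement
reduction_percent = 80    # from the announcement

old_price_per_1000 = implied_previous_price(new_price_per_1000, reduction_percent)
print(old_price_per_1000)  # 50.0

monthly_pages = 250_000   # hypothetical volume
new_cost = monthly_pages / 1000 * new_price_per_1000
old_cost = monthly_pages / 1000 * old_price_per_1000
print(new_cost, old_cost)  # 2500.0 12500.0
```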

That’s a pretty big improvement.

Multivariate Anomaly Detection in SynapseML

Louise Han has an announcement:

Today, we are excited to announce a wonderful collaborative feature between Multivariate Anomaly Detector and SynapseML, which have joined together to provide a solution for developers and customers to do multivariate anomaly detection in Synapse. This new capability allows you to detect anomalies quickly and easily in very large datasets and databases, perfectly lighting up scenarios like equipment predictive maintenance. For those who are not familiar with predictive maintenance, it is a technique that uses data analysis tools and techniques to detect anomalies in the operation and possible defects in equipment and processes so customers can fix them before they result in failure. Therefore, this new capability will benefit customers who have a huge amount of sensor data across hundreds of pieces of equipment, enabling equipment monitoring, anomaly detection, and even root cause analysis.

Click through for more details and a demonstration on how to use it.

The Architecture of Project Bonsai

Tsuyoshi Matsuzaki takes us through the architecture for Project Bonsai:

Project Bonsai is a reinforcement learning framework for machine teaching in Microsoft Azure.

In generic reinforcement learning (RL), data scientists will combine tools and utilities (such as Gym, RLlib, Ray, etc.) which can be easily customized with familiar Python code and ML/AI frameworks, such as TensorFlow or PyTorch.
But in engineering tasks with machine teaching for autonomous systems or intelligent controls, data scientists will not always explore and tune attributes for AI. In successful practices, the professionals for operations or engineering (non-AI specialists) will tune attributes for some specific control systems (simulations) to train in machine teaching, and data scientists will assist in cases where the problem requires advanced solutions.

Read on to see how it works.

Azure ML and MLOps

I continue a series on Azure ML:

We ended the prior series with model deployment via the Azure ML Studio UI. This is entirely manual and UI-driven. Then, we looked at model deployment via manually-run notebooks. This is still manual but at least offers the possibility of automation as we control the code to run.

From there, we moved to model deployment via the Azure CLI and Python SDK. Now we have the capability to run, train, register, and deploy models via scripts. This leads to the next phase in the process, in which we can perform continuous integration and continuous deployment of models using a tool like Azure DevOps or GitHub Actions. This is where MLOps starts to shine.

Read on for a few thoughts about MLOps and software maturity.
