Category: Machine Learning

Text Clustering with Python

Published 2022-07-28 by Kevin Feasel

Luke Menzies takes us through the gensim library:

An interesting branch of machine learning is Natural Language Processing (NLP). As the name suggests, it involves training machines to detect patterns in language using algorithms. NLP is quite often referred to as text analytics, but it is actually more impressive than that. It examines vectorised patterns, looking not only at the positioning of elements but also at what each element means in the context of neighbouring elements within the vector. In a nutshell, this technique can be extended beyond text to linguistic patterns in general and even contextual patterns. Nevertheless, its primary use in the machine learning world is to analyse text.
This article will focus on an interesting application of NLP which involves the clustering of text. Clustering is a popular unsupervised machine learning technique used for segmentation or grouping of data. It is a very powerful tool that is used across a variety of industries. However, it is rare to hear of clustering being applied to text. This can be achieved using NLP functions, combined with clustering algorithms that can handle non-Euclidean distances.

Read on for an overview of the process and an example of combining DBSCAN with word2vec to cluster phrases.
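
To give a feel for the idea before clicking through, here is a minimal sketch of density-based clustering over phrase vectors using cosine distance. It is not Luke's code: the three-dimensional vectors are hand-made stand-ins for real word2vec embeddings, the phrases are invented, and the tiny DBSCAN routine is written from scratch so the example has no dependencies. In practice you would more likely train embeddings with gensim and use a library implementation such as scikit-learn's DBSCAN with a cosine metric.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: a non-Euclidean distance suited to embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dbscan(points, eps, min_samples, distance):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Every point within eps of point i, including i itself.
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE          # may be relabelled as a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)   # core point: keep expanding
    return labels

# Hypothetical phrases with made-up "embeddings" (assumption: real vectors
# would come from a trained word2vec model and have far more dimensions).
phrases = {
    "refund my order":      [0.90, 0.10, 0.00],
    "money back please":    [0.80, 0.20, 0.10],
    "return this item":     [0.85, 0.15, 0.05],
    "reset my password":    [0.10, 0.90, 0.10],
    "cannot log in":        [0.00, 0.80, 0.20],
    "forgot login details": [0.05, 0.85, 0.10],
    "nice weather today":   [0.10, 0.10, 0.90],
}

labels = dbscan(list(phrases.values()), eps=0.05, min_samples=2,
                distance=cosine_distance)
for phrase, label in zip(phrases, labels):
    print(f"{label:>2}  {phrase}")
```

The two groups of similar phrases end up in separate clusters and the unrelated phrase is flagged as noise (label -1), which is one reason DBSCAN is appealing here: you do not have to choose the number of clusters up front.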

Comments closed

shapviz Package Updates

Published 2022-07-13 by Kevin Feasel

Michael Mayer announces updates to shapviz:

In a recent post, I introduced the initial version of the “shapviz” package. Its motto: do one thing, but do it well: visualize SHAP values.
The initial community feedback was very positive, and a couple of things have been improved in version 0.2.0. Here are the main changes:

Read on for those changes.

The Seedy Underbelly of Machine Learning Fitting

Published 2022-07-08 by Kevin Feasel

John Mount is not impressed with a fair amount of machine learning:

For this to actually happen we need the actual system to be in our concept space, a lot of training data, and an abundance of caution.
In practice what we see more and more is the training procedure in fact attacks the evaluation procedure. It doesn’t just improve the quality of the fit artifact, but through mere optimization accidentally exploits weaknesses in the measurement system itself. When this happens, fitting does the following.

In ML training, we often accidentally “teach to the test” by comparing models via test data, which over time selects for models that fit the test data better. As John notes, this can happen in two separate ways, and if you don’t define your optimization strategy correctly, you can accidentally train models that optimize for unrealistic things. A classic example is the neural network that could distinguish malignant tumors from non-malignant ones, not because of any property of the tumors themselves, but because the malignant tumor images all had rulers in them and the non-malignant images did not. Read the whole thing for a second pitfall you can hit when training models.
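
The "teaching to the test" effect is easy to reproduce with a toy simulation. The sketch below is my own illustration rather than anything from John's post: every "model" is a coin flipper with no predictive power, the data sizes and seeds are arbitrary assumptions, and the only thing we do is pick whichever model scores best on a small test set.

```python
import random

N_TEST, N_FRESH, N_MODELS = 50, 2000, 200   # arbitrary sizes for illustration

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def model_predictions(seed, n):
    """A 'model' that ignores its input and guesses 0 or 1 at random.

    The seed makes each model's guesses reproducible, so asking for more
    predictions later extends the same sequence.
    """
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n)]

# Random binary labels: a small test set and a larger set of fresh data.
label_rng = random.Random(12345)
test_labels = [label_rng.randint(0, 1) for _ in range(N_TEST)]
fresh_labels = [label_rng.randint(0, 1) for _ in range(N_FRESH)]

# "Model selection": keep whichever coin flipper looks best on the test set.
best_seed, best_test_acc = None, -1.0
for seed in range(N_MODELS):
    acc = accuracy(model_predictions(seed, N_TEST), test_labels)
    if acc > best_test_acc:
        best_seed, best_test_acc = seed, acc

# Score the winner on data that played no part in the selection.
winner = model_predictions(best_seed, N_TEST + N_FRESH)
fresh_acc = accuracy(winner[N_TEST:], fresh_labels)

print(f"winner's accuracy on the test set:  {best_test_acc:.2f}")
print(f"winner's accuracy on fresh data:    {fresh_acc:.2f}")
```

The winner looks clearly better than chance on the test set purely because we searched over many candidates, while its accuracy on fresh data falls back to roughly 50%. Repeatedly comparing real models against the same test data produces a milder version of the same selection effect.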

PHI De-Identification in Databricks with NLP

Published 2022-06-24 by Kevin Feasel

Amir Kermany, et al., share a set of notebooks:

John Snow Labs, the leader in Healthcare natural language processing (NLP), and Databricks are working together to help organizations process and analyze their text data at scale with a series of Solution Accelerator notebook templates for common NLP use cases. You can learn more about our partnership in our previous blog, Applying Natural Language Processing to Health Text at Scale.
To help organizations automate the removal of sensitive patient information, we built a joint Solution Accelerator for PHI removal that builds on top of the Databricks Lakehouse for Healthcare and Life Sciences. John Snow Labs provides two commercial extensions on top of the open-source Spark NLP library — both of which are useful for de-identification and anonymization tasks — that are used in this Accelerator:

This is a really interesting scenario.

Saving and Loading a Keras Model

Published 2022-06-23 by Kevin Feasel

Jason Brownlee made it to a savepoint in time:

Given that deep learning models can take hours, days and even weeks to train, it is important to know how to save and load them from disk.
In this post, you will discover how you can save your Keras models to file and load them up again to make predictions.
After reading this tutorial you will know:
– How to save model weights and model architecture in separate files.
– How to save model architecture in both YAML and JSON format.
– How to save model weights and architecture into a single file for later use.

Read on for an updated step-by-step tutorial.

Example Data Pre-Processing Activities

Published 2022-06-22 by Kevin Feasel

Aayush Srivastava takes us through some pre-processing activities in machine learning:

After selecting the raw data for ML training, the most important task is data pre-processing. In a broad sense, data pre-processing converts the selected data into a form we can work with or feed to ML algorithms. We always need to pre-process our data so that it meets the expectations of the machine learning algorithm.

Read on for examples of pre-processing steps and how pre-processing differs from data cleaning.

Normalization Layers in Deep Learning Models

Published 2022-06-16 by Kevin Feasel

Zhe Ming Chng explains why data normalization matters in data science:

You’ve probably been told to standardize or normalize inputs to your model to improve performance. But what is normalization and how can we implement it easily in our deep learning models to improve performance? Normalizing our inputs aims to create a set of features that are on the same scale as each other, which we’ll explore more in this article.
Also, thinking about it, in neural networks, the output of each layer serves as the inputs into the next layer, so a natural question to ask is: If normalizing inputs to the model helps improve model performance, does standardizing the inputs into each layer help to improve model performance too?

Click through for the tutorial.
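
As a rough illustration of what "standardizing the inputs into each layer" means, here is a minimal sketch of the training-time arithmetic behind batch normalization in plain Python. It is not taken from the tutorial, the sample batch is invented, and it deliberately leaves out the running statistics that a real layer (for example, Keras's BatchNormalization) keeps for inference.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature across the batch, then scale and shift.

    batch: list of rows, one row per sample, one value per feature.
    gamma, beta: per-feature scale and shift (learned in a real network).
    eps: small constant that avoids division by zero.
    """
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(n_features)]
    return [
        [gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
         for j in range(n_features)]
        for row in batch
    ]

# Two features on very different scales (made-up numbers).
batch = [[1.0, 200.0],
         [2.0, 400.0],
         [3.0, 600.0]]

normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print([round(value, 3) for value in row])
```

After normalization both columns have a mean of about zero and a variance of about one, so neither feature dominates simply because of its units. The gamma and beta parameters let the network undo or adjust the normalization if that helps the fit.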

Visualizing SHAP Values in R with shapviz

Published 2022-06-14 by Kevin Feasel

Michael Mayer announces a new package:

SHAP (SHapley Additive exPlanations, Lundberg and Lee, 2017) is an ingenious way to study black box models. SHAP values decompose – as fairly as possible – predictions into additive feature contributions.
When it comes to SHAP, the Python implementation is the de-facto standard. It not only offers many SHAP algorithms, but also provides beautiful plots. In R, the situation is a bit more confusing. Different packages contain implementations of SHAP algorithms.

Read on to see how shapviz works, how to install it, and the types of visuals you can create from it.
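
To make "additive feature contributions" concrete, here is a small sketch that computes exact Shapley values for a three-feature toy model by brute force. It is written in Python purely for illustration and has nothing to do with shapviz's own code, which is an R package for visualising values computed elsewhere. The model, the input, and the all-zero baseline are made up, and treating "absent" features as baseline values is a simplifying assumption; real SHAP implementations average over a background dataset and use much faster algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside the coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features, so this is
    only practical for tiny examples.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi += weight * (value(coalition | {i}) - value(coalition))
        contributions.append(phi)
    return contributions

def toy_model(features):
    # One purely additive term plus one interaction between two features.
    return 2.0 * features[0] + features[1] * features[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

phis = shapley_values(toy_model, x, baseline)
print("contributions:", [round(p, 6) for p in phis])
print("sum of contributions:", round(sum(phis), 6))
print("prediction minus baseline:", toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction, and the interaction term is split evenly between the two features involved. That additive property is what makes waterfall and force plots possible.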

The Value of MLOps

Published 2022-06-07 by Kevin Feasel

Tori Tompkins explains what MLOps is and why it’s valuable:

An ML project will typically begin in an ‘Explore Phase’ where a data scientist or team of data scientists will explore the data they currently have and experiment with models, algorithms, parameters and features. MLOps at this stage is responsible for supplying data scientists with the environment they need to achieve this. One way this can be done is by leveraging a feature store.
A feature store is a tool for storing commonly used features. As data scientists create new features, they can log these into feature stores such as Feast and Databricks Feature Store and then reuse them across teams and projects. This benefits teams in multiple ways: reducing compute times for both training and inference, providing consistency in common features, and reducing the effort of creating complex logic.

Read on for information about all six phases.

ML Algorithms a Poor Fit for Predictive Caches

Published 2022-06-02 by Kevin Feasel

Pete Warden describes an interesting phenomenon:

I’ve been working on a new research paper, and a friend gave me the feedback that he was confused by the statement “memory accesses can be accurately predicted at the compilation stage” for machine learning workloads, and that this made them a poor fit for conventional processor architectures with predictive caches. I realized that this was received wisdom among the ML engineers I know, but I wasn’t aware of any papers that discuss this point. I put out a request for help on Twitter, but while there were a lot of interesting resources in the answers, I still couldn’t find any papers that focused on what feels like an important property for machine learning systems. With that in mind, I wanted to at least describe the issue as best as I can in this blog post, so there’s a trail of breadcrumbs for anyone else interested in how system designs might need to change to accommodate ML.

Read on for the explanation. My reading here is that this is a downside to having general-purpose compute: you run the risk of sub-optimal performance in certain circumstances, like training models using certain types of ML algorithms.
