Category: Data Science

Using complete.cases in R

Steven Sanderson has no time for missing data:

Data analysis in R often involves dealing with missing values, which can significantly impact the quality of your results. The complete.cases function in R is an essential tool for handling missing data effectively. This comprehensive guide will walk you through everything you need to know about using complete.cases in R, from basic concepts to advanced applications.

Using complete.cases to find observations with missing values is great. Using it to eliminate observations with missing values can sometimes be helpful, depending on just how many missing values you have.
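To make the find-versus-eliminate distinction concrete, here is a minimal Python sketch of the same idea. It is an analogue, not R's `complete.cases` itself: the helper names and the sample rows are hypothetical, and it treats `None` and `NaN` as missing.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing, roughly like NA in R."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def complete_cases(rows):
    """Return one boolean per row: True when no field in that row is missing."""
    return [not any(is_missing(v) for v in row) for row in rows]

# Hypothetical sample data: (age, income, city)
rows = [
    (34, 52000.0, "Durham"),
    (None, 61000.0, "Raleigh"),
    (45, float("nan"), "Cary"),
    (29, 48000.0, "Apex"),
]

mask = complete_cases(rows)

# Finding: inspect the rows that have at least one missing value.
incomplete = [row for row, ok in zip(rows, mask) if not ok]

# Eliminating: keep only the complete rows, but check the cost first.
complete = [row for row, ok in zip(rows, mask) if ok]
share_dropped = len(incomplete) / len(rows)
```

Checking `share_dropped` before filtering is the point: if a large share of rows would vanish, dropping them is probably the wrong call.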

Cosine Similarity in Power Query

John Kerski searches for similar sets:

I’ll admit upfront—I am not a data scientist by trade. Instead, I’ve picked up my data science skills over time, learning through a combination of osmosis from talented colleagues and tackling real-world data challenges. It’s been a journey of trial, error, and refinement, as I’ve worked to bridge gaps between complex data science techniques and tools available to me.

Recently, my skills were put to the test when I needed to compare hundreds of Active Directory and SharePoint Groups to find similarities in their memberships. With only Power Query available in the production environment, no Python or R to ease the process, I faced the task of finding a method for finding similarities from scratch in Power Query. In this guide, I’ll walk you through the solution I developed, highlighting the steps that made it possible.

John came up with a very clever solution. By the way, the way I like to explain cosine similarity (as a concept, not the algorithm itself) is as follows.

Back in high school physics, you probably drew vectors and learned that vectors have a direction and a magnitude (length). We drew vectors in two-dimensional space because that’s easy: it’s a line on a sheet of paper and there’s an arrow at the end to denote the direction of that vector. Conceptually, vectors with more than two dimensions behave exactly the same; the difference is that we cannot simply draw them, especially once we get past three-dimensional space (a vector with three elements). But the concept is still there: every vector has a direction and a magnitude.

We use cosine similarity to compare two vectors and see how close those two vectors are in terms of angle (direction), with the idea being that magnitude isn’t as important as angle for determining vector similarity. This is in contrast to a technique like Euclidean distance, which is sensitive to the magnitudes of the vectors as well as the angle between them.
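A small Python sketch of that contrast, with made-up vectors. This illustrates the concept only and is not John's Power Query implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (undefined for zero-length vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the endpoints of a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v = [1.0, 2.0, 3.0]
w = [10.0, 20.0, 30.0]  # same direction as v, ten times the magnitude
u = [3.0, 2.0, 1.0]     # same magnitude as v, different direction

cos_vw = cosine_similarity(v, w)
cos_vu = cosine_similarity(v, u)
dist_vw = euclidean_distance(v, w)
dist_vu = euclidean_distance(v, u)
```

Cosine similarity rates `w` as the closer match to `v` because the two point the same way, while Euclidean distance rates `u` as closer because its endpoint is nearer.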

Building and Deploying a Streamlit Data App

Ivan Palomares Carrascosa deploys an app:

This article will navigate you through the deployment of a simple machine learning (ML) model for regression using Streamlit. This novel platform streamlines and simplifies deploying artifacts like ML systems as Web services.

I’ll leave aside my aside that linear regression isn’t machine learning. Click through to see how you can build a simple application in approximately 60 lines of code. This example shows off some of the simplicity in Streamlit’s design.

Churn Analysis using Logistic Regression in Python

Daniel Calbimonte takes us through a churn analysis scenario:

This article explains how to analyze the data using Python and perform customer churn analysis to determine why customers stop using a service.

Read on for the article. If you want to dig deeper into churn analysis, I can recommend a book entitled Fighting Churn with Data. Its focus is more on categorical and numerical analysis than on using statistical classification techniques like logistic regression to identify churn factors. That makes it easier to digest for non-statisticians, especially because most of the code is SQL.

An Explanation of Boosting, Bagging, and Stacking Ensembles

Ivan Palomares Carrascosa disambiguates three terms:

Unity makes strength. This well-known motto perfectly captures the essence of ensemble methods: one of the most powerful machine learning (ML) approaches -with permission from deep neural networks- to effectively address complex problems predicated on complex data, by combining multiple models for addressing one predictive task. This article describes three common ways to build ensemble models: boosting, bagging, and stacking. Let’s get started!

My explanation, which makes sense for people who grew up during the 1980s: bagging is Voltron, boosting is Rocky, and stacking is three raccoons in a trench coat.

An Overview of the Naive Bayes Class of Algorithms

Harris Amjad takes us through a rather useful class of algorithms for classification:

As AI and Machine Learning have increased in popularity, especially Large Language Models, more professionals have explored how these systems work. Unfortunately, some put the cart before the horse, where they take on more complex algorithms before learning to pave the foundation, resulting in faded interest in the topic. This tip will introduce a simple probabilistic, yet powerful classifier, the Naïve Bayes Model, and implement it in Python.

I like using the Naive Bayes variants, despite the fact that they are not Bayesian and arguably aren’t very naive. The reason I like to use this class of algorithms is that it’s fast, easy, and gives you a useful baseline for quality. If you need to meet some specific quality threshold (say, accuracy > 85% or F1-score above 0.8), you can get an answer quickly with Naive Bayes. If that answer is anywhere near your threshold, the problem is likely solvable. If your answer is way below the threshold, it’s probably not worth spending the time or compute effort trying out a variety of other algorithms.
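As a rough sketch of that baseline workflow, here is a from-scratch Gaussian Naive Bayes in plain Python. It is not the implementation from the linked tip: the toy data, function names, and the choice of the Gaussian variant are my assumptions, and the 0.85 cutoff is just the example figure above.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature means and variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps variances away from zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    def log_score(params):
        prior, means, variances = params
        score = math.log(prior)
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_score(model[label]))

# Hypothetical toy data: two well-separated classes.
X_train = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y_train = ["a", "a", "a", "b", "b", "b"]
X_test = [[1.1, 2.0], [3.1, 4.0], [0.9, 2.2], [3.0, 3.8]]
y_test = ["a", "b", "a", "b"]

model = fit_gaussian_nb(X_train, y_train)
predictions = [predict(model, row) for row in X_test]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)

THRESHOLD = 0.85  # example quality bar; substitute your own requirement
baseline_is_promising = accuracy > THRESHOLD
```

On real data you would reach for a library implementation and a proper train/test split. The sketch only shows how little machinery the baseline needs before you compare it to your threshold.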

A Primer on Outlier Detection

Jayita Gulati provides an overview:

Anomaly detection means finding patterns in data that are different from normal. These unusual patterns are called anomalies or outliers. In large datasets, finding anomalies is harder. The data is big, and patterns can be complex. Regular methods may not work well because there is so much data to look through. Special techniques are needed to find these rare patterns quickly and easily. These methods help in many areas, like banking, healthcare, and security.

Let’s have a concise look at anomaly detection techniques for use on large scale datasets. This will be no-frills, and be straight to the point in order for you to follow up with additional materials where you see fit.

Outlier detection is a large and interesting space. I suppose I should shill for myself a little bit and note that I wrote a book on the topic. This post provides some quick guidance around outlier detection techniques and applications, and serves as a fine starting point for digging in further.
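For a taste of what a simple univariate technique looks like, here is a minimal Python sketch of a median absolute deviation (MAD) check. This is my choice of illustration, not necessarily a method covered in the linked post. The readings are invented, and the 0.6745 scaling and 3.5 cutoff are the conventional values for the modified z-score.

```python
import statistics

def mad_outliers(values, cutoff=3.5):
    """Return values whose modified z-score exceeds the cutoff."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - median) / mad > cutoff]

# Hypothetical sensor readings with one obvious anomaly.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
outliers = mad_outliers(readings)
```

Using the median rather than the mean matters here: a single extreme reading inflates the mean and standard deviation enough that it can hide itself, while the median and MAD barely move.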

Monitoring R Models in Production with Vetiver

Myles Mitchell continues a series on Vetiver:

In those blogs, we introduced the {vetiver} package and its use as a tool for streamlined MLOps. Using the {palmerpenguins} dataset as an example, we outlined the steps of training a model using {tidymodels} then converting this into a {vetiver} model. We then demonstrated the steps of versioning our trained model and deploying it into production.

Getting your first model into production is great! But it’s really only the beginning, as you will now have to carefully monitor it over time to ensure that it continues to perform as expected on the latest data. Thankfully, {vetiver} comes with a suite of functions for this exact purpose!

Click through for the full story.

A Survey of Predictive Analytics Techniques

Akmal Chaudhri tries a bunch of things:

In this short article, we’ll explore loan approvals using a variety of tools and techniques. We’ll begin by analyzing loan data and applying Logistic Regression to predict loan outcomes. Building on this, we’ll integrate BERT for Natural Language Processing to enhance prediction accuracy. To interpret the predictions, we’ll use SHAP and LIME explanation frameworks, providing insights into feature importance and model behavior. Finally, we’ll explore the potential of Natural Language Processing through LangChain to automate loan predictions, using the power of conversational AI.

Click through for the notebook, as well as an overview of what the notebook includes. I don’t particularly like word clouds as the “solution” in the BERT example, though without real data to perform any sort of NLP, there’s not much you can meaningfully do.

RandomWalker 0.2.0 Release

Steven Sanderson makes an announcement:

In the ever-evolving landscape of R programming, packages continually refine their capabilities to meet the growing demands of data analysts and researchers. Today, we’re excited to announce the release of RandomWalker version 0.2.0, a minor update that brings significant enhancements to time series analysis and random walk simulations.

RandomWalker has been a go-to package for R users in finance, economics, and other fields dealing with time-dependent data. This latest release introduces new functions and improvements that promise to streamline workflows and provide deeper insights into time series data.

Read on to see what has changed.
