Category: Data Science

Generating Random Numbers in R

Holger von Jouanne-Diedrich brings the noise:

In data science, we try to find, sometimes well-hidden, patterns (= signal) in often seemingly random data (= noise). Pseudo-Random Number Generators (PRNG) try to do the opposite: hiding a deterministic data generating process (= signal) by making it look like randomness (= noise). If you want to understand some basics behind the scenes of this fascinating topic, read on!

Click through for an explanation of the process.
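
To make the "deterministic process that looks like noise" idea concrete, here is a minimal Python sketch of a linear congruential generator, one classic PRNG construction. It is not necessarily the generator the linked post discusses, and the function name `lcg` is hypothetical; the constants are the widely published "Numerical Recipes" parameters, used here only for illustration.

```python
def lcg(seed, n, a=1664525, c=1013904223, m=2**32):
    """Return n pseudo-random floats in [0, 1) from the given seed."""
    state = seed
    out = []
    for _ in range(n):
        # The deterministic "signal": a fixed recurrence on the state.
        state = (a * state + c) % m
        # Scale to [0, 1) so the output resembles uniform noise.
        out.append(state / m)
    return out


first = lcg(seed=42, n=5)
second = lcg(seed=42, n=5)
print(first)
print(first == second)  # the same seed reproduces the same sequence
```

Reusing the seed reproduces the sequence exactly, which is the sense in which the noise is only apparent.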

Comments closed

k-gram Language Models in R

Valerio Gherardi takes us through the concept of k-grams:

The post is structured as follows: we start by giving a succinct theoretical introduction to k-gram models. Subsequently, we illustrate how to train a k-gram model in R using kgrams, and explain how to use the standard perplexity metric for model evaluation or tuning. Finally, we use our trained model to generate some random text at different temperatures.

This goes into some depth on the topic and is worth giving a careful read.
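
As a rough illustration of the three steps in that outline (train, evaluate with perplexity, sample at a temperature), here is a small Python sketch for the k = 2 case. It does not use the kgrams package or mirror its API. The add-one smoothing, the toy corpus, and all function names are assumptions made for the example, and the temperature rule (raising probabilities to the power 1/T) is one common convention rather than a claim about how kgrams does it.

```python
import math
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count next-word frequencies for each preceding word (k = 2)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def prob(counts, vocab, prev, nxt):
    """Add-one smoothed estimate of P(nxt | prev)."""
    return (counts[prev][nxt] + 1) / (sum(counts[prev].values()) + len(vocab))


def perplexity(counts, vocab, tokens):
    """exp of the average negative log-probability per predicted token."""
    log_sum = sum(math.log(prob(counts, vocab, p, n))
                  for p, n in zip(tokens, tokens[1:]))
    return math.exp(-log_sum / (len(tokens) - 1))


def sample_next(counts, vocab, prev, temperature, rng):
    """Sample with weights proportional to P(w | prev) ** (1 / temperature)."""
    weights = [prob(counts, vocab, prev, w) ** (1.0 / temperature) for w in vocab]
    return rng.choices(vocab, weights=weights, k=1)[0]


corpus = "the cat sat on the mat and the dog sat on the rug".split()
vocab = sorted(set(corpus))
model = train_bigram(corpus)

# Lower perplexity means the model finds the text less surprising.
print(perplexity(model, vocab, "the cat sat on the rug".split()))
print(perplexity(model, vocab, "rug the on sat cat the".split()))

# Temperatures below 1 sharpen the distribution; above 1 flatten it.
rng = random.Random(0)
word = "the"
generated = [word]
for _ in range(8):
    word = sample_next(model, vocab, word, temperature=0.5, rng=rng)
    generated.append(word)
print(" ".join(generated))
```

The word order seen in training scores a lower perplexity than the scrambled one, which is the behavior you want from the metric when tuning a model.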

Comments closed

The Basics of k-Means Clustering

Nathaniel Schmucker explains some of the principles of k-means clustering:

k-Means is easy to implement. In R, you can use the function kmeans() to quickly deploy an efficient k-Means algorithm. On datasets of reasonable size (thousands of rows), the kmeans function runs in fractions of a second.

k-Means is easy to interpret (in 2 dimensions). If you have two features of your k-Means analysis (e.g., you are grouping by length and width), the result of the k-Means algorithm can be plotted on an xy-coordinate system to show the extent of each cluster. It’s easy to visually inspect the assignment to see if the k-Means analysis returned a meaningful insight. In more dimensions (e.g., length, width, and height) you will need to either create a 3D plot, summarize your features in a table, or find another alternative to describing your analysis. This loses the intuitive power that a 2D k-Means analysis has in convincing you or your audience that your analysis should be trusted. It’s not to say that your analysis is wrong; it simply takes more mental focus to understand what your analysis says.

The k-Means analysis, however, is not always the best choice. k-Means does well on data that naturally falls into spherical clusters. If your data has a different shape (linear, spiral, etc.), k-Means will force clustering into circles, which can result in outputs that defy human expectations. The algorithm is not wrong; we have fed the algorithm data it was never intended to understand.

There’s a lot of depth in this article which makes it really interesting.
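
For a feel of what happens inside a k-means run, here is a bare-bones Python sketch of Lloyd's algorithm on 2D points. It is not R's kmeans() function and makes no claim to match its defaults; the name `kmeans_sketch`, the fixed iteration count, and the toy data are all assumptions for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans_sketch(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels


# Two well-separated, roughly round groups: the friendly case for k-means.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels = kmeans_sketch(points, k=2)
print(centers)
print(labels)
```

Because every point is assigned purely by Euclidean distance to a center, the resulting regions are round-ish by construction, which is why linear or spiral data gets carved up in ways that defy expectations.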

Comments closed

Bayesian vs Frequentist Approaches to Machine Learning

Ajit Jaokar has an interesting series. Here’s part one:

The arguments / discussions between the Bayesian vs frequentist approaches in statistics are long running. I am interested in how these approaches impact machine learning. Often, books on machine learning combine the two approaches, or in some cases, take only one approach. This does not help from a learning standpoint.  

So, in this two-part blog we first discuss the differences between the Frequentist and Bayesian approaches. Then, we discuss how they apply to machine learning algorithms.

Part two extends from there:

Sampled from a distribution: Many machine learning algorithms make assumptions that the data is sampled from a frequency. For example, linear regression assumes gaussian distribution and logistic regression assumes that the data is sampled from a Bernoulli distribution. Hence, these algorithms take a frequentist approach.

My biases push me toward Bayesian approaches, and I really like what I see in Stan, but these techniques do often require a lot more processing power.
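
A tiny worked example of the difference, in Python: estimating the probability of heads from a handful of Bernoulli coin flips. This is my own illustration rather than something from the series. The function names are hypothetical, and the uniform Beta(1, 1) prior is an assumption. The frequentist answer here is the maximum-likelihood estimate; the Bayesian answer is the mean of the Beta posterior.

```python
def mle_estimate(heads, flips):
    """Frequentist point estimate: the observed proportion of heads."""
    return heads / flips


def posterior_mean(heads, flips, prior_a=1.0, prior_b=1.0):
    """Mean of the Beta(prior_a + heads, prior_b + tails) posterior."""
    return (prior_a + heads) / (prior_a + prior_b + flips)


# Three heads in three flips: the estimates disagree sharply.
print(mle_estimate(3, 3))
print(posterior_mean(3, 3))

# With a lot of data the prior washes out and the two nearly agree.
print(mle_estimate(600, 1000))
print(posterior_mean(600, 1000))
```

This conjugate case is cheap because the posterior has a closed form. The extra processing power comes in when it does not, and you need a sampler such as the ones Stan provides.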

1 Comment

Analyzing XGBoost Training Reports

Simon Zamarin, et al., walk us through using XGBoost reports in Amazon SageMaker Debugger:

In 2019, AWS unveiled Amazon SageMaker Debugger, a SageMaker capability that enables you to automatically detect a variety of issues that may arise while a model is being trained. SageMaker Debugger captures model state data at specified intervals during a training job. With this data, SageMaker Debugger can detect training issues or anomalies by leveraging built-in or user-defined rules. In addition to detecting issues during the training job, you can analyze the captured state data afterwards to evaluate model performance and identify areas for improvement. This task is made easier with the newly launched XGBoost training report feature. With a minimal amount of code changes, SageMaker Debugger generates a comprehensive report outlining key information that you can use to evaluate and improve the model.

This post shows you an end-to-end example of training an XGBoost model on SageMaker and how to enable the automatic XGBoost report functionality in SageMaker Debugger to quickly and easily evaluate model performance and identify areas of improvement for your model. Even if you don’t have a lot of data science experience, you can still gauge how well the model performs and identify areas of improvement based on information provided by the report. The code from this post is available in the GitHub repo.

Click through for an example of this in action.

Comments closed

Applied ML Prototypes

Alex Bleakley and Santiago Giraldo announce Applied ML Prototypes:

To directly address these challenges, we’ve released Applied ML Prototypes (AMPs) — a revolutionary new way of developing and shipping enterprise ML use cases — which provide complete ML projects that can be deployed with one click directly from Cloudera Machine Learning. AMPs enable data scientists to go from an idea to a fully working ML use case in a fraction of the time, with an end-to-end framework for building, deploying, and monitoring business-ready ML applications instantly. 

AMPs move the starting line for any ML project by enabling data scientists to start with a full end-to-end project developed for a similar use case, including a trained and deployed ML model, as well as prebuilt predictive business applications, out of the box. This means that ML development teams can tackle their own ML business use cases more quickly, from those involving churn modeling, to sentiment analysis, to anomaly detection and beyond.

Getting past the marketing fluff, there are some interesting ideas here.

Comments closed

Tidying the Confusion Matrix in R

Gary Hutson has a new package for us:

The aim of the package is to make it easier to take the list outputs from caret and collapse them down into row-by-row entries, specifically designed for storing the outputs in a database or row-by-row data frame.

This is something that the CARET library does not have as a default and I have designed this to allow the confusion matrix outputs to be stored in a data frame or database, as many a time we want to track the ML outputs and fits over time to monitor feature slippage and changes in the underlying patterns of the data.

I like the way caret shows the confusion matrix when I’m reviewing results on my own, but I definitely appreciate efforts to make it easier to handle within code, similar to how broom tidies linear regression outputs. H/T R-bloggers
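
To show what "one row per model run" looks like, here is a hedged Python sketch that collapses a binary confusion matrix into a single flat record. It is not the package's API or caret's output format; the function name `tidy_confusion`, the field names, and the `run_id` label are all hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def tidy_confusion(actual, predicted, positive, run_id):
    """Collapse a binary confusion matrix into one flat, storable row."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "run_id": run_id,
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "accuracy": safe_ratio(tp + tn, len(pairs)),
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
    }


actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]
row = tidy_confusion(actual, predicted, positive="yes", run_id="2021-07-01")
print(row)
```

Append one such row per training run to a table and you can watch the metrics drift over time, which is the tracking use case the quote describes.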

Comments closed

Research with R and Production with Python

Matt Dancho and Jarrell Chalmers lay out an argument:

The decision can be challenging because both Python and R have clear strengths.

R is exceptional for Research – Making visualizations, telling the story, producing reports, and making MVP apps with Shiny. From concept (idea) to execution (code), R users tend to be able to accomplish these tasks 3X to 5X faster than Python users, making them very productive for research.

Python is exceptional for Production ML – Integrating machine learning models into production systems where your IT infrastructure relies on automation tools like Airflow or Luigi.

They make a pretty solid argument. I’ve launched successful R-based projects using SQL Server Machine Learning Services, but outside of ML Services, my team’s much more likely to deploy APIs in Python, and we’re split between Dash and Shiny for visualization. H/T R-bloggers

Comments closed

Polychoric Correlation in Practice

Jack Davis explains the concept of polychoric correlation:

In polychoric correlation, we don’t need to know or specify where the boundary between “good” and “very good” is, just that it exists. The distribution of the ordinal responses, along with the assumption that the latent values follow a normal distribution, is enough that the polychor() function in the polycor R package can do that for us. In most practical cases, you don’t even need to know where the cutoffs are, but they are useful for demonstration that the method works.

Polychoric correlation estimates the correlation between such latent variables as if you actually knew what those values were. In the examples given, we start with the latent variables and use cutoffs to set them into bins, and then use polychoric on the artificially binned data. In any practical use case, the latent data would be invisible to you, and the cutoffs would be determined by whoever designed the survey.

Read on for a demonstration of the process in R.
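
The setup described in the quote (start with latent variables, then bin them with cutoffs) can be sketched in Python as follows. Note that this does not compute a polychoric correlation and does not use polychor(). It only simulates the binning and shows why a plain Pearson correlation on the binned values tends to understate the latent relationship, which is the gap polychoric correlation is meant to close. The latent correlation of 0.7, the cutoffs, the seed, and the sample size are all assumptions for the example.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def to_ordinal(value, cutoffs):
    """Return how many cutoffs the latent value exceeds (0 = lowest category)."""
    return sum(value > c for c in cutoffs)


rng = random.Random(1)
rho = 0.7  # assumed correlation between the two latent normal variables
latent_x, latent_y = [], []
for _ in range(5000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    latent_x.append(z1)
    latent_y.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)

# The survey designer's cutoffs turn each latent score into 4 categories.
cutoffs = [-1.0, 0.0, 1.0]
binned_x = [to_ordinal(v, cutoffs) for v in latent_x]
binned_y = [to_ordinal(v, cutoffs) for v in latent_y]

latent_r = pearson(latent_x, latent_y)
binned_r = pearson(binned_x, binned_y)
print(latent_r)  # close to the assumed rho
print(binned_r)  # attenuated by the binning
```

In a real survey only the binned columns exist, so the first number is unobservable; the polychoric estimate aims to recover it from the second kind of data.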

Comments closed

K-Means and K-Medoids Clustering

Niti Sharma explains two clustering algorithms:

K-means and k-medoids are methods used in partitional clustering algorithms whose functionality works based on specifying an initial number of groups or, more precisely, iteratively by reallocation of objects among groups.

The algorithm works by first segregating all the points into an already selected number of clusters. The process is carried out by measuring the distance between the point and the center of each cluster. And because k-means can function only in the Euclidean space, the functionality of the algorithm is limited. Despite the drawbacks or shortcomings the algorithm possesses, k-means is still one of the most powerful tools used in clustering. The applications can be seen widely used in multiple fields – physical sciences, natural language processing (NLP), and healthcare.

k-means is a fairly common algorithm, but you hear less about k-medoids, the more robust alternative, which anchors each cluster on an actual data point rather than on a mean.
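
A quick Python illustration of that robustness, using a single one-dimensional cluster with one outlier. The function names are hypothetical, and this shows only the "pick a center" step, not a full k-medoids implementation.

```python
def mean_center(points):
    """k-means style center: the arithmetic mean of the cluster."""
    return sum(points) / len(points)


def medoid(points):
    """k-medoids style center: the member with the smallest total distance."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))


cluster = [1.0, 2.0, 3.0, 4.0, 100.0]
print(mean_center(cluster))  # 22.0, dragged toward the outlier
print(medoid(cluster))       # 3.0, still in the middle of the bulk
```

Since the medoid only needs pairwise distances, the same idea carries over to dissimilarity measures other than Euclidean distance, which speaks to the limitation mentioned in the quote.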

Comments closed