Press "Enter" to skip to content

Category: Data Science

K-Means and K-Medoids Clustering

Niti Sharma explains two clustering algorithms:

K-means and k-medoids are methods used in partitional clustering algorithms whose functionality works based on specifying an initial number of groups or, more precisely, iteratively by reallocation of objects among groups.

The algorithm works by first segregating all the points into an already selected number of clusters. The process is carried out by measuring the distance between the point and the center of each cluster. And because k-means can function only in the Euclidean space, the functionality of the algorithm is limited. Despite the drawbacks or shortcomings of algorithm possesses, k-means is still one of the most powerful tools used in clustering. The applications can be seen widely used in multiple fields – physical sciences, natural language processing (NLP), and healthcare.

k-means is a fairly common algorithm, but you hear less about k-medoids—it’s the more robust alternative to k-means.

Comments closed

Reporting on Correlation Analysis in R

Petr Baranovskiy continues a series on correlation analysis using R:

This is the second part of the Correlation Analysis in R series. In this post, I will provide an overview of some of the packages and functions used to perform correlation analysis in R, and will then address reporting and visualizing correlations as text, tables, and correlation matrices in online and print publications.

Read the whole thing.

Comments closed

Model Post-Processing with insight

The easystats team talks about the insight package in R:

We are talking about the insight package. It is what allows other packages, like easystats (parameterseffectsizeperformancereport, …) or ggstatsplotsjstats or modelsummary to be as powerful as they are, supporting tons of different R models. So why make you life hard when you can be like them, and rely on insight?

It is made for developers (and users) that do some postprocessing of different models (e.g., extracting stuff like parameters, values, data, names, specifications, predictions, priors, etc.), whether it is to nicely display their results or to do further computation.

Click through for an example of what it does and how it works. H/T R-bloggers

Comments closed

Determining a Good Test Set Size

John Mount thinks about test set size:

In this note we will answer “what is a good test set size?” three ways.

– The usual practical answer.
– A decision theory answer.
– A novel variational answer.

Each of these answers is a bit different, as they are solved in slightly different assumed contexts and optimizing different objectives. Knowing all 3 solutions gives us some perspective on the problem.

My rule of thumb is that I want it to be as small as possible while containing the highest likelihood of hitting all real-world scenarios enough times to provide a valid comparison. This conversely maximizes the size of the training data set, giving us the best chance of seeing the widest variety of scenarios we can during the formative phase.

And as usual, John goes way deeper than my rules of thumb. I like this post a lot.

Comments closed

Power BI: New Features for Data Analysts

Tomaz Kastrun looks at some new functionality in Power BI which might interest data analysts:

Small multiples is a layout of small charts over a grouping variable, aligned side-by-side, sharing common scale, that is scaled to fit all the values (by grouping or categorical variable) on multiple smaller graphs. Analyst should immediately see and tell the difference between the grouping variable (e.g.: city, color, type,…) give a visualized data.

In Python, we know this as trellis plot or FacetGrid (seaborn) or simply subplots (Matplotlib).

In R, this is usually referred to as facets (ggplot2).

Read on for an example of this, as well as two other features, as well as how you might have worked with these ideas in Python and R.

Comments closed

Gradient Descent in R

Holger von Jouanne-Diedrich lays out the basics of gradient descent:

Gradient Descent is a mathematical algorithm to optimize functions, i.e. finding their minima or maxima. In Machine Learning it is used to minimize the cost function of many learning algorithms, e.g. artificial neural networks a.k.a. deep learning. The cost function simply is the function that measures how good a set of predictions is compared to the actual values (e.g. in regression problems).

The gradient (technically the negative gradient) is the direction of steepest descent. Just imagine a skier standing on top of a hill: the direction which points into the direction of steepest descent is the gradient!

Click through for an example in R.

Comments closed

Hyperparameter Tuning as Technical Debt

John Mount has an interesting take on hyperparameter tuning:

The hyper dance is the venial trick of pushing user facing technical debt and flaws as user controllable features. These controls are usually named “hyper parameters” and they are parameters or arguments that control the behavior of an algorithm. Users think “hyper parameters” must be even better than “regular parameters”, just like “hyper drive” is better than “sub-light drive.” However the etymology of the name isn’t from science fiction, it is just the need in statistical contexts to have a name for controls other than parameter, as parameter is often used to name the fit coefficients of a model (i.e. to name an output, not an input!).

In addition to this, I’d be concerned that heavy hyperparameter tuning could lead to a garden of forking paths problem where we end up accidentally doing the equivalent of p-hacking: modifying hyperparameters until we come up with the “right” answer.

Comments closed

The Nature of Overfitting

John Mount has a nice essay on overfitting:

What is meant by “overfitting” is: the estimated f() will tend to show off or over perform on the data used to fit, train, or construct it. I have some notes on this sort of selection bias here: https://win-vector.com/2020/12/10/overfit-and-reversion-to-mediocrity-the-bane-of-data-science/.

Selecting a model that “looks good” is enough to bias the model’s evaluation with respect to the data set we said it “looked good” on. So even when using unbiased methods, the data scientist can introduce bias by choosing to use one model (say the one fit by logistic regression) over another (say using using an observed prevalence everywhere as a probability prediction).

The way I talk about overfitting is to say that we’ve trained a model which latches onto the particulars of the training data set. To the extent that the particulars of the training data set are matched by the broader world, that’s “fitting.” To the extent that the particulars of the training data set are unique to that data set and are not generally applicable, that’s “overfitting.” Generally, I don’t have any more time to get into what this means, but John dives into the topic in an accessible way.

Comments closed

Basic Theory on Correlation Analysis, Using R

Petr Baranovskiy wants to take us through the key concepts of correlation analysis, starting with basic theory:

When I was learning statistics, I was surprised by how few learning materials I personally found to be clear and accessible. This might be just me, but I suspect I am not the only one who feels this way. Also, everyone’s brain works differently, and different people would prefer different explanations. So I hope that this will be useful for people like myself – social scientists and economists – who may need a simpler and more hands-on approach.

These series are based on my notes and summaries of what I personally consider some the best textbooks and articles on basic stats, combined with the R code to illustrate the concepts and to give practical examples. Likely there are people out there whose cognitive processes are similar to mine, and who will hopefully find this series useful.

This is clear and well-written, so check it out even if you feel like you have a solid understanding of the topic.

Comments closed

Bayesian Modeling of Holiday Behavior

Daniel Marthaler and Brian Coffey have an interesting post:

As the year unfolds, our demand fluctuates. Two big drivers of that fluctuation are seasonality and holidays. With the holiday season upon us, it’s a great time to describe how both seasonality and holiday effects can be estimated, and how you can use this formulation in a predictive time series model.

In this post, we describe the difference between seasonality and holiday effects, posit a general Bayesian Holiday Model, and show how that model performs on some Google Trends data.

Read the whole thing.

Comments closed