Category: Data Science

Principal Component Analysis With Faces

Mic at The Beginner Programmer shows us how to create creepy PCA diagrams with human faces:

PCA looks for a new reference system to describe your data. This new reference system is designed to maximize the variance of the data along the new axes. The first principal component accounts for as much variance as possible, the second for as much of the remaining variance as possible, and so on. PCA transforms a set of (typically) correlated variables into a set of uncorrelated variables called principal components. The hope is that a smaller number of PCs can be used to summarise the whole dataset. Note that PCs are linear combinations of the original variables.

The procedure simply boils down to the following steps:

  1. Scale (normalize) the data (not necessary but suggested especially when variables are not homogeneous).

  2. Calculate the covariance matrix of the data.

  3. Calculate eigenvectors (also, perhaps confusingly, called “loadings”) and eigenvalues of the covariance matrix.

  4. Choose only the first N biggest eigenvalues according to one of the many criteria available in the literature.

  5. Project your data into the new frame of reference by multiplying your data matrix by a matrix whose columns are the N eigenvectors associated with the N biggest eigenvalues.

  6. Use the projected data (very confusingly called “scores”) as your new variables for further analysis.
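The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code from the linked post: the function name `pca`, the toy data, and the choice of `n_components=2` are all assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project X (rows = observations, columns = variables) onto its top principal components."""
    # Step 1: scale each variable to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the scaled data (variables are in columns).
    C = np.cov(Z, rowvar=False)
    # Step 3: eigen-decomposition. eigh is intended for symmetric matrices
    # and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # Step 4: keep the n_components largest eigenvalues (a fixed count here;
    # the literature offers many other selection criteria).
    order = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues, loadings = eigenvalues[order], eigenvectors[:, order]
    # Step 5: project the data onto the chosen eigenvectors.
    scores = Z @ loadings
    # Step 6: the scores are the new variables for further analysis.
    return scores, loadings, eigenvalues

# Hypothetical toy data: three variables, the second strongly correlated with the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

scores, loadings, eigenvalues = pca(X, n_components=2)
print(scores.shape)                   # one row per observation, one column per kept PC
print(np.cov(scores, rowvar=False))   # off-diagonal entries are numerically zero
```

The covariance matrix of the scores is diagonal, which is the "uncorrelated variables" property mentioned in the excerpt, and its diagonal holds the retained eigenvalues.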

I like the explanations provided, and the data set is definitely something I’m not used to seeing with PCA.  H/T R-bloggers

Comments closed

Using Uncertainty For Model Interpretation

Yoel Zeldes and Inbar Naor explain how uncertainty can help you understand your models better:

One prominent example is that of high risk applications. Let’s say you’re building a model that helps doctors decide on the preferred treatment for patients. In this case we should not only care about the accuracy of the model, but also about how certain the model is of its prediction. If the uncertainty is too high, the doctor should take this into account.

Self-driving cars are another interesting example. When the model is uncertain if there is a pedestrian on the road we could use this information to slow the car down or trigger an alert so the driver can take charge.

Uncertainty can also help us with out of data examples. If the model wasn’t trained using examples similar to the sample at hand, it might be better if it’s able to say “sorry, I don’t know”. This could have prevented the embarrassing mistake Google Photos made when it misclassified African Americans as gorillas. Mistakes like that sometimes happen due to an insufficiently diverse training set.

The last usage of uncertainty, which is the purpose of this post, is as a tool for practitioners to debug their model. We’ll dive into this in a moment, but first, let’s talk about different types of uncertainty.

Interesting argument.

Naive Bayes In Python

Kislay Keshari explains the Naive Bayes algorithm and shows an implementation in Python:

Naive Bayes in the Industry

Now that you have an idea of what exactly Naive Bayes is and how it works, let’s see where it is used in the industry.

RSS Feeds

Our first industrial use case is News Categorization, or we can use the term ‘text classification’ to broaden the spectrum of this algorithm. News on the web is growing rapidly, and each news site has its own layout and categorization for grouping news. Companies use a web crawler to extract useful text from the HTML pages of news articles to construct a Full Text RSS. The contents of each news article are tokenized (categorized). In order to achieve better classification results, we remove the less significant words, i.e. stop words, from the document. We apply the naive Bayes classifier for classification of news content based on news code.
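As a rough illustration of that pipeline (tokenize, drop stop words, classify), here is a small multinomial Naive Bayes with add-one smoothing in plain Python. It is a sketch, not the implementation from the linked article: the stop-word list, the headlines, the category names, and the choice to ignore words never seen in training are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "to", "and", "for", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def train(docs):
    """docs is a list of (text, label) pairs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # assumption: skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training headlines.
docs = [
    ("Team wins the championship game", "sports"),
    ("Striker scores in the final match", "sports"),
    ("Stock markets rally on earnings", "business"),
    ("Central bank raises interest rates", "business"),
]
model = train(docs)
print(predict("Bank earnings rally", *model))
print(predict("The championship match", *model))
```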

It’s a good overview of the topic and a particular implementation in Python.  Naive Bayes is a technique which you want in the bag:  there are a lot of techniques which tend to be better in specific domains, but Naive Bayes is easy to implement and usually provides acceptable performance.

Bayesian Approaches To The Cold Start Problem

John Cook explains what you can do with data-driven applications when you don’t yet have the data:

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to design adaptive clinical trial designs, such as clinical trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the first patient in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?

Read on for John’s solution.

A Geometric Depiction Of Covariance

Nikolai Janakiev explains the concept of the covariance matrix using a bit of Python and some graphs:

In this article we saw the relationship of the covariance matrix with linear transformation, which is an important building block for understanding and using PCA, SVD, the Bayes Classifier, the Mahalanobis distance and other topics in statistics and pattern recognition. I found the covariance matrix to be a helpful cornerstone in the understanding of the many concepts and methods in pattern recognition and statistics.

Many of the matrix identities can be found in The Matrix Cookbook. The relationship between SVD, PCA and the covariance matrix is elegantly shown in this question.
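The link between the covariance matrix and a linear transformation can be checked numerically. The sketch below assumes uncorrelated unit-variance ("white") data and a hypothetical transformation T built from a scaling S followed by a rotation R; under those assumptions the covariance of the transformed data should be close to T Tᵀ. The angle, scale factors, and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# White data: two uncorrelated variables with unit variance (columns are samples).
white = rng.normal(size=(2, 200_000))

# Hypothetical transformation: scale the axes, then rotate by 30 degrees.
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([3.0, 0.5])
T = R @ S

transformed = T @ white

# np.cov treats rows as variables by default, which matches this layout.
C_empirical = np.cov(transformed)
C_expected = T @ T.T   # equals R S S R^T

print(np.round(C_empirical, 2))
print(np.round(C_expected, 2))

# The eigenvalues of the covariance matrix are the squared scale factors.
print(np.linalg.eigvalsh(C_expected))
```

Read the other way around, an eigen-decomposition of a covariance matrix recovers a rotation (the eigenvectors) and a scaling (the square roots of the eigenvalues), which is the geometric picture PCA relies on.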

Understanding covariance is critical for a number of statistical techniques, and this is a good way of describing it.

Calculating Cohort Lifetime Value With Excel And R

Eleni Markou shows how to calculate the lifetime value of a group of customers using two techniques:

A lot of ink has been spilled in developing various descriptions of the LTV, the majority of which ends up with mathematical formulas that are based on margin (m), retention rate (r) and discount rate (d) like the following (here):

However, this model appears to be not that realistic as it is based on a few quite restrictive assumptions:

  • Retention is assumed to be constant during the lifetime of a customer, i.e. the probability r of remaining retained remains the same across all months.
  • An infinite time horizon is assumed when calculating the present value of future cash flows.
  • The unit economics are supposed to be constant throughout lifetime which leads to a constant contribution margin.
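To make those three assumptions concrete, here is a small sketch of what they imply. It assumes one particular timing convention that is not taken from the article: margin m is collected at the end of each period from customers who are still retained, so the discounted sum over an infinite horizon collapses to m · r / (1 + d − r). The function names and the numbers are hypothetical.

```python
def ltv_closed_form(m, r, d):
    """LTV with constant margin m, constant retention r, discount rate d, infinite horizon.

    Assumes the sum over t = 1, 2, ... of m * r**t / (1 + d)**t.
    """
    return m * r / (1 + d - r)

def ltv_finite(m, retention_by_period, d):
    """Discounted margin over a finite horizon with a separate retention rate per period."""
    survival = 1.0
    total = 0.0
    for t, r_t in enumerate(retention_by_period, start=1):
        survival *= r_t                       # share of the cohort still active
        total += m * survival / (1 + d) ** t  # discounted margin for period t
    return total

# With constant retention and a very long horizon, the two agree.
print(ltv_closed_form(100, 0.8, 0.1))
print(ltv_finite(100, [0.8] * 500, 0.1))

# With retention that varies by period and a short horizon, they do not.
print(ltv_finite(100, [0.6, 0.7, 0.8], 0.1))
```

The second function relaxes the first two assumptions (constant retention and infinite horizon), which is the direction the excerpt argues for when working with real cohorts.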

Yet when dealing with an actual company, it easily becomes evident that none of the aforementioned conditions actually hold. Especially in early-stage businesses, the time periods across which you would like to calculate the LTV are month- or week-sized, while at the same time the retention rate across them can vary significantly as the company’s products evolve quickly.

There’s a lot packed into that article, so give it a read.

Exploratory Time Series Analysis

The authors at Knoyd have a post on exploratory data analysis of a time series data set:

From the plot above we can clearly see that the time series has strong seasonal and trend components. To estimate the trend component we can use a function from the pandas library called rolling_mean and plot the results. If we want to make the plot fancier and reusable for other time series, it is a good idea to make a function. We can call this function plot_moving_average.
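A note for anyone following along with a recent pandas: the top-level `pd.rolling_mean` function mentioned in the excerpt was removed in later pandas versions, and the same calculation is now written as `Series.rolling(window).mean()`. The sketch below shows only the trend calculation, not the plotting; the helper name `moving_average` and the sample values are made up for illustration.

```python
import pandas as pd

def moving_average(series, window):
    """Trailing moving average; the first window - 1 values are NaN."""
    return series.rolling(window=window).mean()

# Hypothetical daily values.
sales = pd.Series(
    [10, 12, 14, 13, 15, 18, 20],
    index=pd.date_range("2018-01-01", periods=7, freq="D"),
)

trend = moving_average(sales, window=3)
print(trend)
```

A larger window gives a smoother trend estimate at the cost of more missing values at the start of the series.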

The second part of the series promises to use Box-Jenkins to forecast future values.

Implementing K Nearest Neighbors In Python

Atul Harsha gives us a demo on k nearest neighbors in Python:

In order to make any predictions, you have to calculate the distance between the new point and the existing points, as you will be needing k closest points.

In this case, for calculating the distance, we will use the Euclidean distance. This is defined as the square root of the sum of the squared differences between the two arrays of numbers.

Specifically, we need only the first 4 attributes (features) for the distance calculation, as the last attribute is a class label. So one approach is to limit the Euclidean distance to a fixed length, thereby ignoring the final dimension.
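The distance calculation and the "fixed length" trick can be sketched as follows. This is a minimal plain-Python illustration, not the code from the linked demo: the function names, the majority-vote rule, and the sample rows (four numeric features followed by a label) are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean_distance(a, b, length):
    """Square root of the summed squared differences over the first `length` values."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(length)))

def k_nearest_neighbors(train, query, k, n_features=4):
    """Predict a label by majority vote among the k closest training rows."""
    # Only the first n_features values are compared, so the label in the
    # last position of each training row is ignored.
    neighbors = sorted(
        train, key=lambda row: euclidean_distance(row, query, n_features)
    )[:k]
    votes = Counter(row[-1] for row in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical rows: four features, then a class label.
train = [
    [5.1, 3.5, 1.4, 0.2, "A"],
    [4.9, 3.0, 1.4, 0.2, "A"],
    [7.0, 3.2, 4.7, 1.4, "B"],
    [6.4, 3.2, 4.5, 1.5, "B"],
    [6.3, 3.3, 6.0, 2.5, "B"],
]
query = [5.0, 3.4, 1.5, 0.2]

print(euclidean_distance([0, 0, 0, 0, "A"], [3, 4, 0, 0, "B"], 4))
print(k_nearest_neighbors(train, query, k=3))
```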

Check it out.

Explaining Text Classification Models With LIME

Shirin Glander shows us how to use LIME to explain which words help us classify whether a user liked a particular item:

Okay, not a perfect score but good enough for me – right now, I’m more interested in the explanations of the model’s predictions. For this, we need to run the lime() function and give it

  • the text input that was used to construct the model
  • the trained model
  • the preprocessing function

explainer <- lime(clothing_reviews_train$text, xgb_model, preprocess = get_matrix)

With this, we could right away call the interactive explainer Shiny app, where we can type any text we want into the field on the left and see the explanation on the right: words that are underlined green support the classification, red words contradict it.

I hadn’t used LIME for this before, and it looks very interesting.  H/T R-bloggers

Visualizing Linear Regression Results

Bernardo Lares gives us a few ways of visually interpreting a linear regression result in R:

The most obvious plot to study for a linear regression model, you guessed it, is the regression itself. If we plot the predicted values vs the real values we can see how close they are to our reference line of 45° (intercept = 0, slope = 1). If we’d had a very sparse plot where we can see no clear tendency over that line, then we have a bad regression. On the other hand, if we have all our points over the line, I bet you gave the model your wished results!

Then, the Adjusted R-Squared on the plot gives us an easy parameter for comparing models and how well each one fits our reference line. The nearer this value gets to 1, the better. Without getting too technical, if you add more and more useless variables to a model, this value will decrease; but if you add useful variables, the Adjusted R-Squared will improve.

We also get the RMSE and MAE (Root-Mean Squared Error and Mean Absolute Error) for our regression’s results. MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. On the other side we have RMSE, which is a quadratic scoring rule that also measures the average magnitude of the error. It’s the square root of the average of squared differences between prediction and actual observation. Both metrics can range from 0 to ∞ and are indifferent to the direction of errors. They are negatively-oriented scores, which means lower values are better.
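The three numbers discussed in the excerpt are easy to compute by hand. The sketch below uses plain Python rather than the R code from the linked post, and the function names and sample values are hypothetical. Adjusted R-Squared is computed with the usual formula 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the errors, ignoring direction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-Mean Squared Error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def adjusted_r2(actual, predicted, n_predictors):
    """R-Squared penalized for the number of predictors in the model."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# Hypothetical observed values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.5, 5.5, 7.0, 9.5, 10.0]

print(mae(actual, predicted))
print(rmse(actual, predicted))
print(adjusted_r2(actual, predicted, n_predictors=1))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and grows faster when a few predictions are far off.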

I like this approach to explaining models.
