
Category: Data Science

Learning Naive Bayes

Sunil Ray explains the Naive Bayes algorithm:

What are the Pros and Cons of Naive Bayes?


Pros:

  • It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
  • When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
  • It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

Cons:

  • If a categorical variable has a category in the test data set which was not observed in the training data set, then the model will assign a 0 (zero) probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.

  • On the other hand, Naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.

  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible that we get a set of predictors which are completely independent.

Read the whole thing.  Naive Bayes is such an easy algorithm, yet it works remarkably well for categorization problems.  It’s typically not the best solution, but it’s a great first solution.  H/T Data Science Central
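
To make the “Zero Frequency” point above concrete, here is a minimal sketch of additive (Laplace) smoothing for a single categorical feature within a single class. The function name, the weather categories, and the counts are all made up for illustration; this is not code from the linked article.

```python
from collections import Counter

def category_likelihoods(values, categories, alpha=1.0):
    """Estimate P(category | class) from observed values with additive smoothing.

    alpha=0 gives raw relative frequencies; alpha=1 is Laplace (add-one) estimation.
    """
    counts = Counter(values)
    total = len(values) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

# Hypothetical training values of one categorical feature within one class.
observed = ["sunny", "sunny", "rainy", "sunny"]
categories = ["sunny", "rainy", "overcast"]  # "overcast" never appears in training

unsmoothed = category_likelihoods(observed, categories, alpha=0.0)
smoothed = category_likelihoods(observed, categories, alpha=1.0)

print(unsmoothed["overcast"])  # zero: any product of likelihoods collapses to 0
print(smoothed["overcast"])    # small but non-zero after smoothing
```

With smoothing, the unseen category gets a small non-zero likelihood, so the product of per-feature likelihoods no longer collapses to zero.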

Comments closed

Text Featurizing With Microsoft R Server

David Smith has a post summarizing sentiment analysis with Microsoft R Server:

Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews of 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star rated reviews in the data set.) Here’s one example, selected at random:

What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It’s great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.

The review contains many positive terms (“useful”, “indispensable”, “highly recommended”), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in the MicrosoftML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract ngrams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use their presence or absence as predictors in a linear model:

If you’re thinking about sentiment analysis, read the whole thing.
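
For readers who want to see the idea without R Server, here is a rough plain-Python sketch of turning text into presence/absence flags for one- and two-word sequences. It is a stand-in for the concept, not the featurizeText function itself; the helper names and the three toy reviews are invented, and the frequency cutoff counts the number of reviews containing an n-gram, which is an assumption on my part.

```python
from collections import Counter

def ngrams(text, n_max=2):
    """Return the set of 1..n_max word sequences present in a text."""
    words = text.lower().split()
    found = set()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            found.add(" ".join(words[i:i + n]))
    return found

def presence_features(texts, min_docs):
    """Keep n-grams found in at least min_docs texts; encode each text as 0/1 flags."""
    per_text = [ngrams(t) for t in texts]
    doc_counts = Counter(g for grams in per_text for g in grams)
    vocab = sorted(g for g, c in doc_counts.items() if c >= min_docs)
    rows = [[1 if g in grams else 0 for g in vocab] for grams in per_text]
    return vocab, rows

# Toy reviews, invented for illustration.
reviews = ["highly recommended book", "not recommended", "highly recommended"]
vocab, rows = presence_features(reviews, min_docs=2)
print(vocab)
print(rows)
```

Each row of flags could then serve as the predictors in a linear model of the star rating, which is the step the quoted post goes on to describe.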


R Versus Python

Vincent Granville believes that Python is overtaking R in the realm of data science:

We use the app in question to compare search interest for R Data Science versus Python Data Science; see the chart above.  It looks like until December 2016, R dominated, but fell below Python by early 2017. The above chart displays an interest index, 100 being maximum and 0 being minimum. Click here to access this interactive chart on Google, and check the results for countries other than the US, or even for specific regions such as California or New York.

Note that Python always dominated R by a long shot, because it is a general-purpose language, while R is a specialized language. But here, we compare R and Python in the niche context of data science. The map below shows interest for Python (general purpose) per region, using the same Google index in question.

It’s an interesting look at the relative shift between R and Python as a primary language for statistical analysis.


Tokenizing Text With R

Rachael Tatman shows how to tokenize a set of text as the first step in a natural language processing experiment:

In this tutorial you’ll learn how to:

  • Read text into R
  • Select only certain lines
  • Tokenize text using the tidytext package
  • Calculate token frequency (how often each token shows up in the dataset)
  • Write reusable functions to do all of the above and make your work reproducible

For this tutorial we’ll be using a corpus of transcribed speech from bilingual children speaking in English.  You can find more information on this dataset and download it here.

It’s a nice tutorial, especially because the data set is a bit of a mess.
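
The tutorial itself uses R and tidytext; purely as an illustration of the same steps (select lines, tokenize, count), here is a small Python sketch. The "*CHI:" speaker prefix and the sample lines are hypothetical and are not taken from the dataset.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def token_frequencies(lines, prefix):
    """Count tokens over only those lines that start with the given prefix."""
    counts = Counter()
    for line in lines:
        if line.startswith(prefix):
            # Drop the prefix so the speaker tag is not counted as a token.
            counts.update(tokenize(line[len(prefix):]))
    return counts

# Hypothetical transcript lines; only the child's lines are kept.
lines = ["*CHI: the dog ran", "*EXP: what did the dog do", "*CHI: The dog jumped"]
freq = token_frequencies(lines, prefix="*CHI:")
print(freq.most_common())
```

Wrapping the steps in functions is the same reproducibility point the tutorial makes: the selection rule and the tokenizer can be swapped without touching the counting code.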


One-Way ANOVA Testing With R

Bidyut Ghosh shows how to perform a one-way ANOVA test in R:

From the above results, it is observed that the F-statistic value is 17.94 and it is highly significant, as the corresponding p-value is much less than the level of significance (1% or 0.01). Thus, it is wise to reject the null hypothesis of equal mean mileage across all the tyre brands. In other words, the average mileages of the four tyre brands are not all equal.
Now you have to find out which pairs of brands differ. For this you may use Tukey’s HSD test.

ANOVA is a fairly simple test, but it can be quite useful to know.
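
To show where an F-statistic like the one quoted above comes from, here is a minimal sketch that computes it by hand as the ratio of between-group to within-group mean squares. The three samples are made-up numbers for hypothetical brands, not the article's tyre data, so the result will not match 17.94.

```python
def one_way_anova_f(groups):
    """Return the F-statistic for a one-way ANOVA over a list of samples."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up samples for three hypothetical brands (not the article's data).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F means the group means differ by more than the spread within groups would explain; the p-value then comes from the F distribution with (k - 1, n - k) degrees of freedom.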


Simpson’s Paradox Explained

Mehdi Daoudi, et al., have a nice explanation of Simpson’s Paradox:

E.H. Simpson first described the phenomenon of Simpson’s paradox in 1951. The actual name “Simpson’s paradox” was introduced by Colin R. Blyth in 1972. Blyth mentioned that:

G.W. Haggstrom pointed out that Simpson’s paradox is the simplest form of the false correlation paradox in which the domain of x is divided into short intervals, on each of which y is a linear function of x with large negative slope, but these short line segments get progressively higher to the right, so that over the whole domain of x, the variable y is practically a linear function of x with large positive slope.

The authors also provide a helpful example with operational metrics, showing how aggregating the data leads to an opposite (and invalid) conclusion.
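
The picture Haggstrom describes is easy to reproduce with a few invented points: within each short interval y falls as x rises, yet each interval sits higher than the last, so the pooled slope is positive. The data below is synthetic and chosen only to make the effect obvious.

```python
def slope(points):
    """Ordinary least-squares slope of y on x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx

# Three short intervals. Within each, y falls by 2 per unit of x,
# but each interval starts 10 higher than the previous one.
segments = [
    [(k + 0.0, 10.0 * k + 2.0), (k + 0.5, 10.0 * k + 1.0), (k + 1.0, 10.0 * k)]
    for k in range(3)
]

within = [slope(seg) for seg in segments]
pooled = slope([p for seg in segments for p in seg])
print(within)  # negative slope inside every segment
print(pooled)  # positive slope once the segments are pooled
```

Which slope is the "right" one depends on whether the grouping variable belongs in the analysis, which is the judgment call the paradox forces.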


Gradient Boosting In R

Anish Singh Walia walks us through a gradient boosting exercise using R:

An important thing to remember in boosting is that the base learner being boosted should not be a complex learner with high variance, e.g. a neural network with lots of nodes and high weight values. For such learners, boosting will have the opposite effect.

So I will explain boosting with respect to decision trees in this tutorial, because they can be regarded as weak learners most of the time. We will generate a gradient boosting model.

Click through for more details.  H/T R-Bloggers
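
As a rough illustration of boosting weak tree learners, here is a bare-bones gradient boosting loop for squared loss using one-split stumps on a single feature. It is a teaching sketch in plain Python, not the R code from the tutorial; the data, the number of rounds, and the learning rate are arbitrary choices.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    return best[1:]

def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lmean, rmean = fit_stump(xs, residuals)
        stumps.append((t, lmean, rmean))
        preds = [p + learning_rate * (lmean if x <= t else rmean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Arbitrary toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
base, stumps, preds = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, preds))
```

Each stump on its own is a poor model, which is the point: the ensemble improves by repeatedly correcting what the previous stumps got wrong.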


Evaluating A Data Science Project

Tom Fawcett gives us an interesting evaluation of a data science case study:

The model is a fully connected neural network with three hidden layers, with a ReLU as the activation function. They state that data from Google Compute Engine was used to train the model (implemented in TensorFlow), and Cloud Machine Learning Engine’s HyperTune feature was used to tune hyperparameters.

I have no reason to doubt their representation choices or network design, but one thing looks odd. Their output is two ReLU (rectifier) units, each emitting the network’s accuracy (technically: recall) on that class. I would’ve chosen a single Softmax unit representing the probability of Large Loss driver, from which I could get a ROC or Precision-Recall curve. I could then threshold the output to get any achievable performance on the curve. (I explain the advantages of scoring over hard classification in this post.)

But I’m not a neural network expert, and the purpose here isn’t to critique their network design, just their general approach. I assume they experimented and are reporting the best performance they found.

Read the whole thing.
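
Fawcett's point about scoring versus hard classification can be sketched in a few lines: given a score per example, every threshold yields one operating point on the ROC curve. The scores and labels below are invented, and the function is a hypothetical helper rather than anything from the case study.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) for each score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Predict positive whenever the score reaches the threshold.
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Invented scores (higher means more likely positive) and true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
```

A model that only emits hard class labels gives you a single one of these points, with no way to trade false positives against false negatives afterwards.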


Regression Trees And Double Seasonal Time Series Trends

Peter Laurinec walks us through an example of using regression trees to solve a problem with double-seasonal time series data in R:

Classification and regression trees (or decision trees) are a broadly used machine learning method for modeling. They are popular because of these factors:

  • simple to understand (white box)
  • from a tree we can extract interpretable results and make simple decisions
  • they are helpful for exploratory analysis as binary structure of tree is simple to visualize
  • very good prediction accuracy performance
  • very fast
  • they can be simply tuned by ensemble learning techniques

But! There is always some “but”: they adapt poorly when new, unexpected situations (values) appear. In other words, they cannot detect and adapt to change or concept drift well (absolutely not). This is due to the fact that a tree creates just simple rules based on training data during learning. A simple decision tree does not compute any regression coefficients like linear regression, so trend modeling is not possible. You would ask now, so why are we talking about time series forecasting together with regression trees, right? I will explain how to deal with it in more detail further in this post.

This was a very interesting article.  Absolutely worth reading.  H/T R-Bloggers
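
The claim that a plain regression tree cannot model a trend is easy to check with a toy one-feature tree: its leaves store averages of training targets, so any input beyond the training range gets the same answer as the largest training input. This is a simplified sketch in Python, not the R code from the article, and the data is a made-up straight line.

```python
def build_tree(xs, ys, depth):
    """Fit a 1-D regression tree; leaves store the mean of their training targets."""
    if depth == 0 or len(set(xs)) < 2:
        return sum(ys) / len(ys)
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t)
    t = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (t,
            build_tree([x for x, _ in left], [y for _, y in left], depth - 1),
            build_tree([x for x, _ in right], [y for _, y in right], depth - 1))

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's stored mean."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

xs = list(range(10))
ys = [2.0 * x for x in xs]  # pure upward trend, made up for illustration
tree = build_tree(xs, ys, depth=3)
print(predict(tree, 9), predict(tree, 100))  # same value: the trend is not extended
```

That flat extrapolation is why the trend has to be handled outside the tree, for instance by removing it before fitting and adding it back afterwards.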


K Nearest Cliques

Vincent Granville explains an algorithm built around finding cliques of data points:

The cliques considered here are defined by circles (in two dimensions) or spheres (in three dimensions.) In the most basic version, we have one clique for each cluster, and the clique is defined as the smallest circle containing a pre-specified proportion p of the points from the cluster in question. If the clusters are well separated, we can even use p = 1. We define the density of a clique as the number of points per unit area. In general, we want to build cliques with high density.

Ideally, we want each cluster in the training set to be covered by a small number of (possibly slightly overlapping) cliques, each one having a high density.  Also, as a general rule, a training set point can only belong to one clique, and (ideally) to only one cluster. But the circles associated with two cliques are allowed to overlap.

It’s an interesting approach, and I can see how it’d be faster than K Nearest Neighbors, but I do wonder how accurate the results would be in comparison to KNN.
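
To get a feel for the density definition, here is a small sketch that builds one circle per cluster and reports points per unit area. Note the simplification: the circle is centred on the cluster centroid, which is an approximation I chose for brevity and not the smallest enclosing circle the quoted text calls for. The function name and sample points are hypothetical.

```python
import math

def clique(points, p=1.0):
    """Approximate a clique as a centroid-centred circle covering a share p of the points."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    # Distances from the centroid, nearest first.
    dists = sorted(math.hypot(x - cx, y - cy) for x, y in points)
    covered = max(1, math.ceil(p * len(points)))
    radius = dists[covered - 1]
    # Density as defined in the quote: points per unit area.
    # (A zero radius, e.g. a single point at the centroid, would need special handling.)
    density = covered / (math.pi * radius ** 2)
    return (cx, cy), radius, density

# Hypothetical cluster: the four corners of a 2 x 2 square.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center, radius, density = clique(cluster, p=1.0)
print(center, radius, density)
```

Lowering p shrinks the circle toward the dense core of the cluster, which raises the density at the cost of leaving outlying points uncovered.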
