
Category: Data Science

R And Python: Two Growing Languages

David Smith notes that R is growing just as quickly as Python:

Python has been getting some attention recently for its impressive growth in usage. Since both R and Python are used for data science, I sometimes get asked if R is falling by the wayside, or if R developers should switch course and learn Python. My answer to both questions is no.

First, while Python is an excellent general-purpose data science tool, for applications where comparative inference and robust predictions are the main goal, R will continue to be the prime repository of validated statistical functions and cutting-edge research for a long time to come. Secondly, R and Python are both top-10 programming languages, and while Python has a larger userbase, R and Python are both growing rapidly — and at similar rates.

I had a discussion about this last night.  I like the language diversity:  R is more statistician-oriented, whereas Python is more developer-oriented.  They both can solve the same set of problems, but there are certainly cases where one beats the other.  I think Python will end up being the more popular language for data science because of the number of application developers moving into the space, but for the data analysts and academicians moving to this field, R will likely remain the more interesting language.

Comments closed

ANOVA

Mala Mahadevan explains what ANOVA is and why it’s interesting:

ANOVA, or analysis of variance, is a term given to a set of statistical models used to analyze differences among groups and to determine whether those differences are statistically significant. The models were developed by the statistician and evolutionary biologist Ronald Fisher. To give a very simplistic definition, ANOVA is an extension of the two-sample t-test to more than two groups.

ANOVA is an older test and a fairly simple process, but is quite useful to understand.
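
If you want to try it yourself, here is a minimal sketch in R on simulated data, using the built-in aov() function to compare three group means:

    # Simulate three groups with slightly different means.
    set.seed(42)
    scores <- data.frame(
      group = rep(c("A", "B", "C"), each = 30),
      value = c(rnorm(30, mean = 10), rnorm(30, mean = 12), rnorm(30, mean = 10.5))
    )

    fit <- aov(value ~ group, data = scores)
    summary(fit)  # the F test asks whether any group mean differs from the others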

Comments closed

Neural Nets Optimizing For Imperfect

John Cook describes a paradox with neural nets:

Deep neural networks have enough parameters to overfit the data, but there are various strategies to keep this from happening. A common way to avoid overfitting is to deliberately do a mediocre job of fitting the model.

When it works well, the shortcomings of the optimization procedure yield a solution that differs from the optimal solution in a beneficial way. But the solution could fail to be useful in several ways. It might be too far from optimal, or deviate from the optimal solution in an unhelpful way, or the optimization method might accidentally do too good a job.

Conceptually, this feels a little weird but isn't really much of a problem, as we have other analogues:  rational ignorance in economics (where we knowingly choose not to learn something because the benefit is not worth the opportunity cost), the OPTIMIZE FOR UNKNOWN hint in SQL Server (where we knowingly ignore the passed-in parameter because we might otherwise get stuck with a worse plan), etc.  But the specific process here is interesting.
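
As a rough illustration of the idea (my own sketch with simulated data and the nnet package, not John's example), fitting the same small network twice and cutting the optimizer off early the first time tends to give a smoother, more useful fit than letting it grind toward the optimum:

    library(nnet)

    set.seed(42)
    x <- seq(0, 1, length.out = 100)
    y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
    d <- data.frame(x, y)

    # Plenty of parameters relative to the data; maxit controls how hard we optimize.
    fit_short <- nnet(y ~ x, data = d, size = 20, linout = TRUE, maxit = 30,   trace = FALSE)
    fit_long  <- nnet(y ~ x, data = d, size = 20, linout = TRUE, maxit = 5000, trace = FALSE)

    plot(x, y, pch = 19, col = "grey60")
    lines(x, predict(fit_short, d), col = "blue", lwd = 2)  # deliberately "mediocre" fit
    lines(x, predict(fit_long, d),  col = "red",  lwd = 2)  # optimized much harder, chases the noise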

Comments closed

Imbalanced Data In R

Rathnadevi Manivannan explains how to deal with imbalanced data using R:

Imbalanced data refers to classification problems where one class substantially outnumbers the other. Imbalanced classification occurs more frequently in binary classification than in multi-class classification. For example, extreme imbalance shows up in banking or financial data, where the vast majority of credit card transactions are legitimate and very few are fraudulent.

With an imbalanced dataset, an algorithm doesn't see enough examples of the minority class to make accurate predictions about it, so it is recommended to rebalance the dataset before classification.

Rathnadevi uses fraudulent transactions for his sample, but medical diagnosis is also a good example:  suppose 1 person in 10,000 has a particular disease.  You’re 99.99% right if you just say nobody has the disease, but that’s a rather unhelpful model.
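
As a quick taste of one simple remedy, here is a base R sketch on simulated data that upsamples the minority class before fitting a logistic regression; Rathnadevi's post covers more sophisticated techniques, so treat this as the bare-bones idea:

    set.seed(42)
    n <- 10000
    fraud  <- rbinom(n, 1, 0.01)                          # roughly 1% of transactions are fraudulent
    amount <- rnorm(n, mean = 50 + 40 * fraud, sd = 20)   # a single illustrative feature
    d <- data.frame(fraud = factor(fraud), amount)

    minority <- d[d$fraud == "1", ]
    majority <- d[d$fraud == "0", ]

    # Upsample the minority class with replacement so both classes are the same size.
    balanced <- rbind(majority,
                      minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])

    fit <- glm(fraud ~ amount, data = balanced, family = binomial)
    summary(fit)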

Comments closed

Sentiment Analysis In R

Rachael Tatman has a great tutorial introducing sentiment analysis in R:

By the end of this tutorial you will:

  • Understand what sentiment analysis is and how it works
  • Read text from a dataset & tokenize it
  • Use a sentiment lexicon to analyze the sentiment of texts
  • Visualize the sentiment of text

If you’re the hands-on type, you might want to head directly to the notebook for this tutorial. You can fork it and have your very own version of the code to run, modify and experiment with as we go along.

Check it out.  There’s a lot more to sentiment analysis—cleaning and tokenizing words, getting context right, etc.—but this is a very nice introduction.
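
If you want a taste of the workflow without opening the notebook, here is a tiny sketch using the tidytext package and the Bing lexicon on made-up sentences (the tutorial's own code may differ):

    library(dplyr)
    library(tidytext)

    docs <- tibble(id = 1:3,
                   text = c("I love this fantastic product",
                            "This was a terrible, awful experience",
                            "It arrived on Tuesday"))

    docs %>%
      unnest_tokens(word, text) %>%                        # one row per word, lowercased
      inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words in the lexicon
      count(id, sentiment)                                 # positive/negative counts per document
    # Note that document 3 drops out entirely: none of its words carry sentiment.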

Comments closed

Fun With The Beta Distribution

John D. Cook shows how one chaotic equation just happens to follow a beta distribution:

Indeed the points do bounce all over the unit interval, though they more often bounce near one of the ends.

Does that distribution look familiar? You might recognize it from Bayesian statistics. It’s a beta distribution. It’s symmetric, so the two beta distribution parameters are equal. There’s a vertical asymptote on each end, so the parameters are less than 1. In fact, it’s a beta(1/2, 1/2) distribution. It comes up, for example, as the Jeffreys prior for Bernoulli trials.

The graph below adds the beta(1/2, 1/2) density to the histogram to show how well it fits.

It’s an interesting bit of math and statistics, and John provides some Python demo code at the end.
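
If you'd rather stay in R, here is a quick analogue of the experiment, assuming the chaotic map in question is the fully chaotic logistic map x -> 4x(1 - x), whose invariant distribution is beta(1/2, 1/2):

    # Iterate the map and compare the histogram of iterates to the beta(1/2, 1/2) density.
    set.seed(1)
    n <- 100000
    x <- numeric(n)
    x[1] <- runif(1)
    for (i in 2:n) x[i] <- 4 * x[i - 1] * (1 - x[i - 1])

    hist(x, breaks = 50, freq = FALSE, main = "Iterates vs. beta(1/2, 1/2) density")
    curve(dbeta(x, 0.5, 0.5), add = TRUE, col = "red", lwd = 2)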

Comments closed

Explaining Confidence Intervals

Mala Mahadevan explains what confidence intervals are:

Suppose I look at a sample of 100 Americans who are asked if they approve of the job the Supreme Court is doing. Let us say for simplicity’s sake that the only two answers possible are yes or no. Out of 100, say 40% say yes. As an ordinary person, you would simply think 40% of people approve. But a deeper answer would be: the true proportion of Americans who approve of the job the Supreme Court is doing is between x% and y%.

How confident am I that it is? About z% (the most commonly used value is 95%). That answer better reflects the uncertainty involved in questioning people and treating their answers as truly representative of opinion. The x and y values make up what is called a ‘confidence interval’.

Read the whole thing.
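
For the numbers in the quoted example, R will happily compute the interval:

    # 40 "yes" answers out of 100, at the conventional 95% confidence level.
    prop.test(40, 100, conf.level = 0.95)$conf.int
    # roughly 0.30 to 0.50: the x% and y% in Mala's explanation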

Comments closed

Introduction To Bayesian Statistics

Kennie Nybo Pontoppidan has just completed a course on Bayesian statistics:

Last month I finished a four-week course on Bayesian statistics. I have always wondered why people deemed it hard, and why I heard that the computations quickly became complicated. The course wasn’t that hard, and it gave a nice introduction to prior/posterior distributions and, in many cases, how to interpret the parameters in the prior distribution as extra data points.

An interesting aspect of Bayesian statistics is that it is a mathematically rigorous model, with no magic numbers such as the 5% threshold for p-values. And I like the way it naturally caters to sequential hypothesis testing, where the sample size of each iteration is not fixed in advance. Instead, data are evaluated and used to update the model as they are collected.

Check out Kennie’s explanation as well as the course.  I also went through Bayes’ Theorem not too long ago, which is a good introduction to the topic if you’re unfamiliar with Bayes’s Law.
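
As a tiny illustration of the sequential updating Kennie mentions, here is a beta-binomial sketch in R where the posterior after each batch of data becomes the prior for the next:

    prior <- c(a = 1, b = 1)                      # uniform beta(1, 1) prior
    batches <- list(c(successes = 3, failures = 7),
                    c(successes = 6, failures = 4))

    for (batch in batches) {
      prior["a"] <- prior["a"] + batch["successes"]
      prior["b"] <- prior["b"] + batch["failures"]
    }

    prior                                   # posterior after both batches: beta(10, 12)
    prior["a"] / (prior["a"] + prior["b"])  # posterior mean estimate of the rate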

Comments closed

Time-Varying Models

Lingrui Gan explains how to model parameters whose effects change over time:

We can frame conversion prediction as a binary classification problem, with outcome “1” when the visitor converts, and outcome “0” when they do not. Suppose we build a model to predict conversion using site visitor features. Some examples of relevant features are: time of day, geographical features based on a visitor’s IP address, their device type, such as “iPhone”, and features extracted from paid ads the visitor interacted with online.

A static classification model, such as logistic regression, assumes the influence of all features is stable over time, in other words, the coefficients in the model are constants. For many applications, this assumption is reasonable—we wouldn’t expect huge variations in the effect of a visitor’s device type. In other situations, we may want to allow for coefficients that change over time—as we better optimize our paid ad channel, we expect features extracted from ad interactions to be more influential in our prediction model.

Read on for more.
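
As a crude way to see the time-varying coefficient idea in action (the post describes a proper dynamic model, so this is only a simplified sketch on simulated data), you can let an effect drift over time and have a plain logistic regression pick it up through an interaction with time:

    set.seed(42)
    n    <- 5000
    week <- sample(1:20, n, replace = TRUE)
    ad   <- rbinom(n, 1, 0.5)             # did the visitor interact with a paid ad?
    beta_ad <- 0.1 + 0.05 * week          # the ad effect grows over time
    conv <- rbinom(n, 1, plogis(-2 + beta_ad * ad))

    static  <- glm(conv ~ ad,        family = binomial)  # assumes a constant ad effect
    varying <- glm(conv ~ ad * week, family = binomial)  # lets the ad effect change with time

    coef(static)["ad"]                          # one averaged-out coefficient
    summary(varying)$coefficients["ad:week", ]  # recovers the drift in the ad effect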

Comments closed

Analyzing Clickstream Data With Markov Chains

Eleni Markou shows one method of analyzing clickstream data:

We chose to use the third-order Markov Chain on the above-produced data, as:

  • The number of parameters needed for the chain’s representation remains manageable. As the order increases, the parameters necessary for the representation increase exponentially and thus managing them requires significant computational power.
  • As a rule of thumb, we would like at least half of the clickstreams to consist of as many clicks as the order of the Markov Chain that should be fitted. There is no point in selecting a third-order chain if the majority of the clickstream consists of two states and so there is no state three steps behind to take into consideration.

Fitting the Markov Chain model gives us transition probabilities matrices and the lambda parameters of the chain for each one of the three lags, along with the start and end probabilities.

This particular analysis is trying to understand which page (if any) a user will go to next when on a particular page.  Eleni uses additional techniques like k-means clustering to segment out particular groups of users.  Very interesting analysis.
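
To make the core idea concrete, here is a minimal base R sketch that estimates a first-order transition matrix from a few toy clickstreams; Eleni fits a third-order chain with dedicated tooling, so this is just the simplest version of the concept:

    sessions <- list(c("home", "products", "cart", "checkout"),
                     c("home", "blog", "home", "products"),
                     c("products", "cart", "home"))

    # Turn each session into (current page, next page) pairs.
    pairs <- do.call(rbind, lapply(sessions, function(s) {
      data.frame(from = head(s, -1), to = tail(s, -1))
    }))

    # Rows: current page; columns: next page; entries: estimated transition probabilities.
    transitions <- prop.table(table(pairs$from, pairs$to), margin = 1)
    round(transitions, 2)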

Comments closed