Unintentional Data

Eric Hollingsworth describes data science as the cost of collecting data approaches zero:

Thankfully not only have modern data analysis tools made data collection cheap and easy, they have made the process of exploratory data analysis cheaper and easier as well. Yet when we use these tools to explore data and look for anomalies or interesting features, we are implicitly formulating and testing hypotheses after we have observed the outcomes. The ease with which we are now able to collect and explore data makes it very difficult to put into practice even basic concepts of data analysis that we have learned — things such as:

  • Correlation does not imply causation.
  • When we segment our data into subpopulations by characteristics of interest, members are not randomly assigned (rather, they are chosen deliberately) and suffer from selection bias.
  • We must correct for multiple hypothesis tests.
  • We ought not dredge our data.

All of those principles are well known to statisticians, and have been so for many decades. What is newer is just how cheap it is to posit hypotheses. For better and for worse, technology has led to a democratization of data within organizations. More people than ever are using statistical analysis packages and dashboards, explicitly or more often implicitly, to develop and test hypotheses.

This is a thoughtful essay well worth reading.

Measuring Semantic Relatedness

Sandipan Dey re-works a university assignment on semantic relatedness in Python:

Let’s define the semantic relatedness of two WordNet nouns x and y as follows:

  • A = set of synsets in which x appears
  • B = set of synsets in which y appears
  • distance(x, y) = length of shortest ancestral path of subsets A and B
  • sca(x, y) = a shortest common ancestor of subsets A and B

This is the notion of distance that we need to use to implement the distance() and sca() methods in the WordNet data type.

It looks like a helpful assignment for understanding natural language processing a little better.

Linear Discriminant Analysis

Jake Hoare explains Linear Discriminant Analysis:

Linear Discriminant Analysis takes a data set of cases (also known as observations) as input. For each case, you need to have a categorical variable to define the class and several predictor variables (which are numeric). We often visualize this input data as a matrix, such as shown below, with each case being a row and each variable a column. In this example, the categorical variable is called “class” and the predictive variables (which are numeric) are the other columns.

Following this is a clear example of how to use LDA.  This post is also the second time this week somebody has suggested The Elements of Statistical Learning, so I probably should make time to look at the book.

Bayesian Nonparametric Models

Luba Belokon asked Vadim Smolyakov to explain Bayesian Nonparametric models and here’s the result:

Bayesian Nonparametrics are a class of models for which the number of parameters grows with data. A simple example is non-parametric K-means clustering [1]. Instead of fixing the number of clusters K, we let data determine the best number of clusters. By letting the number of model parameters (cluster means and covariances) grow with data, we are better able to describe the data as well as generate new data given our model.

Of course, to avoid over-fitting, we penalize the number of clusters K via a regularization parameter which controls the rate at which new clusters are created. 

This is an interesting discussion of the Dirichlet process, particularly as applied to K-mean clustering.  It helps you figure out your best choice for K, no small task.

Linear Regression With Deducer

Sunil Kappal demonstrates how to use Deducer, a GUI for R, to perform a simple linear regression:

Selecting the variables in the Deducer GUI:

  • Outcome variable: Y, or the dependent variable, should be put on this list

  • As numeric: Independent variables that should be treated as covariates should be put in this section. Deducer automatically converts a factor into a numeric variable, so make sure that the order of the factor level is correct

  • As factor: Categorically independent variables (language, ethnicity, etc.).

  • Weights: This option allows the users to apply sampling weights to the regression model.

  • Subset: Helps to define if the analysis needs to be done within a subset of the whole dataset.

Deducer is open source and looks like a pretty decent way of seeing what’s available to you in R.

R And Python: Two Growing Languages

David Smith notes that as fast as Python is growing, R is as well:

Python has been getting some attention recently for its impressive growth in usage. Since both R and Python are used for data science, I sometimes get asked if R is falling by the wayside, or if R developers should switch course and learn Python. My answer to both questions is no.

First, while Python is an excellent general-purpose data science tool, for applications where comparative inference and robust predictions are the main goal, R will continue to be the prime repository of validated statistical functions and cutting-edge research for a long time to come. Secondly, R and Python are both top-10 programming languages, and while Python has a larger userbase, R and Python are both growing rapidly — and at similar rates.

I had a discussion about this last night.  I like the language diversity:  R is more statistician-oriented, whereas Python is more developer-oriented.  They both can solve the same set of problems, but there are certainly cases where one beats the other.  I think Python will end up being the more popular language for data science because of the number of application developers moving into the space, but for the data analysts and academicians moving to this field, R will likely remain the more interesting language.


Mala Mahadevan explains what ANOVA is and why it’s interesting:

ANOVA – or analysis of variance, is a term given to a set of statistical models that are used to analyze differences among groups and if the differences are statistically significant to arrive at any conclusion. The models were developed by statistician and evolutionary biologist Ronald Fischer. To give a very simplistic definition – ANOVA is an extension of the two way T-Test to multiple cases.

ANOVA is an older test and a fairly simple process, but is quite useful to understand.

Neural Nets Optimizing For Imperfect

John Cook describes a paradox with neural nets:

Deep neural networks have enough parameters to overfit the data, but there are various strategies to keep this from happening. A common way to avoid overfitting is to deliberately do a mediocre job of fitting the model.

When it works well, the shortcomings of the optimization procedure yield a solution that differs from the optimal solution in a beneficial way. But the solution could fail to be useful in several ways. It might be too far from optimal, or deviate from the optimal solution in an unhelpful way, or the optimization method might accidentally do too good a job.

Conceptually, this feels a little weird but isn’t really much of a problem, as we have other analogues:  rational ignorance in economics (where we knowingly choose not to know something because the benefit is not worth the opportunity cost of learning), OPTIMIZE FOR UNKNOWN with SQL Server (where we knowingly do not use the passed-in parameter because we might get stuck in a lesser path), etc.  But the specific process here is interesting.

Imbalanced Data In R

Rathnadevi Manivannan explains how to deal with imbalanced data using R:

Imbalanced data refers to classification problems where one class outnumbers other class by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extreme imbalanced data can be seen in banking or financial data where majority credit card uses are acceptable and very few credit card uses are fraudulent.

With an imbalanced dataset, the information required to make an accurate prediction about the minority class cannot be obtained using an algorithm. So, it is recommended to use balanced classification dataset.

Rathnadevi uses fraudulent transactions for his sample, but medical diagnoses is also a good example:  suppose 1 person in 10,000 has a particular disease.  You’re 99.99% right if you just say nobody has the disease, but that’s a rather unhelpful model.

Sentiment Analysis In R

Rachel Tatman has a great tutorial introducing sentiment analysis in R:

By the end of this tutorial you will:

  • Understand what sentiment analysis is and how it works
  • Read text from a dataset & tokenize it
  • Use a sentiment lexicon to analyze the sentiment of texts
  • Visualize the sentiment of text

If you’re the hands-on type, you might want to head directly to the notebook for this tutorial. You can fork it and have your very own version of the code to run, modify and experiment with as we go along.

Check it out.  There’s a lot more to sentiment analysis—cleaning and tokenizing words, getting context right, etc.—but this is a very nice introduction.


October 2017
« Sep