# Category: Data Science

Suppose I look at a sampling of 100 americans who are asked if they approve of the job the supreme court is doing. Let us say for simplicity’s sake that the only two answers possible are yes or no. Out of 100, say 40% say yes. As an ordinary person, you would think 40% of people just approve. But a deeper answer would be – the true proportion of americans who approve of the job the supreme court is doing is between x% and y%.

How confident I am that it is?  About z%. (the common math used is 95%).  That is an answer that is more reflective of the uncertainty related to questioning people and taking the answers to be what is truly reflective of an opinion. The x and y values make up what is called a ‘confidence interval’.

Last month I finished a four-week course on Bayesian statistics. I have always wondered why people deemed it hard, and why I heard that the computations quickly became complicated. The course wasn’t that hard, and it gave a nice introduction to prior/posterior distributions and I many cases also how to interpret the parameters in the prior distribution as extra data points.

An interesting aspect of Bayesian statistics is that it is a mathematically rigorous model, with no magic numbers such as the 5% threshold for p-values. And I like the way it naturally caters sequential hypothesis testing with where the sample size of each iteration is not fixed in advance. Instead data are evaluated and used to update the model as they are collected.

Check out Kennie’s explanation as well as the course.  I also went through Bayes’ Theorem not too long ago, which is a good introduction to the topic if you’re unfamiliar with Bayes’s Law.

We can frame conversion prediction as a binary classification problem, with outcome “1” when the visitor converts, and outcome “0” when they do not. Suppose we build a model to predict conversion using site visitor features. Some examples of relevant features are: time of day, geographical features based on a visitor’s IP address, their device type, such as “iPhone”, and features extracted from paid ads the visitor interacted with online.

A static classification model, such as logistic regression, assumes the influence of all features is stable over time, in other words, the coefficients in the model are constants. For many applications, this assumption is reasonable—we wouldn’t expect huge variations in the effect of a visitor’s device type. In other situations, we may want to allow for coefficients that change over time—as we better optimize our paid ad channel, we expect features extracted from ad interactions to be more influential in our prediction model.

We chose to use the third-order Markov Chain on the above-produced data, as:

• The number of parameters needed for the chain’s representation remains manageable. As the order increases, the parameters necessary for the representation increase exponentially and thus managing them requires significant computational power.
• As a rule of thumb, we would like at least half of the clickstreams to consist of as many clicks as the order of the Markov Chain that should be fitted. There is no point in selecting a third-order chain if the majority of the clickstream consists of two states and so there is no state three steps behind to take into consideration.

Fitting the Markov Chain model gives us transition probabilities matrices and the lambda parameters of the chain for each one of the three lags, along with the start and end probabilities.

This particular analysis is trying to understand which page (if any) a user will go to next when on a particular page.  Eleni uses additional techniques like k-means clustering to segment out particular groups of users.  Very interesting analysis.

Survival analysis is a set of methods to analyze the ‘time to occurrence’ of an event. The response is often referred to as a failure time, survival time, or event time. These methods are widely used in clinical experiments to analyze the ‘time to death’, but nowadays these methods are being used to predict the ‘when’ and ‘why’ of customer churn or employee turnover as well.

The dependent variables for the analysis are generally two functions:

Read the whole thing.  H/T R-Bloggers

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.
Linear classifier defined that the training data should be classified into corresponding categories i.e. if we are applying classification for the 2 categories then all the training data must be lie in these two categories.
Binary classifier defines that there should be only 2 categories for classification.
Hence, The basic Perceptron algorithm is used for binary classification and all the training example should lie in these categories. The basic unit in the Neuron is called the Perceptron.

The prime advantage of a 2d-tree over a BST is that it supports efficient implementation of range search and nearest-neighbor search. Each node corresponds to an axis-aligned rectangle, which encloses all of the points in its subtree. The root corresponds to the entire plane [(−∞, −∞), (+∞, +∞ )]; the left and right children of the root correspond to the two rectangles split by the x-coordinate of the point at the root; and so forth.

• Range search: To find all points contained in a given query rectangle, start at the root and recursively search for points in both subtrees using the following pruning rule: if the query rectangle does not intersect the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, search a subtree only if it might contain a point contained in the query rectangle.

• Nearest-neighbor search: To find a closest point to a given query point, start at the root and recursively search in both subtrees using the following pruning rule: if the closest point discovered so far is closer than the distance between the query point and the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, search a node only if it might contain a point that is closer than the best one found so far. The effectiveness of the pruning rule depends on quickly finding a nearby point. To do this, organize the recursive method so that when there are two possible subtrees to go down, you choose first the subtree that is on the same side of the splitting line as the query point; the closest point found while exploring the first subtree may enable pruning of the second subtree.

• k-nearest neighbors search: This method returns the k points that are closest to the query point (in any order); return all n points in the data structure if n ≤ k. It must do this in an efficient manner, i.e. using the technique from kd-tree nearest neighbor search, not from brute force.

Sandipan implements a fairly classic problem in this space:  the behavior of a group of flocking birds.

### STAGE 1.  Determine whether PCA is appropriate at all, considering the variables

• Variables should be inter-correlated enough but not too much. Field et al. (2012) provide some thresholds, suggesting that no variable should have many correlations below .30, or any correlation at all above .90. Thus, in the example here, variable Q06 should probably be excluded from the PCA.

• Bartlett’s test, on the nature of the intercorrelations, should be significant. Significance suggests that the variables are not an ‘identity matrix’ in which correlations are a sampling error.

• KMO (Kaiser-Meyer-Olkin), a measure of sampling adequacy based on common variance (so similar purpose as Bartlett’s). As Field et al. review, ‘values between .5 and .7 are mediocre, values between .7 and .8 are good, values between .8 and .9 are great and values above .9 are superb’ (p. 761). There’s a general score as well as one per variable. The general one will often be good, whereas the individual scores may more likely fail. Any variable with a score below .5 should probably be removed, and the test should be run again.

• Determinant: A formula about multicollinearity. The result should preferably fall below .00001.

PCA is a powerful tool in several fields, including clinical testing.

The many things I happen to have issues with in common mis-use of `p`-values include:

1. `p`-hacking. This includes censored data bias, repeated measurement bias, and even outright fraud.

2. “Statsmanship” (the deliberate use of statistical terminology for obscurity, not for clarity). For example: saying `p` instead of saying what you are testing such as “significance of a null hypothesis”.

3. Logical fallacies. This is the (false) claim that `p` being low implies that the probability that your model is good is high. At best a low-`p` eliminates a null hypothesis (or even a family of them). But saying such disproof “proves something” is just saying “the butler did it” because you find the cook innocent (a simple case of a fallacy of an excluded middle).

4. Confusion of population and individual statistics. This is the use of deviation of sample means (which typically decreases as sample size goes up) when deviation of individual differences (which typically does not decrease as sample size goes up) is what is appropriate . This is one of the biggest scams in data science and marketing science: showing that you are good at predicting aggregate (say, the mean number of traffic deaths in the next week in a large city) and claiming this means your model is good at predicting per-individual risk. Some of this comes from the usual statistical word games: saying “standard error” (instead of “standard error of the mean or population”) and “standard deviation” (“instead of standard deviation of individual cases”); with some luck somebody won’t remember which is which and be too afraid to ask.

Even if you know what p-values are, this is definitely worth reading, as it’s so easy to misuse p-values (even when I’m not on my Bayesian post hurling tomatoes at frequentists).

### Causes of Overfitting

There are two major situations that could cause overfitting in DTrees:

1. Overfitting Due to Presence of Noise – Mislabeled instances may contradict the class labels of other similar records.
2. Overfitting Due to Lack of Representative Instances – Lack of representative instances in the training data can prevent refinement of the learning algorithm.

A good model must not only fit the training data well
but also accurately classify records it has never seen.

### How to avoid overfitting?

There are 2 major approaches to avoid overfitting in DTrees.

1. approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.

2. approaches that allow the tree to overfit the data, and then post-prune the tree.

Click through for more details on these two approaches.