Category: Data Science

Survival Analysis

Published 2017-04-27 by Kevin Feasel

Joseph Rickert explains what survival analysis is and shows an example with R:

Looking at the Task View on a small screen is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. (I would be remiss not to mention that we all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.) Newcomers, people either new to R or new to survival analysis or both, must find it overwhelming. So, it is with newcomers in mind that I offer the following slim trajectory through the task view that relies on just a few packages: survival, KMsurv, Oisurv and ranger
The survival package, which began life as an S package in the late ’90s, is the cornerstone of the entire R Survival Analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R.

Survival analysis is an interesting field of study. In engineering fields, the most common use is calculating mean time to failure, but that’s certainly not the only place you’re liable to see it.

Comments closed

Outliers In Histograms

Published 2017-04-27 by Kevin Feasel

Edwin Thoen has an interesting solution to a classic problem with histograms:

Two strategies that make the above into something more interpretable are taking the logarithm of the variable, or omitting the outliers. Both do not show the original distribution, however. Another way to go, is to create one bin for all the outlier values. This way we would see the original distribution where the density is the highest, while at the same time getting a feel for the number of outliers. A quick and dirty implementation of this would be

hist_data %>% mutate(x_new = ifelse(x > 10, 10, x)) %>% ggplot(aes(x_new)) + geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")

Edwin then shows a nicer solution, so read the whole thing.

Comments closed

The Birthday Problem

Published 2017-04-26 by Kevin Feasel

Mala Mahadevan explains the Birthday problem and demonstrates it with SQL and R:

Given a room of 23 random people, what are chances that two or more of them have the same birthday?
This problem is a little different from the earlier ones, where we actually knew what the probability in each situation was.
What are chances that two people do NOT share the same birthday? Let us exclude leap years for now..chances that two people do not share the same birthday is 364/365, since one person’s birthday is already a given. In a group of 23 people, there are 253 possible pairs (23*22)/2. So the chances of no two people sharing a birthday is 364/365 multiplied 253 times. The chances of two people sharing a birthday, then, per basics of probability, is 1 – this.

The funny thing for me is that I’ve had the Birthday problem explained three separate times using as a demo the 20-30 people in the classroom. In none of those three cases was there a match, so although I understand that it is correct and how it is correct, the 100% failure to replicate led a little nagging voice in the back of my mind to discount it.

Comments closed

Statistics For Programmers

Published 2017-04-26 by Kevin Feasel

Julia Evans shares some good resources for developers interested in statistics:

even more links

a paper someone said was good (by Efron): Bootstrap Methods: another look at the jackknife
this book by 5 people named lock
this blog post has an overview of different nonparametric tests
this podcast with Philip Guo and John DeNero where they talk about teaching stats to programmers
nonparametric statistical methods
openintro has free some statistics books

There are a lot of good links in Julia’s post. I should also mention that Andrew Gelman and Deborah Nolan have a new book coming out in July. Gelman’s Bayesian approach suits me well, so I’m pre-ordering the book.

Comments closed

Building A Recommendation System With Graph Data

Published 2017-04-25 by Kevin Feasel

Arvind Shyamsundar and Shreya Verma show how to implement a recommender system using SQL Server 2017’s new graph database functionality:

As you can see from the animation, the algorithm is quite simple:
First, we identify the user and ‘current’ song to start with (red line)
Next, we identify the other users who have also listened to this song (green line)
Then we find the other songs which those other users have also listened to (blue, dotted line)
Finally, we direct the current user to the top songs from those other songs, prioritized by the number of times they were listened to (this is represented by the thick violet line.)
The algorithm above is quite simple, but as you will see it is quite effective in meeting our requirement. Now, let’s see how to actually implement this in SQL Server 2017.

Click through for animated images as well as an actual execution plan and recommendations for graph query optimization (spoilers: columnstore all the things). They also link to the GitHub project where you can try it out yourself.

Comments closed

Instantiating Cortana Intelligence Gallery Solutions

Published 2017-04-25 by Kevin Feasel

Melissa Coates has a step-by-step guide showing how to install a solution from the Cortana Intelligence Gallery:

We had no options along the way for selecting names for resources, so we have a lot of auto-generated suffixes for our resource names. This is ok for purely learning scenarios, but not my preference if we’re starting a true project with a pre-configured solution. Following an existing naming convention is impossible with solutions (at this point anyway). A wish list item I have is for the solution deployment UI to display the proposed names for each resource and let us alter if desired before the provisioning begins.
The deployment also doesn’t prompt for which subscription to deploy to (if you have multiple subscriptions like I do). The deployment did go to the subscription I wanted, however, it would be really nice to have that as a selection to make sure it’s not just luck.

It sounds like there are some undesirable defaults, but at least it does appear to be very easy to do.

Comments closed

Fitness In Modeling

Published 2017-04-24 by Kevin Feasel

Leila Etaati notes the Scylla and Charybdis of models:

However, in the most machine learning experiences, we will face two risks :Over fitting and under fitting.
I will explain these two concepts via an example below.
imagine that we have collected information about the number of coffees that have been purchased in a café from 8am to 5pm.

Overfitting tends to be a bigger problem in my experience, but they’re both dangerous.

Comments closed

Logistic Regression In R

Published 2017-04-21 by Kevin Feasel

Steph Locke has a presentation on performing logistic regression using R:

Logistic regressions are a great tool for predicting outcomes that are categorical. They use a transformation function based on probability to perform a linear regression. This makes them easy to interpret and implement in other systems.
Logistic regressions can be used to perform a classification for things like determining whether someone needs to go for a biopsy. They can also be used for a more nuanced view by using the probabilities of an outcome for thinks like prioritising interventions based on likelihood to default on a loan.

It’s a good introduction to an important statistical method.

Comments closed

Understanding The Problem: Churn Edition

Published 2017-04-18 by Kevin Feasel

Emre Yazici points out the importance and difficulty of nailing down good definitions, using bookings churn as an example:

WHEN: Let’s say, we have made our design, constructed a model and obtained a good accuracy. However our model predicts (even with 95% accuracy) the customer who are going to churn in next day! That means our business department have to prevent (somehow, as explained before) those customers to churn in “one day”. Because next day, they will not be our customers. Taking an action to “3000” customers (let’s say) in one day only is impossible. So even our project predicts with very high accuracy, it will not be usefull. This approach also creates another problem: Consider that N months ago, a customer “A” was a happy customer and was working (providing us) with us (let’s say, it is a customer with %100 efficiency – happiness) and tomorrow it will be a customer who is not working with us (a customer with %100 efficiency – happiness). And we can predict the result today. So most probably, the customer has already got the idea to leave from our company in the last day. This is a deadend and we can not prevent the customer to churn at this point – because it is already too late.
So we need to have a certain time limit… Such that we need to be able to warn the business department “M months” before (customer churn) thus they can take action before the customers leave. Here comes another problem, what is the time limit… 2 months, 2.5 months, 3 months…? How do we determine the time, that we need to predict customers churn before (they leave)?

There’s a lot more to a good solution than “I ran a regression against a data set.”

Comments closed

When Binomials Converge

Published 2017-04-18 by Kevin Feasel

Mala Mahadevan shows an example of the central limit theorem in action, as a large enough sample from a binomial distribution approximates the normal:

An easier way to do it is to use the normal distribution, or central limit theorem. My post on the theorem illustrates that a sample will follow normal distribution if the sample size is large enough. We will use that as well as the rules around determining probabilities in a normal distribution, to arrive at the probability in this case.
Problem: I have a group of 100 friends who are smokers. The probability of a random smoker having lung disease is 0.3. What are chances that a maximum of 35 people wind up with lung disease?

Click through for the example.

Comments closed