Category: Data Science

Outliers In Histograms

Published 2017-04-27 by Kevin Feasel

Edwin Thoen has an interesting solution to a classic problem with histograms:

Two strategies that make the above into something more interpretable are taking the logarithm of the variable, or omitting the outliers. Both do not show the original distribution, however. Another way to go, is to create one bin for all the outlier values. This way we would see the original distribution where the density is the highest, while at the same time getting a feel for the number of outliers. A quick and dirty implementation of this would be

hist_data %>% 
  mutate(x_new = ifelse(x > 10, 10, x)) %>% 
  ggplot(aes(x_new)) +
  geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")

Edwin then shows a nicer solution, so read the whole thing.

Comments closed

The Birthday Problem

Published 2017-04-26 by Kevin Feasel

Mala Mahadevan explains the Birthday problem and demonstrates it with SQL and R:

Given a room of 23 random people, what are chances that two or more of them have the same birthday?

This problem is a little different from the earlier ones, where we actually knew what the probability in each situation was.

What are chances that two people do NOT share the same birthday? Let us exclude leap years for now..chances that two people do not share the same birthday is 364/365, since one person’s birthday is already a given. In a group of 23 people, there are 253 possible pairs (23*22)/2. So the chances of no two people sharing a birthday is 364/365 multiplied 253 times. The chances of two people sharing a birthday, then, per basics of probability, is 1 – this.

The funny thing for me is that I’ve had the Birthday problem explained three separate times using as a demo the 20-30 people in the classroom. In none of those three cases was there a match, so although I understand that it is correct and how it is correct, the 100% failure to replicate led a little nagging voice in the back of my mind to discount it.

Comments closed

Statistics For Programmers

Published 2017-04-26 by Kevin Feasel

Julia Evans shares some good resources for developers interested in statistics:

even more links

a paper someone said was good (by Efron): Bootstrap Methods: another look at the jackknife
this book by 5 people named lock
this blog post has an overview of different nonparametric tests
this podcast with Philip Guo and John DeNero where they talk about teaching stats to programmers
nonparametric statistical methods
openintro has free some statistics books

There are a lot of good links in Julia’s post. I should also mention that Andrew Gelman and Deborah Nolan have a new book coming out in July. Gelman’s Bayesian approach suits me well, so I’m pre-ordering the book.

Comments closed

Building A Recommendation System With Graph Data

Published 2017-04-25 by Kevin Feasel

Arvind Shyamsundar and Shreya Verma show how to implement a recommender system using SQL Server 2017’s new graph database functionality:

As you can see from the animation, the algorithm is quite simple:

First, we identify the user and ‘current’ song to start with (red line)

Next, we identify the other users who have also listened to this song (green line)

Then we find the other songs which those other users have also listened to (blue, dotted line)

Finally, we direct the current user to the top songs from those other songs, prioritized by the number of times they were listened to (this is represented by the thick violet line.)

The algorithm above is quite simple, but as you will see it is quite effective in meeting our requirement. Now, let’s see how to actually implement this in SQL Server 2017.

Click through for animated images as well as an actual execution plan and recommendations for graph query optimization (spoilers: columnstore all the things). They also link to the GitHub project where you can try it out yourself.

Comments closed

Instantiating Cortana Intelligence Gallery Solutions

Published 2017-04-25 by Kevin Feasel

Melissa Coates has a step-by-step guide showing how to install a solution from the Cortana Intelligence Gallery:

We had no options along the way for selecting names for resources, so we have a lot of auto-generated suffixes for our resource names. This is ok for purely learning scenarios, but not my preference if we’re starting a true project with a pre-configured solution. Following an existing naming convention is impossible with solutions (at this point anyway). A wish list item I have is for the solution deployment UI to display the proposed names for each resource and let us alter if desired before the provisioning begins.

The deployment also doesn’t prompt for which subscription to deploy to (if you have multiple subscriptions like I do). The deployment did go to the subscription I wanted, however, it would be really nice to have that as a selection to make sure it’s not just luck.

It sounds like there are some undesirable defaults, but at least it does appear to be very easy to do.

Comments closed

Fitness In Modeling

Published 2017-04-24 by Kevin Feasel

Leila Etaati notes the Scylla and Charybdis of models:

However, in the most machine learning experiences, we will face two risks :Over fitting and under fitting.
I will explain these two concepts via an example below.
imagine that we have collected information about the number of coffees that have been purchased in a café from 8am to 5pm.

Overfitting tends to be a bigger problem in my experience, but they’re both dangerous.

Comments closed

Logistic Regression In R

Published 2017-04-21 by Kevin Feasel

Steph Locke has a presentation on performing logistic regression using R:

Logistic regressions are a great tool for predicting outcomes that are categorical. They use a transformation function based on probability to perform a linear regression. This makes them easy to interpret and implement in other systems.

Logistic regressions can be used to perform a classification for things like determining whether someone needs to go for a biopsy. They can also be used for a more nuanced view by using the probabilities of an outcome for thinks like prioritising interventions based on likelihood to default on a loan.

It’s a good introduction to an important statistical method.

Comments closed

Understanding The Problem: Churn Edition

Published 2017-04-18 by Kevin Feasel

Emre Yazici points out the importance and difficulty of nailing down good definitions, using bookings churn as an example:

WHEN: Let’s say, we have made our design, constructed a model and obtained a good accuracy. However our model predicts (even with 95% accuracy) the customer who are going to churn in next day! That means our business department have to prevent (somehow, as explained before) those customers to churn in “one day”. Because next day, they will not be our customers. Taking an action to “3000” customers (let’s say) in one day only is impossible. So even our project predicts with very high accuracy, it will not be usefull. This approach also creates another problem: Consider that N months ago, a customer “A” was a happy customer and was working (providing us) with us (let’s say, it is a customer with %100 efficiency – happiness) and tomorrow it will be a customer who is not working with us (a customer with %100 efficiency – happiness). And we can predict the result today. So most probably, the customer has already got the idea to leave from our company in the last day. This is a deadend and we can not prevent the customer to churn at this point – because it is already too late.

So we need to have a certain time limit… Such that we need to be able to warn the business department “M months” before (customer churn) thus they can take action before the customers leave. Here comes another problem, what is the time limit… 2 months, 2.5 months, 3 months…? How do we determine the time, that we need to predict customers churn before (they leave)?

There’s a lot more to a good solution than “I ran a regression against a data set.”

Comments closed

When Binomials Converge

Published 2017-04-18 by Kevin Feasel

Mala Mahadevan shows an example of the central limit theorem in action, as a large enough sample from a binomial distribution approximates the normal:

An easier way to do it is to use the normal distribution, or central limit theorem. My post on the theorem illustrates that a sample will follow normal distribution if the sample size is large enough. We will use that as well as the rules around determining probabilities in a normal distribution, to arrive at the probability in this case.
Problem: I have a group of 100 friends who are smokers. The probability of a random smoker having lung disease is 0.3. What are chances that a maximum of 35 people wind up with lung disease?

Click through for the example.

Comments closed

Logistic Regression With R

Published 2017-04-17 by Kevin Feasel

Raghavan Madabusi runs through a sample logistic regression:

Input Variables: These variables are called as predictors or independent variables.

Customer Demographics (Gender and Senior citizenship)

Billing Information (Monthly and Annual charges, Payment method)

Product Services (Multiple line, Online security, Streaming TV, Streaming Movies, and so on)

Customer relationship variables (Tenure and Contract period)

Output Variables: These variables are called as response or dependent variables. Since the output variable (Churn value) takes the binary form as “0” or “1”, it will be categorized under classification problem in the supervised machine learning.

One of the interesting things in this post was the use of missmap, which is part of Amelia.

Comments closed