Press "Enter" to skip to content

Category: Data Science

Fraud Detection With Python

Kevin Jacobs has a walkthrough of how to use Pandas and scikit-learn to perform fraud detection against a sample set of credit card transactions:

Apparently, the data consists of 28 variables (V1, …, V28), an “Amount” field a “Class” field and the “Time” field. We do not know the exact meanings of the variables (due to privacy concerns). The Class field takes values 0 (when the transaction is not fraudulent) and value 1 (when a transaction is fraudulent). The data is unbalanced: the number of non-fraudulent transactions (where Class equals 0) is way more than the number of fraudulent transactions (where Class equals 1). Furthermore, there is a Time field. Further inspection shows that these are integers, starting from 0.

There is a small trick for getting more information than only the raw records. We can use the following code:


This code will give a statistically summary of all the columns. It shows for example that the Amount field ranges between 0.00 and 25691.16. Thus, there are no negative transactions in the data.

The Kaggle competition data set is available, so you can follow along.

Comments closed

Forcing 0 Intercept Inflates R-squared In R

John Mount has an informative post on how you can trick yourself when running linear regression models in R and forcing the y intercept to be 0:

So far so good. Let’s now remove the “intercept term” by adding the “0+” from the fitting command.

m2 <- lm(y~0+x, data=d)t(broom::glance(m2))
## [,1]
## r.squared 7.524811e-01
## adj.r.squared 7.474297e-01
## sigma 3.028515e-01
## statistic 1.489647e+02
## p.value 1.935559e-30
## df 2.000000e+00
## logLik -2.143244e+01
## AIC 4.886488e+01
## BIC 5.668039e+01
## deviance 8.988464e+00
## df.residual 9.800000e+01
d$pred2 <- predict(m2, newdata = d)

Uh oh. That appeared to vastly improve the reported R-squared and the significance (“p.value“)!

Read on to learn why this happens and how you can prevent this from tricking you in the future.

Comments closed

Facial Recognition With Amazon Rekognition

Chris Adzima describes how his law enforcement agency uses Amazon Rekognition for facial recognition:

Setup was fairly straightforward. In the Washington County jail management system (JMS), we have an archive of mugshots going back to 2001. We needed to get the mugshots (all 300,000 of them) into Amazon S3. Then we need to index them all in Amazon Rekognition, which took about 3 days.

Our JMS allows us to tag the shots with the following information: front view or side view, scars, marks, or tattoos. We only wanted the front view, so we used those tags to get a list of just those.

Read on for sample implementation details, including moving images to S3, building the facial recognition “database,” and using it.

Comments closed

Cochran-Mantel-Haenszel Test

Mala Mahadevan explains the Cochran-Mantel-Haenszel test, with two parts up so far.  First, her data set:

Below is the script to create the table and dataset I used. This is just test data and not copied from anywhere.

Second, an introduction to the test itself and solutions in R and T-SQL:

This test is an extension of the Chi Square test I blogged of earlier. This is applied when we have to compare two groups over several levels and comparison may involve a third variable.
Let us consider a cohort study as an example – we have two medications A and B to treat asthma. We test them on a randomly selected batch of 200 people. Half of them receive drug A and half of them receive drug B. Some of them in either half develop asthma and some have it under control. The data set I have used can be found here. The summarized results are as below.

This series is not yet complete, so stay tuned.

Comments closed

Quantile Regression With Python

Gopi Subramanian discusses one of my favorite regression concepts, heteroskedasticity:

With variance score of 0.43 linear regression did not do a good job overall. When the x values are close to 0, linear regression is giving a good estimate of y, but we near end of x values the predicted y is far way from the actual values and hence becomes completely meaningless.

Here is where Quantile Regression comes to rescue. I have used the python package statsmodels 0.8.0 for Quantile Regression.

Let us begin with finding the regression coefficients for the conditioned median, 0.5 quantile.

The article doesn’t render the code very well at all, but Gopi does have the example code on Github, so you can follow along that way.

Comments closed

Sentiment Analysis In R

Stefan Feuerriegel and Nicolas Pröllochs have a new package in CRAN:

Our package “SentimentAnalysis” performs a sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as QDAP or Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable.

I’m not sure how it stacks up to external services, but it’s another option available to us.

Comments closed

Understanding Random Forests

Manish Kumar Barnwal explains how random forest algorithms work:

Say our dataset has 1,000 rows and 30 columns. There are two levels of randomness in this algorithm:

  • At row level: Each of these decision trees gets a random sample of the training data (say 10%) i.e. each of these trees will be trained independently on 100 randomly chosen rows out of 1,000 rows of data. Keep in mind that each of these decision trees is getting trained on 100 randomly chosen rows from the dataset i.e they are different from each other in terms of predictions.
  • At column level: The second level of randomness is introduced at the column level. Not all the columns are passed into training each of the decision trees. Say we want only 10% of columns to be sent to each tree. This means a randomly selected 3 column will be sent to each tree. So for the first decision tree, may be column C1, C2 and C4 were chosen. The next DT will have C4, C5, C10 as chosen columns and so on.

This  is a nice article and includes cases when not to use random forests.

Comments closed

Multi-Shot Games

Dan Goldstein explains a counter-intuitive probability exercise:

Peter Ayton is giving a talk today at the London Judgement and Decision Making Seminar

Imagine being obliged to play Russian roulette – twice (if you are lucky enough to survive the first game). Each time you must spin the chambers of a six-chambered revolver before pulling the trigger. However you do have one choice: You can choose to either (a) use a revolver which contains only 2 bullets or (b) blindly pick one of two other revolvers: one revolver contains 3 bullets; the other just 1 bullet. Whichever particular gun you pick you must use every time you play. Surprisingly, option (b) offers a better chance of survival

My recommendation is to avoid playing Russian roulette.

Comments closed

Visualizing Emergency Room Visits

Eugene Joh has a great blog post showing how to parse ICD-9 codes using regular expressions and then visualize the results as a treemap:

It looks like there is a header/title at [1], numeric grouping  at [2] “1.\tINFECTIOUS AND PARASITIC DISEASES”,  subgrouping by ICD-9 code ranges, at [3] “Intestinal infectious diseases (001-009)” and then 3-digit ICD-9 codes followed by a specific diagnosis, at [10] “007\tOther protozoal intestinal diseases”. At the end we want to produce three separate data frames that we’ll categorize as:

  1. Groups: the title which contains the general diagnosis grouping

  2. Subgroups: the range of ICD-9 codes that contain a certain diagnosis subgroup

  3. Classification: the specific 3-digit ICD-9 code that corresponds with a diagnosis

It’s a beefy article full of insight.

Comments closed

Fisher’s Exact Test

Mala Mahadevan explains Fisher’s Exact Test and provides examples in T-SQL and R:

The decision rule in two sample tests of hypothesis depends on three factors :
1 Whether the test is upper, lower or two tailed (meaning the comparison is greater, lesser or both sides of gender and speaker count)
2 The level of significance or degree of accuracy needed,
3 The form of test statistic.
Our test here is to just find out if gender and speaker count are related so it is a two tailed test. The level of significance we can use is the most commonly used 95% which is also the default in R for Fischer’s Test. The form of the test statistic is P value. So our decision rule would be that gender and speaker category are related if P value is less than 0.05.

Click through for the R code followed by a code sample which should explain why you don’t want to do it in T-SQL.

1 Comment