Fraud Detection With Python

Kevin Jacobs has a walkthrough of how to use Pandas and scikit-learn to perform fraud detection against a sample set of credit card transactions:

Apparently, the data consists of 28 variables (V1, …, V28), an “Amount” field a “Class” field and the “Time” field. We do not know the exact meanings of the variables (due to privacy concerns). The Class field takes values 0 (when the transaction is not fraudulent) and value 1 (when a transaction is fraudulent). The data is unbalanced: the number of non-fraudulent transactions (where Class equals 0) is way more than the number of fraudulent transactions (where Class equals 1). Furthermore, there is a Time field. Further inspection shows that these are integers, starting from 0.

There is a small trick for getting more information than only the raw records. We can use the following code:


This code will give a statistically summary of all the columns. It shows for example that the Amount field ranges between 0.00 and 25691.16. Thus, there are no negative transactions in the data.

The Kaggle competition data set is available, so you can follow along.

Related Posts

K Nearest Cliques

Vincent Granville explains an algorithm built around finding cliques of data points: The cliques considered here are defined by circles (in two dimensions) or spheres (in three dimensions.) In the most basic version, we have one clique for each cluster, and the clique is defined as the smallest circle containing a pre-specified proportion p of the points […]

Read More

Building An Image Recognizer With R

David Smith has a post showing how to build an image recognizer with R and Microsoft’s Cognitive Services Library: The process of training an image recognition system requires LOTS of images — millions and millions of them. The process involves feeding those images into a deep neural network, and during that process the network generates […]

Read More


June 2017
« May Jul »