Category: Data Science

Handling Imbalanced Data

Published 2017-11-20 by Kevin Feasel

Tom Fawcett shows us how to handle a tricky classification problem:

The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue.

Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. Here are some examples:

About 2% of credit card accounts are defrauded per year. (Most fraud detection domains are heavily imbalanced.)

Medical screening for a condition is usually performed on a large population of people without the condition, to detect a small minority with it (e.g., HIV prevalence in the USA is ~0.4%).

Disk drive failures are approximately ~1% per year.

The conversion rates of online ads has been estimated to lie between 10^-3 to 10^-6.

Factory production defect rates typically run about 0.1%.

Many of these domains are imbalanced because they are what I call needle in a haystackproblems, where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases.

Read on for some good advice on how to handle imbalanced data.

Comments closed

Interpreting P-Value Histograms

Published 2017-11-15 by Kevin Feasel

David Robinson visualizes and interprets different p-value histograms:

So you’re a scientist or data analyst, and you have a little experience interpreting p-values from statistical tests. But then you come across a case where you have hundreds, thousands, or even millions of p-values. Perhaps you ran a statistical test on each gene in an organism, or on demographics within each of hundreds of counties. You might have heard about the dangers of multiple hypothesis testing before. What’s the first thing you do?

Make a histogram of your p-values. Do this before you perform multiple hypothesis test correction, false discovery rate control, or any other means of interpreting your many p-values. Unfortunately, for some reason, this basic and simple task rarely gets recommended (for instance, the Wikipedia page on the multiple comparisons problem never once mentions this approach). This graph lets you get an immediate sense of how your test behaved across all your hypotheses, and immediately diagnose some potential problems. Here, I’ll walk you through a basic example of interpreting a p-value histogram.

It’s a fun read and informative as well.

Comments closed

The Magic Of Sampling

Published 2017-11-13 by Kevin Feasel

Nathan LeClaire reminds us of an important story that statisticians have been telling us for a couple centuries:

It starts slowly. Maybe your home-grown centralized logging cluster becomes more difficult to operate, demanding unholy amounts of engineer time every week. Maybe engineers start to find that making a query about production is a “go get a coffee and come back later” activity. Or maybe monitoring vendors offer you a quote that elicits a response ranging anywhere from curses under the breath to blood-curdling screams of terror.

The multi-headed beast we know as Scale has reared its ugly visage.

As some of you may have already guessed from the title, I’m going to discuss one way to solve this problem, and why it might not be as bad as you might think.

Take some of your precious information and throw it in the garbage. In lots of cases, you can just drop those writes on the floor as long as your observability stack is equipped to handle it.

In other words, sample.

Read on for a couple of methods. One thing I’ve taken a fancy to is collecting the first N of a particular type of message and keeping track of how often that message appears. If you get the same error for every row in a file, then you might really only need to see that one time and the number of times it happened. Or maybe you want to see a few of them to ensure that they’re really the same error and not two separate errors which are getting reported together due to insufficient error separation.

Comments closed

Quickly Computing Area Under The Curve

Published 2017-11-13 by Kevin Feasel

Jean-Francois Puget has a fast method for computing Area Under the Curve in Python:

When the target only takes two values we have a binary classification problem at hand. Example of binary classification are very common. For instance fraud detection where examples are credit card transactions, features are time, location, amount, merchant id, etc., and target is fraud or not fraud. Spam detection is also a binary classification where examples are emails, features are the email content as a string of words, and target is spam or not spam. Without loss of generality we can assume that the target values are 0 and 1, for instance 0 means no fraud or no spam, whiloe 1 means fraud or spam.

For binary classification, predictions are also binary. Therefore, a prediction is either equal to the target, or is off the mark. A simple way to evaluate model performance is accuracy: how many predictions are right? For instance, if our test set has 100 examples in it, how many times is the prediction correct? Accuracy seems a logical way to evaluate performance: a higher accuracy obviously means a better model. At least this is what people think when they are exposed to the first time to binary classification problems. Issue is that accuracy can be extremely misleading.

Read Jean-Francois’ explanation and scroll down for the Python sample.

Comments closed

TensorFlow Tutorial

Published 2017-11-10 by Kevin Feasel

Ashish Bakshi has a TensorFlow tutorial:

As shown in the image above, tensors are just multidimensional arrays, that allows you to represent data having higher dimensions. In general, Deep Learning you deal with high dimensional data sets where dimensions refer to different features present in the data set. In fact, the name “TensorFlow” has been derived from the operations which neural networks perform on tensors. It’s literally a flow of tensors. Since, you have understood what are tensors, let us move ahead in this TensorFlow tutorial and understand – what is TensorFlow?

The sample here is Python, though there is an R library as well.

Comments closed

Creating A Poekr AI In Python

Published 2017-11-03 by Kevin Feasel

Kevin Jacobs has a fairly simple framework for building poker-playing bots:

The bot uses Monte Carlo simulations running from a given state. Suppose you start with 2 high cards (two Kings for example), then the chances are high that you will win. The Monte Carlo simulation then simulates a given number of games from that point and evaluates which percentage of games you will win given these cards. If another King shows during the flop, then your chance of winning will increase. The Monte Carlo simulation starting at that point, will yield a higher winning probability since you will win more games on average.

If we run the simulations, you can see that the bot based on Monte Carlo simulations outperforms the always calling bot. If you start with a stack of $100,-, you will on average end with a stack of $120,- (when playing against the always-calling bot).

It’s a start, and an opening for more sophisticated logic and analysis.

Comments closed

The Importance Of Distributions

Published 2017-11-03 by Kevin Feasel

Jocelyn Barker explains distributions using role-playing games as an example:

We see that for the entire curve, our odds of success goes down when we add criticals and for most of the curve, it goes up for 3z8. Lets think about why. We know the guards are more likely to roll a 20 and less likely to roll a 1 from the distribution we made earlier. This happens about 14% of the time, which is pretty common, and when it happens, the rogue has to have a very high modifier and still roll well to overcome it unless they also roll a 20. On the other hand, with 3z8 system, criticals are far less common and everyone rolls close to average more of the time. The expected value for the rogue is ~10.5, where as it is ~14 for the guards, so when everyone performs close to average, the rogue only needs a small modifier to have a reasonable chance of success.

It’s a nice spin on a classic statistics lesson.

Comments closed

Team Data Science Process Updates

Published 2017-11-02 by Kevin Feasel

David Smith announces updates to the Team Data Science Process:

It’s been over a year since we first introduced introduced the Team Data Science Process (TDSP). The data, technology and practices behind Data Science continue to evolve, and the TDSP has evolved in parallel. Over the past year, several new facets have been added, including:

The IDEAR (Interactive Data Exploration, Analysis and Reporting) framework, an open source extension to R and Python designed to standardize the process of data exploration and reporting;
Guidance for use of Spark 2.0, including an end-to-end Spark v2.0 walkthrough;
Guidance for use of in-database Python with SQL Server, including an end-to-end in-database Python tutorial;

Click through for more changes, as well as links to further resources.

Comments closed

Text Preprocessing With R

Published 2017-11-01 by Kevin Feasel

Sibanjan Das has started a new series on text mining in R:

Next, we need to preprocess the text to convert it into a format that can be processed for extracting information. It is essential to reduce the size of the feature space before analyzing the text. There are various preprocessing methods that we can use here, such as stop word removal, case folding, stemming, lemmatization, and contraction simplification. However, it is not necessary to apply all of the normalization methods to the text. It depends on the data we retrieve and the kind of analysis to be performed.

The series starts off with a quick description of some preprocessing steps and then building an LDA model to extract key terms from articles.

Comments closed

Kaggle Data Science Report For 2017

Published 2017-10-31 by Kevin Feasel

Mark McDonald rounds up a few notebooks covering a recent Kaggle survey:

In 2017 we conducted our first ever extra-large, industry-wide survey to captured the state of data science and machine learning.

As the data science field booms, so has our community. In 2017 we hit a new milestone of reaching over 1M registered data scientists from almost every country in the world. Representing many different backgrounds, skill levels, and professions, we were excited to ask our community a wide range of questions about themselves, their skills, and their path to data science. We asked them everything from “what’s your yearly salary?” to “what’s your favorite data science podcasts?” to “what barriers are faced at work?”, letting us piece together key insights about the people and the trends behind the machine learning models.

Without further ado, we’d love to share everything with you. Over 16,000 responses surveys were submitted, with over 6 full months of aggregated time spent completing it (an average response time of more than 16 minutes).

Click through for a few reports. Something interesting to me is that the top languages/tools were, in order, Python, R, and SQL. For the particular market niche that Kaggle competitions fit, that makes a lot of sense: I tend to like R more for data exploration and data cleansing, but much of that work is already done by the time you get the dataset.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31