
Category: Data Science

Pandas Basics

Kevin Jacobs has a tutorial on Python’s Pandas library:

There are a few things worth mentioning. Often, Pandas is abbreviated as pd (like Numpy which is often abbreviated as np). If you look at other code, you will see that DataFrames are often abbreviated by df. Here, the DataFrame is constructed using data from a list of lists. The columns argument specifies the keys of the data.

This is a high-level intro, but helps you get your feet wet if you’ve not played with the library.
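To make the quoted description concrete, here is a minimal sketch of building a DataFrame from a list of lists. The rows and column names are made up for illustration and are not taken from Kevin's tutorial.

```python
import pandas as pd

# Hypothetical rows: each inner list is one record.
rows = [
    ["Alice", 34],
    ["Bob", 28],
    ["Carol", 45],
]

# The columns argument names each position in the inner lists.
df = pd.DataFrame(rows, columns=["name", "age"])

print(df.shape)          # number of rows and columns
print(df["age"].mean())  # columns are addressed by the names given above
```

Without the columns argument, pandas falls back to integer column labels, which is why most examples pass it explicitly.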

Comments closed

Housing Prices In Ames, Iowa: A Kaggle Competition

Kathryn Bryant and M. Aaron Owen share their Kaggle experiences.  First, Kathryn, et al:

The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and the results of our individual models alerted us to areas in data cleaning and feature engineering that needed improvement. We used root mean squared error (RMSE) of log Sale Price to evaluate model fit as this was the metric used by Kaggle to evaluate submitted models.

Data cleaning, EDA, feature engineering, and private train/test splitting (and one spline model!) were all done in R but we used Python for individual model training and ensembling/stacking. Using R and Python in these ways worked well, but the decision to split work in this manner was driven more by timing than anything else.
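The metric mentioned in the quote, RMSE of log Sale Price, can be sketched in a few lines. This is my own illustration with made-up prices, not the team's code, and the function name rmse_log is hypothetical.

```python
import numpy as np

def rmse_log(actual, predicted):
    """Root mean squared error between log(actual) and log(predicted)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    log_error = np.log(actual) - np.log(predicted)
    return float(np.sqrt(np.mean(log_error ** 2)))

# Made-up sale prices and predictions, purely for illustration.
actual_prices = [200000.0, 150000.0, 320000.0]
predicted_prices = [210000.0, 140000.0, 300000.0]
score = rmse_log(actual_prices, predicted_prices)
print(score)
```

Working on the log scale means the error depends on the ratio between prediction and actual price, so a 10% miss on a cheap house counts the same as a 10% miss on an expensive one.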

Then, Aaron, et al, share their process and findings:

Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable, Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors regarding the price of the house. Therefore, dropping these variables seems ill-advised.

Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage for the neighborhood in which the house with missing value was located.
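The neighborhood-median imputation described above might look something like this in pandas. It is a sketch under my own assumptions: the tiny data frame is invented, and the column names Neighborhood and LotFrontage are chosen for illustration rather than taken from the authors' code (which, per Kathryn's write-up, did its cleaning in R).

```python
import numpy as np
import pandas as pd

# Invented sample: two neighborhoods, each with one missing Lot Frontage.
houses = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown", "OldTown"],
    "LotFrontage": [70.0, np.nan, 80.0, 50.0, 60.0, np.nan],
})

# transform("median") returns one value per row: the median of that row's
# neighborhood, computed over the non-missing entries.
neighborhood_median = houses.groupby("Neighborhood")["LotFrontage"].transform("median")

# Only the missing entries are replaced; observed values are left alone.
houses["LotFrontage"] = houses["LotFrontage"].fillna(neighborhood_median)
print(houses)
```

A neighborhood with no observed values at all would still come back as missing, so a real pipeline needs a fallback such as the overall median.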

This is the major upside to Kaggle:  it gives you the ability to work in a controlled environment with real data sets, which include real data problems.  Yeah, the data’s much cleaner than you’d experience in production pretty much anywhere, but that lets you practice technique with a relatively low barrier to entry.  H/T R-Bloggers (Kathryn | Aaron)


Picking A Python IDE

Kevin Jacobs reviews a few Python IDEs from the perspective of a data scientist:

Ladies and gentlemen, this is one of the most perfect IDEs for editing your Python code! At least in my opinion. Jupyter notebook is a web-based code editor and can quickly generate visualizations. You can mix up code and text containing no, simple or complex mathematics. One thing I am missing here is the support for code completion, but there are tons of plugins available so this should be no problem. It is also easy to turn your notebook into a presentation. For collaboration with non-technical teams, this is a great tool.

Conclusion: perfect Python IDE for data science! Less support for code inspection.

Click through for reviews of three IDEs.


Handling Imbalanced Data

Tom Fawcett shows us how to handle a tricky classification problem:

The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue.

Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. Here are some examples:

  1. About 2% of credit card accounts are defrauded per year. (Most fraud detection domains are heavily imbalanced.)
  2. Medical screening for a condition is usually performed on a large population of people without the condition, to detect a small minority with it (e.g., HIV prevalence in the USA is ~0.4%).
  3. Disk drive failures are approximately 1% per year.
  4. The conversion rates of online ads have been estimated to lie between 10⁻³ and 10⁻⁶.
  5. Factory production defect rates typically run about 0.1%.

Many of these domains are imbalanced because they are what I call needle in a haystack problems, where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases.

Read on for some good advice on how to handle imbalanced data.
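As one illustration of the kind of thing you can do with data like this, here is a minimal sketch of random undersampling of the majority class. I am assuming this technique for the sake of example; it is a common approach but I am not presenting it as Tom's recommendation, and the data is synthetic with roughly 2% positives to mirror the fraud figure above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: about 2% positives, three meaningless features.
n = 10_000
y = (rng.random(n) < 0.02).astype(int)
X = rng.normal(size=(n, 3))

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Keep every positive and an equally sized random subset of the negatives.
keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, keep_neg])
rng.shuffle(keep)

X_balanced, y_balanced = X[keep], y[keep]
print(len(y), y.mean())                    # original size and positive rate
print(len(y_balanced), y_balanced.mean())  # balanced size and positive rate
```

The trade-off is that most of the negative examples get thrown away, which is cheap when negatives are plentiful but does lose information.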


Interpreting P-Value Histograms

David Robinson visualizes and interprets different p-value histograms:

So you’re a scientist or data analyst, and you have a little experience interpreting p-values from statistical tests. But then you come across a case where you have hundreds, thousands, or even millions of p-values. Perhaps you ran a statistical test on each gene in an organism, or on demographics within each of hundreds of counties. You might have heard about the dangers of multiple hypothesis testing before. What’s the first thing you do?

Make a histogram of your p-values. Do this before you perform multiple hypothesis test correction, false discovery rate control, or any other means of interpreting your many p-values. Unfortunately, for some reason, this basic and simple task rarely gets recommended (for instance, the Wikipedia page on the multiple comparisons problem never once mentions this approach). This graph lets you get an immediate sense of how your test behaved across all your hypotheses, and immediately diagnose some potential problems. Here, I’ll walk you through a basic example of interpreting a p-value histogram.

It’s a fun read and informative as well.
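If you want to try the idea without a real data set, here is a small sketch that simulates p-values and prints a text histogram. The mix of 9,000 null and 1,000 non-null tests and the beta distribution used for the non-null p-values are assumptions I picked for illustration; they do not come from David's post.

```python
import numpy as np

rng = np.random.default_rng(42)

# Under a true null hypothesis, p-values are uniform on [0, 1].
null_p = rng.uniform(0.0, 1.0, size=9000)
# Assumed stand-in for real effects: p-values concentrated near zero.
alt_p = rng.beta(0.5, 10.0, size=1000)
p_values = np.concatenate([null_p, alt_p])

# Twenty bins of width 0.05, so the first bin is the usual p < 0.05 region.
counts, edges = np.histogram(p_values, bins=20, range=(0.0, 1.0))

for left, count in zip(edges[:-1], counts):
    print(f"{left:4.2f}-{left + 0.05:4.2f} {'#' * (count // 50)}")
```

The shape to look for is a flat floor from the null tests with a spike at the left edge from the tests where something is going on.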


The Magic Of Sampling

Nathan LeClaire reminds us of an important story that statisticians have been telling us for a couple of centuries:

It starts slowly. Maybe your home-grown centralized logging cluster becomes more difficult to operate, demanding unholy amounts of engineer time every week. Maybe engineers start to find that making a query about production is a “go get a coffee and come back later” activity. Or maybe monitoring vendors offer you a quote that elicits a response ranging anywhere from curses under the breath to blood-curdling screams of terror.

The multi-headed beast we know as Scale has reared its ugly visage.

As some of you may have already guessed from the title, I’m going to discuss one way to solve this problem, and why it might not be as bad as you might think.

Take some of your precious information and throw it in the garbage. In lots of cases, you can just drop those writes on the floor as long as your observability stack is equipped to handle it.

In other words, sample.

Read on for a couple of methods.  One thing I’ve taken a fancy to is collecting the first N of a particular type of message and keeping track of how often that message appears.  If you get the same error for every row in a file, then you might really only need to see that one time and the number of times it happened.  Or maybe you want to see a few of them to ensure that they’re really the same error and not two separate errors which are getting reported together due to insufficient error separation.
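Here is a rough sketch of that first-N-plus-count idea. The class name FirstNSampler and the message formats are hypothetical, and this is my own approach rather than one of the methods from Nathan's post.

```python
from collections import Counter, defaultdict

class FirstNSampler:
    """Keep the first N messages of each type and count every occurrence."""

    def __init__(self, keep_first=3):
        self.keep_first = keep_first
        self.counts = Counter()
        self.samples = defaultdict(list)

    def record(self, message_type, message):
        # Always count, but only store the message while under the limit.
        self.counts[message_type] += 1
        if len(self.samples[message_type]) < self.keep_first:
            self.samples[message_type].append(message)

sampler = FirstNSampler(keep_first=2)
for row in range(1, 1001):
    sampler.record("ParseError", f"row {row}: unexpected delimiter")
sampler.record("Timeout", "upstream took 31s")

for message_type, count in sampler.counts.items():
    print(message_type, count, sampler.samples[message_type])
```

Memory grows with the number of distinct message types rather than the number of messages, which is the whole point.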


Quickly Computing Area Under The Curve

Jean-Francois Puget has a fast method for computing Area Under the Curve in Python:

When the target only takes two values we have a binary classification problem at hand. Examples of binary classification are very common. For instance fraud detection where examples are credit card transactions, features are time, location, amount, merchant id, etc., and target is fraud or not fraud. Spam detection is also a binary classification where examples are emails, features are the email content as a string of words, and target is spam or not spam. Without loss of generality we can assume that the target values are 0 and 1, for instance 0 means no fraud or no spam, while 1 means fraud or spam.

For binary classification, predictions are also binary. Therefore, a prediction is either equal to the target, or is off the mark. A simple way to evaluate model performance is accuracy: how many predictions are right? For instance, if our test set has 100 examples in it, how many times is the prediction correct? Accuracy seems a logical way to evaluate performance: a higher accuracy obviously means a better model. At least this is what people think when they are exposed for the first time to binary classification problems. The issue is that accuracy can be extremely misleading.

Read Jean-Francois’ explanation and scroll down for the Python sample.
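To see why accuracy misleads, here is a small sketch that compares accuracy with AUC for a model that never flags anything. To be clear, this is not Jean-Francois' fast method: it is a plain pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly, with ties counting half), it needs memory proportional to positives times negatives, and the name pairwise_auc is my own.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# 100 examples with 2 positives, scored by a model that always says "negative".
y_true = np.array([0] * 98 + [1] * 2)
always_negative = np.zeros(100)

accuracy = float(np.mean((always_negative > 0.5).astype(int) == y_true))
auc = pairwise_auc(y_true, always_negative)
print(accuracy, auc)
```

The do-nothing model gets 98% accuracy but an AUC of 0.5, which is what you would get from random guessing.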


TensorFlow Tutorial

Ashish Bakshi has a TensorFlow tutorial:

As shown in the image above, tensors are just multidimensional arrays that allow you to represent data having higher dimensions. In general, in Deep Learning you deal with high dimensional data sets where dimensions refer to different features present in the data set. In fact, the name “TensorFlow” has been derived from the operations which neural networks perform on tensors. It’s literally a flow of tensors. Since you have understood what tensors are, let us move ahead in this TensorFlow tutorial and understand – what is TensorFlow?

The sample here is Python, though there is an R library as well.
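For a feel of "tensors are just multidimensional arrays" without installing TensorFlow, here is a sketch using NumPy arrays as a stand-in. That substitution is my assumption for illustration; Ashish's tutorial uses TensorFlow itself, and the image batch shape below is invented.

```python
import numpy as np

# The number of axes (ndim) is what tensor terminology calls the rank.
scalar = np.array(5.0)                       # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])           # rank 1: one axis
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # rank 2: rows and columns
# Invented example: 10 RGB images of 28x28 pixels (batch, height, width, channels).
images = np.zeros((10, 28, 28, 3))           # rank 4

for name, tensor in [("scalar", scalar), ("vector", vector),
                     ("matrix", matrix), ("images", images)]:
    print(name, tensor.ndim, tensor.shape)
```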


Creating A Poker AI In Python

Kevin Jacobs has a fairly simple framework for building poker-playing bots:

The bot uses Monte Carlo simulations running from a given state. Suppose you start with 2 high cards (two Kings for example), then the chances are high that you will win. The Monte Carlo simulation then simulates a given number of games from that point and evaluates which percentage of games you will win given these cards. If another King shows during the flop, then your chance of winning will increase. The Monte Carlo simulation starting at that point, will yield a higher winning probability since you will win more games on average.

If we run the simulations, you can see that the bot based on Monte Carlo simulations outperforms the always-calling bot. If you start with a stack of $100,-, you will on average end with a stack of $120,- (when playing against the always-calling bot).

It’s a start, and an opening for more sophisticated logic and analysis.
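For a taste of Monte Carlo simulation from a given state, here is a deliberately tiny sketch. It does not evaluate poker hands or estimate win probability the way Kevin's bot does; it only estimates how often a third King shows up on the flop when you hold two Kings, and compares that with the exact answer. The function name and card encoding are my own.

```python
import math
import random

RANKS = list(range(2, 15))  # 11 = Jack, 12 = Queen, 13 = King, 14 = Ace
SUITS = ["clubs", "diamonds", "hearts", "spades"]
KING = 13

def estimate_king_on_flop(trials=50_000, seed=0):
    """Estimate P(at least one King on the flop) when holding two Kings."""
    rng = random.Random(seed)
    hole_cards = [(KING, "hearts"), (KING, "spades")]
    # The 50 cards we cannot see: the full deck minus our two hole cards.
    deck = [(r, s) for r in RANKS for s in SUITS if (r, s) not in hole_cards]
    hits = 0
    for _ in range(trials):
        flop = rng.sample(deck, 3)  # deal three cards without replacement
        if any(rank == KING for rank, suit in flop):
            hits += 1
    return hits / trials

estimate = estimate_king_on_flop()
# Exact value: one minus the chance that all three flop cards are non-Kings.
exact = 1 - math.comb(48, 3) / math.comb(50, 3)
print(estimate, exact)
```

The same loop structure extends to win probability once you add opponent hands and a hand evaluator, which is where the real work is.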


The Importance Of Distributions

Jocelyn Barker explains distributions using role-playing games as an example:

We see that for the entire curve, our odds of success go down when we add criticals and for most of the curve, they go up for 3z8. Let's think about why. We know the guards are more likely to roll a 20 and less likely to roll a 1 from the distribution we made earlier. This happens about 14% of the time, which is pretty common, and when it happens, the rogue has to have a very high modifier and still roll well to overcome it unless they also roll a 20. On the other hand, with the 3z8 system, criticals are far less common and everyone rolls close to average more of the time. The expected value for the rogue is ~10.5, whereas it is ~14 for the guards, so when everyone performs close to average, the rogue only needs a small modifier to have a reasonable chance of success.

It’s a nice spin on a classic statistics lesson.
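Here is a sketch comparing a single d20 with a 3z8 roll. I am assuming that 3z8 means the sum of three eight-sided dice numbered 0 through 7, which matches the ~10.5 expected value in the quote but is not spelled out in the excerpt, so treat it as an assumption. The code is mine, not Jocelyn's, and the helper names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def distribution(dice):
    """Exact distribution of the sum of the given dice (each a range of faces)."""
    outcomes = Counter(sum(faces) for faces in product(*dice))
    total = sum(outcomes.values())
    return {value: Fraction(count, total) for value, count in outcomes.items()}

def mean(dist):
    return sum(value * prob for value, prob in dist.items())

def variance(dist):
    m = mean(dist)
    return sum((value - m) ** 2 * prob for value, prob in dist.items())

d20 = distribution([range(1, 21)])          # one die, faces 1 to 20
three_z8 = distribution([range(0, 8)] * 3)  # assumed: three dice, faces 0 to 7

print(mean(d20), variance(d20), d20[20])
print(mean(three_z8), variance(three_z8), three_z8[21])
```

Both systems have the same expected roll, but the summed dice have less than half the variance and the top result is far rarer, which lines up with the quote's point that everyone rolls close to average more of the time.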
