Category: Data Science

The Data Exploration Process

Published 2018-03-27 by Kevin Feasel

Stacia Varga takes a step back from analyzing NHL data to explore it a little more:

As I mentioned in my last post, I am currently in an exploratory phase with my data analytics project. Although I would love to dive in and do some cool predictive analytics or machine learning projects, I really need to continue learning as much about my data as possible before diving into more advanced techniques.

My data exploration process has the following four steps:

Assess the data that I have at a high level
Determine how this data is relevant to the analytics project I want to undertake
Get a general overview of the data characteristics by calculating simple statistics
Understand the “middles” and the “ends” of your numeric data points

There’s some good stuff in here. I particularly appreciate Stacia’s consideration of data exploration as an iterative process.

Comments closed

Multi-Class Text Classification In Python

Published 2018-03-22 by Kevin Feasel

Susan Li has a series on multi-class text classification in Python. First up is analysis with PySpark:

Our task is to classify San Francisco Crime Description into 33 pre-defined categories. The data can be downloaded from Kaggle.

Given a new crime description comes in, we want to assign it to one of 33 categories. The classifier makes the assumption that each new crime description is assigned to one and only one category. This is multi-class text classification problem.

* Input: Descript

* Example: “STOLEN AUTOMOBILE”

* Output: Category

* Example: VEHICLE THEFT

To solve this problem, we will use a variety of feature extraction technique along with different supervised machine learning algorithms in Spark. Let’s get started!

Then, she looks at multi-class text classification with scikit-learn:

The classifiers and learning algorithms can not directly process the text documents in their original form, as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.

One common approach for extracting features from the text is to use the bag of words model: a model where for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.

Specifically, for each term in our dataset, we will calculate a measure called Term Frequency, Inverse Document Frequency, abbreviated to tf-idf.

This is a nice pair of articles on the topic. Natural Language Processing (and dealing with text in general) is one place where Python is well ahead of R in terms of functionality and ease of use.

Comments closed

The Microsoft Team Data Science Process Lifecycle Versus CRISP-DM

Published 2018-03-20 by Kevin Feasel

Melody Zacharias compares Microsoft’s Team Data Science Process lifecycle with the CRISP-DM process:

As I pointed out in my previous blog, the TDSP lifecycle is made up of five iterative stages:

Business Understanding

Data Acquisition and Understanding

Modeling

Deployment

Customer Acceptance

This is not very different from the six major phases used by the Cross Industry Standard Process for Data Mining (“CRISP-DM”).

This is part of a series on data science that Melody is putting together, so check it out.

Comments closed

Exploratory Analysis With Hockey Data In Power BI

Published 2018-03-20 by Kevin Feasel

Stacia Varga digs into her hockey data set a bit more:

Once I know whether a variable is numerical or categorical, I can compute statistics appropriately. I’ll be delving into additional types of statistics later, but the very first, simplest statistics that I want to review are:

Counts for a categorical variable

Minimum and maximum values in addition to mean and median for a numerical value

To handle my initial analysis of the categorical variables, I can add new measures to the modelto compute the count using a DAX formula like this, since each row in the games table is unique:

Game Count = countrows(games)

It’s interesting seeing Stacia use Power BI for exploratory analysis. My personal preference would definitely be to dump the data into R, but there’s more than one way to analyze a data set.

Comments closed

vtreat

Published 2018-03-16 by Kevin Feasel

John Mount explains the vtreat package that he and Nina Zumel have put together:

When attempting predictive modeling with real-world data you quicklyrun into difficulties beyond what is typically emphasized in machine learning coursework:

Missing, invalid, or out of range values.

Categorical variables with large sets of possible levels.

Novel categorical levels discovered during test, cross-validation, or model application/deployment.

Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).

Nested model bias poisoning results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, which are often ignored leading to degraded performance in production.

vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.

That’s immediately going onto my learn-more list.

Comments closed

Wrapping Up A Data Science Project

Published 2018-03-15 by Kevin Feasel

I have finished my series on launching a data science project. First, I have a post on deploying models as microservices:

The other big shift is a shift away from single, large services which try to solve all of the problems. Instead, we’ve entered the era of the microservice: a small service dedicated to providing a single answer to a single problem. A microservice architecture lets us build smaller applications geared toward solving the domain problem rather than trying to solve the integration problem. Although you can definitely configure other forms of interoperation, most microservices typically are exposed via web calls and that’s the scenario I’ll discuss today. The biggest benefit to setting up a microservice this way is that I can write my service in R, you can call it from your Python service, and then some .NET service could call yours, and nobody cares about the particular languages used because they all speak over a common, known protocol.

One concern here is that you don’t want to waste your analysts time learning how to build web services, and that’s where data science workbenches and deployment tools like DeployRcome into play. These make it easier to deploy scalable predictive services, allowing practitioners to build their R scripts, push them to a service, and let that service host the models and turn function calls into API calls automatically.

But if you already have application development skills on your team, you can make use of other patterns. Let me give two examples of patterns that my team has used to solve specific problems.

Then, I talk about the iterative nature of post-deployment life:

At this point in the data science process, we’ve launched a product into production. Now it’s time to kick back and hibernate for two months, right? Yeah, about that…

Just because you’ve got your project in production doesn’t mean you’re done. First of all, it’s important to keep checking the efficacy of your models. Shift happens, where a model might have been good at one point in time but becomes progressively worse as circumstances change. Some models are fairly stable, where they can last for years without significant modification; others have unstable underlying trends, to the point that you might need to retrain such a model continuously. You might also find out that your training and testing data was not truly indicative of real-world data, especially that the real world is a lot messier than what you trained against.

The best way to guard against unbeknownst model shift is to take new production data and retrain the model. This works best if you can keep track of your model’s predictions versus actual outcomes; that way, you can tell the actual efficacy of the model, figuring out how frequently and by how much your model was wrong.

This was a fun series to write and will be interesting to come back to in a couple of years to see how much I disagree with the me of now.

Comments closed

XGBoost With Python

Published 2018-03-13 by Kevin Feasel

Fisseha Berhane looked at Extreme Gradient Boosting with R and now covers it in Python:

In both R and Python, the default base learners are trees (gbtree) but we can also specify gblinear for linear models and dart for both classification and regression problems.
In this post, I will optimize only three of the parameters shown above and you can try optimizing the other parameters. You can see the list of parameters and their details from the website.

It’s hard to overstate just how valuable XGBoost is as an algorithm.

Comments closed

Experimenting With The Data Professional Salary Survey

Published 2018-03-13 by Kevin Feasel

Mala Mahadevan investigates a potential correlation in the data professional salary survey:

The questions I was looking at are as below:
1 Is there any correlation between experience and number of hours worked?
2 Is there any correlation between experience and job duties/kinds of tasks performed?
3 Is there any correlation between experience and managing staff – ie – do more people with experience take to management as a form of progress?

I am using this blog post to explore question 1.

Click through to see if there is a correlation between experience and hours worked. One critique I have is that years of experience is not normally distributed: there’s a hard cutoff at 0, so although the possible range does follow what a hypothetical normal distribution would do (and it doesn’t really affect the analysis Mala did), that difference can be important in other analyses.

Comments closed

Disagreement On Outliers

Published 2018-03-09 by Kevin Feasel

Antony Unwin reviews how various packages track outliers using the Overview of Outliers plot in R:

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Interesting reading.

Comments closed

Data Modeling And Neural Networks

Published 2018-03-09 by Kevin Feasel

I have two new posts in my launching a data science project series. The first one covers data modeling theory:

Wait, isn’t self-supervised learning just a subset of supervised learning? Sure, but it’s pretty useful to look at on its own. Here, we use heuristics to guesstimate labels and train the model based on those guesstimates. For example, let’s say that we want to train a neural network or Markov chain generator to read the works of Shakespeare and generate beautiful prose for us. The way the recursive model would work is to take what words have already been written and then predict the most likely next word or punctuation character.

We don’t have “labeled” data within the works of Shakespeare, though; instead, our training data’s “label” is the next word in the play or sonnet. So we train our model based on the chains of words, treating the problem as interdependent rather than a bunch of independent words just hanging around.

Then, we implement a data model using a neural network:

At this point, I want to build the Keras model. I’m creating a build_model function in case I want to run this over and over. In a real-life scenario, I would perform various optimizations, do cross-validation, etc. In this scenario, however, I am just going to run one time against the full training data set, and then evaluate it against the test data set.

Inside the function, we start by declaring a Keras model. Then, I add three layers to the model. The first layer is a dense (fully-connected) layer which accepts the training data as inputs and uses the Rectified Linear Unit (ReLU) activation mechanism. This is a decent first guess for activation mechanisms. We then have a dropout layer, which reduces the risk of overfitting on the training data. Finally, I have a dense layer for my output, which will give me the salary.

I compile the model using the RMSProp optimizer. This is a good default optimizer for neural networks, although you might try Adagrad, Adam, or AdaMax as well. Our loss function is Mean Squared Error, which is good for dealing with finding the error in a regression. Finally, I’m interested in the Mean Absolute Error–that is, the dollar amount difference between our function’s prediction and the actual salary. The closer to $0 this is, the better.

Click through for those two posts, including seeing how close I get to a reasonable model with my neural network.

Comments closed