The Microsoft Team Data Science Process Lifecycle Versus CRISP-DM

Melody Zacharias compares Microsoft’s Team Data Science Process lifecycle with the CRISP-DM process:

As I pointed out in my previous blog, the TDSP lifecycle is made up of five iterative stages:

  1. Business Understanding
  2. Data Acquisition and Understanding
  3. Modeling
  4. Deployment
  5. Customer Acceptance

This is not very different from the six major phases used by the Cross Industry Standard Process for Data Mining (“CRISP-DM”).

This is part of a series on data science that Melody is putting together, so check it out.

Exploratory Analysis With Hockey Data In Power BI

Stacia Varga digs into her hockey data set a bit more:

Once I know whether a variable is numerical or categorical, I can compute statistics appropriately. I’ll be delving into additional types of statistics later, but the very first, simplest statistics that I want to review are:

  • Counts for a categorical variable
  • Minimum and maximum values in addition to mean and median for a numerical value

To handle my initial analysis of the categorical variables, I can add new measures to the modelto compute the count using a DAX formula like this, since each row in the games table is unique:

Game Count = countrows(games)

It’s interesting seeing Stacia use Power BI for exploratory analysis.  My personal preference would definitely be to dump the data into R, but there’s more than one way to analyze a data set.


John Mount explains the vtreat package that he and Nina Zumel have put together:

When attempting predictive modeling with real-world data you quicklyrun into difficulties beyond what is typically emphasized in machine learning coursework:

  • Missing, invalid, or out of range values.
  • Categorical variables with large sets of possible levels.
  • Novel categorical levels discovered during test, cross-validation, or model application/deployment.
  • Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
  • Nested model bias poisoning results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, which are often ignored leading to degraded performance in production.

vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.

That’s immediately going onto my learn-more list.

Wrapping Up A Data Science Project

I have finished my series on launching a data science project.  First, I have a post on deploying models as microservices:

The other big shift is a shift away from single, large services which try to solve all of the problems.  Instead, we’ve entered the era of the microservice:  a small service dedicated to providing a single answer to a single problem.  A microservice architecture lets us build smaller applications geared toward solving the domain problem rather than trying to solve the integration problem.  Although you can definitely configure other forms of interoperation, most microservices typically are exposed via web calls and that’s the scenario I’ll discuss today.  The biggest benefit to setting up a microservice this way is that I can write my service in R, you can call it from your Python service, and then some .NET service could call yours, and nobody cares about the particular languages used because they all speak over a common, known protocol.

One concern here is that you don’t want to waste your analysts time learning how to build web services, and that’s where data science workbenches and deployment tools like DeployRcome into play.  These make it easier to deploy scalable predictive services, allowing practitioners to build their R scripts, push them to a service, and let that service host the models and turn function calls into API calls automatically.

But if you already have application development skills on your team, you can make use of other patterns.  Let me give two examples of patterns that my team has used to solve specific problems.

Then, I talk about the iterative nature of post-deployment life:

At this point in the data science process, we’ve launched a product into production.  Now it’s time to kick back and hibernate for two months, right?  Yeah, about that…

Just because you’ve got your project in production doesn’t mean you’re done.  First of all, it’s important to keep checking the efficacy of your models.  Shift happens, where a model might have been good at one point in time but becomes progressively worse as circumstances change.  Some models are fairly stable, where they can last for years without significant modification; others have unstable underlying trends, to the point that you might need to retrain such a model continuously.  You might also find out that your training and testing data was not truly indicative of real-world data, especially that the real world is a lot messier than what you trained against.

The best way to guard against unbeknownst model shift is to take new production data and retrain the model.  This works best if you can keep track of your model’s predictions versus actual outcomes; that way, you can tell the actual efficacy of the model, figuring out how frequently and by how much your model was wrong.

This was a fun series to write and will be interesting to come back to in a couple of years to see how much I disagree with the me of now.

XGBoost With Python

Fisseha Berhane looked at Extreme Gradient Boosting with R and now covers it in Python:

In both R and Python, the default base learners are trees (gbtree) but we can also specify gblinear for linear models and dart for both classification and regression problems.
In this post, I will optimize only three of the parameters shown above and you can try optimizing the other parameters. You can see the list of parameters and their details from the website.

It’s hard to overstate just how valuable XGBoost is as an algorithm.

Experimenting With The Data Professional Salary Survey

Mala Mahadevan investigates a potential correlation in the data professional salary survey:

The questions I was looking at are as below:
1 Is there any correlation between experience and number of hours worked?
2 Is there any correlation between experience and job duties/kinds of tasks performed?
3 Is there any correlation between experience and managing staff – ie – do more people with experience take to management as a form of progress?

I am using this blog post to explore question 1.

Click through to see if there is a correlation between experience and hours worked.  One critique I have is that years of experience is not normally distributed:  there’s a hard cutoff at 0, so although the possible range does follow what a hypothetical normal distribution would do (and it doesn’t really affect the analysis Mala did), that difference can be important in other analyses.

Disagreement On Outliers

Antony Unwin reviews how various packages track outliers using the Overview of Outliers plot in R:

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Interesting reading.

Data Modeling And Neural Networks

I have two new posts in my launching a data science project series.  The first one covers data modeling theory:

Wait, isn’t self-supervised learning just a subset of supervised learning?  Sure, but it’s pretty useful to look at on its own.  Here, we use heuristics to guesstimate labels and train the model based on those guesstimates.  For example, let’s say that we want to train a neural network or Markov chain generator to read the works of Shakespeare and generate beautiful prose for us.  The way the recursive model would work is to take what words have already been written and then predict the most likely next word or punctuation character.

We don’t have “labeled” data within the works of Shakespeare, though; instead, our training data’s “label” is the next word in the play or sonnet.  So we train our model based on the chains of words, treating the problem as interdependent rather than a bunch of independent words just hanging around.

Then, we implement a data model using a neural network:

At this point, I want to build the Keras model. I’m creating a build_model function in case I want to run this over and over. In a real-life scenario, I would perform various optimizations, do cross-validation, etc. In this scenario, however, I am just going to run one time against the full training data set, and then evaluate it against the test data set.

Inside the function, we start by declaring a Keras model. Then, I add three layers to the model. The first layer is a dense (fully-connected) layer which accepts the training data as inputs and uses the Rectified Linear Unit (ReLU) activation mechanism. This is a decent first guess for activation mechanisms. We then have a dropout layer, which reduces the risk of overfitting on the training data. Finally, I have a dense layer for my output, which will give me the salary.

I compile the model using the RMSProp optimizer. This is a good default optimizer for neural networks, although you might try AdagradAdam, or AdaMax as well. Our loss function is Mean Squared Error, which is good for dealing with finding the error in a regression. Finally, I’m interested in the Mean Absolute Error–that is, the dollar amount difference between our function’s prediction and the actual salary. The closer to $0 this is, the better.

Click through for those two posts, including seeing how close I get to a reasonable model with my neural network.

Be Wary Of Colliders When Analyzing Data

Keith Goldfeld has an interesting demonstration of a collider variable and how it can lead us to incorrect conclusions during analysis:

In this (admittedly thoroughly made-up though not entirely implausible) network diagram, the test score outcome is a collider, influenced by a test preparation class and socio-economic status (SES). In particular, both the test prep course and high SES are related to the probability of having a high test score. One might expect an arrow of some sort to connect SES and the test prep class; in this case, participation in test prep is randomized so there is no causal link (and I am assuming that everyone randomized to the class actually takes it, a compliance issue I addressed in a series of posts starting with this one.)

The researcher who carried out the randomization had a hypothesis that test prep actually is detrimental to college success down the road, because it de-emphasizes deep thinking in favor of wrote memorization. In reality, it turns out that the course and subsequent college success are not related, indicated by an absence of a connection between the course and the long term outcome.

Read the whole thing.  H/T R-Bloggers

XGBoost In R

Fisseha Berhane explains how to implement Extreme Gradient Boosting in R:

What makes it so popular are its speed and performance. It gives among the best performances in many machine learning applications. It is optimized gradient-boosting machine learning library. The core algorithm is parallelizable and hence it can use all the processing power of your machine and the machines in your cluster. In R, according to the package documentation, since the package can automatically do parallel computation on a single machine, it could be more than 10 times faster than existing gradient boosting packages.

xgboost shines when we have lots of training data where the features are numeric or a mixture of numeric and categorical fields. It is also important to note that xgboost is not the best algorithm out there when all the features are categorical or when the number of rows is less than the number of fields (columns).

xgboost is a nice complement to neural networks, as they tend to be great at different things.


March 2018
« Feb