Wrapping Up A Data Science Project

I have finished my series on launching a data science project.  First, I have a post on deploying models as microservices:

The other big shift is a shift away from single, large services which try to solve all of the problems.  Instead, we’ve entered the era of the microservice:  a small service dedicated to providing a single answer to a single problem.  A microservice architecture lets us build smaller applications geared toward solving the domain problem rather than trying to solve the integration problem.  Although you can definitely configure other forms of interoperation, most microservices typically are exposed via web calls and that’s the scenario I’ll discuss today.  The biggest benefit to setting up a microservice this way is that I can write my service in R, you can call it from your Python service, and then some .NET service could call yours, and nobody cares about the particular languages used because they all speak over a common, known protocol.

One concern here is that you don’t want to waste your analysts time learning how to build web services, and that’s where data science workbenches and deployment tools like DeployRcome into play.  These make it easier to deploy scalable predictive services, allowing practitioners to build their R scripts, push them to a service, and let that service host the models and turn function calls into API calls automatically.

But if you already have application development skills on your team, you can make use of other patterns.  Let me give two examples of patterns that my team has used to solve specific problems.

Then, I talk about the iterative nature of post-deployment life:

At this point in the data science process, we’ve launched a product into production.  Now it’s time to kick back and hibernate for two months, right?  Yeah, about that…

Just because you’ve got your project in production doesn’t mean you’re done.  First of all, it’s important to keep checking the efficacy of your models.  Shift happens, where a model might have been good at one point in time but becomes progressively worse as circumstances change.  Some models are fairly stable, where they can last for years without significant modification; others have unstable underlying trends, to the point that you might need to retrain such a model continuously.  You might also find out that your training and testing data was not truly indicative of real-world data, especially that the real world is a lot messier than what you trained against.

The best way to guard against unbeknownst model shift is to take new production data and retrain the model.  This works best if you can keep track of your model’s predictions versus actual outcomes; that way, you can tell the actual efficacy of the model, figuring out how frequently and by how much your model was wrong.

This was a fun series to write and will be interesting to come back to in a couple of years to see how much I disagree with the me of now.

Related Posts

Tidy Anomaly Detection With Anomalize

Abdul Majed Raja walks us through an example using the anomalize package: One of the important things to do with Time Series data before starting with Time Series forecasting or Modelling is Time Series Decomposition where the Time series data is decomposed into Seasonal, Trend and remainder components. anomalize has got a function time_decompose() to perform the same. […]

Read More

Uploading Data Sets To Azure ML From R

Leila Etaati continues her series on the Azure ML R package by showing how to upload a data set: There is a function in AzureML package name “workspace” that creates a reference to an AzureML Studio workspace by getting the authentication token and workspace id as below: 1 ws <– workspace( id , auth  ) to […]

Read More


March 2018
« Feb Apr »