Reproducibility And ML Projects

Pete Warden explains some of the difficulties around reproducing ML models:

Why does this all matter? I’ve had several friends contact me about their struggles reproducing published models as baselines for their own papers. If they can’t get the same accuracy that the original authors did, how can they tell if their new approach is an improvement? It’s also clearly concerning to rely on models in production systems if you don’t have a way of rebuilding them to cope with changed requirements or platforms. At that point your model moves from being a high-interest credit card of technical debt to something more like what a loan-shark offers. It’s also stifling for research experimentation; since making changes to code or training data can be hard to roll back it’s a lot more risky to try different variations, just like coding without source control raises the cost of experimenting with changes.

It’s not all doom and gloom, there are some notable efforts around reproducibility happening in the community. One of my favorites is the TensorFlow Benchmarks project Toby Boyd’s leading. He’s made it his team’s mission not only to lay out exactly how to train some of the leading models from scratch with high training speed on a lot of different platforms, but also ensures that the models train to the expected accuracy. I’ve seen him sweat blood trying to get models up to that precision, since variations in any of the steps I listed above can affect the results and there’s no easy way to debug what the underlying cause is, even with help from the authors. It’s also a never-ending job, since changes in TensorFlow, in GPU drivers, or even datasets, can all hurt accuracy in subtle ways. By doing this work, Toby’s team helps us spot and fix bugs caused by changes in TensorFlow in the models they cover, and chase down issues caused by external dependencies, but it’s hard to scale beyond a comparatively small set of platforms and models.

I see two separate problems:  reproducing the process and reproducing the result.  Reproducing the process is why you want to use something like notebooks:  it’s a proof that you (and others!) can generate the same type of model the same way multiple times.  Reproducing the result is harder given the stochastic nature of ML, but if you’re following the same process, you’re at least more likely to end up close to the same result.

XGBoost With Python

Fisseha Berhane looked at Extreme Gradient Boosting with R and now covers it in Python:

In both R and Python, the default base learners are trees (gbtree) but we can also specify gblinear for linear models and dart for both classification and regression problems.
In this post, I will optimize only three of the parameters shown above and you can try optimizing the other parameters. You can see the list of parameters and their details from the website.

It’s hard to overstate just how valuable XGBoost is as an algorithm.

Calling Azure Cognitive Services From SSIS

Rolf Tesmer shows off how easy it is to call Azure Cognitive Services from SQL Server Integration Services:

My SQL SSIS package leverages the Translator Text API service.  For those who want to learn the secret sauce then I suggest to check here –

essentially this API is pretty simple;

  1. It accepts source textsource language and target language.  (The API can translate to/from over 60 different languages.)

  2. You call the API with your request parameters + API Key

  3. The API will respond with the language translation of the source text you sent in

  4. So Simple, so fast, so effective!

Click through for the full post.  It really is simple.

Data Modeling And Neural Networks

I have two new posts in my launching a data science project series.  The first one covers data modeling theory:

Wait, isn’t self-supervised learning just a subset of supervised learning?  Sure, but it’s pretty useful to look at on its own.  Here, we use heuristics to guesstimate labels and train the model based on those guesstimates.  For example, let’s say that we want to train a neural network or Markov chain generator to read the works of Shakespeare and generate beautiful prose for us.  The way the recursive model would work is to take what words have already been written and then predict the most likely next word or punctuation character.

We don’t have “labeled” data within the works of Shakespeare, though; instead, our training data’s “label” is the next word in the play or sonnet.  So we train our model based on the chains of words, treating the problem as interdependent rather than a bunch of independent words just hanging around.

Then, we implement a data model using a neural network:

At this point, I want to build the Keras model. I’m creating a build_model function in case I want to run this over and over. In a real-life scenario, I would perform various optimizations, do cross-validation, etc. In this scenario, however, I am just going to run one time against the full training data set, and then evaluate it against the test data set.

Inside the function, we start by declaring a Keras model. Then, I add three layers to the model. The first layer is a dense (fully-connected) layer which accepts the training data as inputs and uses the Rectified Linear Unit (ReLU) activation mechanism. This is a decent first guess for activation mechanisms. We then have a dropout layer, which reduces the risk of overfitting on the training data. Finally, I have a dense layer for my output, which will give me the salary.

I compile the model using the RMSProp optimizer. This is a good default optimizer for neural networks, although you might try AdagradAdam, or AdaMax as well. Our loss function is Mean Squared Error, which is good for dealing with finding the error in a regression. Finally, I’m interested in the Mean Absolute Error–that is, the dollar amount difference between our function’s prediction and the actual salary. The closer to $0 this is, the better.

Click through for those two posts, including seeing how close I get to a reasonable model with my neural network.

Image Recognition Using Viola-Jones

Ellen Talbot lays out some of the basics of image recognition:

Aggregate channel features (ACF) is a variation of channel features, which extracts features directly as pixel values in extended channels without computing rectangular sums at various locations and scales.

Common channels include the colour channels, such as grey-scale and RBG, but many other channels can be encoded, depending on the difficulty of your problem (e.g. gradient magnitude and gradient histograms).

ACF has advantages, such as a richer representation, accelerated detection speed and more accurate localisation of objects in the images when used in conjunction with a boosting method.

Click through for more, including a few resources around the Viola-Jones algorithm.

Microsoft ML Server 9.3 Released

Nagesh Pabbisetty announces Microsoft Machine Learning Server 9.3:

In ML Server 9.3, we have added support for SQL compute context in ML Server and in R Client running on Linux platforms, so data scientists who work on Linux workstations can directly use in-database analytics with SQL Server compute context. Additionally, the SQLRUtils package can now be used to package the R scripts into T-SQL stored procedures and run them from R environment on Linux clients.

An interesting scenario enabled by the addition of SQL Server Compute context in ML Server running on Linux is that organizations can now provide a browser-based interface for accessing SQL Server compute context with R Studio Server and ML Server running on a Linux machine connecting to SQL Server.

Since introducing revoscalepy library in the last release of ML Server and SQL Server 2017, we have shipped several additions and improvements in the Python APIs as part of CU releases of SQL Server 2017. We have added APIs like rx_create_col_info, rx_get_var_info etc. that make it easier to get column information, esp. with large number of columns. We added rx_serialize_model for easy model serialization. We have also improved performance when working with string data in different scenarios.

This also gets you up to R 3.4.3. H/T David Smith

The Whys Of Azure ML Workbench

Ginger Grant explains why Azure Machine Learning Workbench exists:

Microsoft is looking for Azure Machine Learning Workbench for more than a tool to use for Machine Learning analysis. It is part of a system to manage and monitor the deployment of machine learning solutions with Azure Machine Learning Model Management. The management aspects are part of the application installation.  To install the Azure Machine Learning Workbench, the application download is available only by creating an account in Microsoft’s Azure environment, where a Machine Learning Model Management resource will be created as part of the install. Within this resource, you will be directed to create a virtual environment in Azure where you will be deploying and managing Machine Learning models.

This migration into management of machine learning components is part of a pattern first seen on the on-premises version of data science functionality.  First Microsoft helped companies manage the deployment of R code with SQL Server 2016 which includes the ability to move R code into SQL Server.  Providing this capability decreased the time it took to implement a data science solution by providing a means for the code can be deployed easily without the need for the R code to be re-written or included in another application. SQL Server 2017 expanded on this idea by allowing Python code to be deployed into SQL Server as well.  With the cloud service Model Management, Microsoft is hoping to centralize the implementation so that all Machine Learning services created can be managed in one place.

Read on for more.

Markov Chains In Python

Sandipan Dey shows off various uses of Markov chains as well as how to create one in Python:

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov chains are a venerable statistical technique and formed the basis of a lot of text processing (especially text generation) due to the algorithm’s relatively low computational requirements.

Anomaly Detection With Python

Robert Sheldon continues his SQL Server Machine Learning Series:

As important as these concepts are to working Python and MLS, the purpose in covering them was meant only to provide you with a foundation for doing what’s really important in MLS, that is, using Python (or the R language) to analyze data and present the results in a meaningful way. In this article, we start digging into the analytics side of Python by stepping through a script that identifies anomalies in a data set, which can occur as a result of fraud, demographic irregularities, network or system intrusion, or any number of other reasons.

The article uses a single example to demonstrate how to generate training and test data, create a support vector machine (SVM) data model based on the training data, score the test data using the SVM model, and create a scatter plot that shows the scoring results.

Click through to see the scenario that Robert has laid out as an example.

Non-English Natural Language Processing

The folks at BNOSAC have announced a new natural language processing toolkit for R:

BNOSAC is happy to announce the release of the udpipe R package ( which is a Natural Language Processing toolkit that provides language-agnostic ‘tokenization’, ‘parts of speech tagging’, ‘lemmatization’, ‘morphological feature tagging’ and ‘dependency parsing’ of raw text. Next to text parsing, the package also allows you to train annotation models based on data of ‘treebanks’ in ‘CoNLL-U’ format as provided at

The package provides direct access to language models trained on more than 50 languages.

Click through to check it out.


March 2018
« Feb