
Category: Machine Learning

xgboost and Small Numbers of Subtrees

John Mount covers an interesting issue you can run into when using xgboost:

While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation).
In doing that I ran into one more avoidable but strange issue in using xgboost: when run for a small number of rounds it at first appears that xgboost doesn’t get the unconditional average or grand average right (let alone the conditional averages Nina was working with)!

It’s not something you’ll hit very often, but if you’re trying xgboost against a small enough data set with few enough rounds, it is something to keep in mind.
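To see the effect concretely, here is a toy sketch (mine, not Mount’s code): xgboost starts its predictions from a default base_score of 0.5, and with shrinkage applied each round, a couple of rounds cannot move the mean prediction from 0.5 all the way to the grand mean of y.

import numpy as np
import xgboost

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(loc=10.0, size=100)  # grand mean of y is about 10, far from 0.5

# Few rounds: predictions stay biased toward the default base_score of 0.5
model = xgboost.XGBRegressor(n_estimators=2, max_depth=2)
model.fit(X, y)
print(y.mean(), model.predict(X).mean())  # the two means do not match

# Many rounds: the mean prediction closes in on the grand mean
model = xgboost.XGBRegressor(n_estimators=200, max_depth=2)
model.fit(X, y)
print(y.mean(), model.predict(X).mean())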


A Quick Keras Example

Shubham Dangare takes us through a quick example using Keras and TensorFlow in Python:

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. In this blog, we are going to cover one small case study for Fashion MNIST.

Fashion-MNIST is a dataset of Zalando’s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

The end result wasn’t that great, but Shubham was using a sequential model of dense layers rather than a convolutional neural network, so you can probably take this as a starting point and improve upon it.
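If you want a baseline to improve on, a minimal dense-layer Fashion-MNIST model along these lines might look like the following (the hyperparameters here are illustrative, not necessarily Shubham’s):

from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixel values to [0, 1]

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),     # 28x28 image -> 784-vector
    keras.layers.Dense(128, activation="relu"),     # dense hidden layer
    keras.layers.Dense(10, activation="softmax"),   # one probability per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))

Swapping the Dense stack for a few Conv2D/MaxPooling2D layers is the usual next step.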


ML Services and Injectable Code

Grant Fritchey looks at sp_execute_external_script for potential SQL injection vulnerabilities:

The sharp-eyed will see that the data set is defined by SQL. So, does that suffer from injection attacks? Short answer is no. If there is more than one result set within the Python code, it’s going to error out. So you’re protected there.

This is important, because the data set query can be defined with parameters. You can pass values to those parameters, heck, you’re likely to pass values to those parameters, from the external query or procedure. So, is that an attack vector?

No.

Another factor is that you need to explicitly grant EXECUTE ANY EXTERNAL SCRIPT rights to non-sysadmin, non-db_owner users, meaning a non-privileged user can’t execute external scripts at all. You can also limit the executing service account.


An Overview of Convolutional Neural Networks

Beth Ebersole explains what convolutional neural networks are and how they work:

Let’s quickly review neural networks.

Neural networks are universal approximators. This means that with enough neurons and time, a neural network can model any input/output relationship, to any degree of precision.

A standard feed forward neural network receives an input (vector) and feeds it forward through hidden layers to an output. SAS PROC NNET, for example, trains a multilayer perceptron neural network. As the name “multilayer” implies, there are multiple layers. Below we see the inputs (features), one hidden layer and the output (response, target). Each neuron is simply a mathematical function.

This is a complicated topic explained well. It’s also an overview more than a tutorial.
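To make “each neuron is simply a mathematical function” concrete, here is a toy forward pass through one hidden layer in plain numpy (the weights are random placeholders, not a trained model):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)
x = rng.normal(size=4)          # input vector: 4 features
W1 = rng.normal(size=(8, 4))    # hidden layer: 8 neurons, each a function of x
b1 = np.zeros(8)
W2 = rng.normal(size=(1, 8))    # output layer: 1 response
b2 = np.zeros(1)

hidden = relu(W1 @ x + b1)      # each neuron: weighted sum, then a nonlinearity
output = W2 @ hidden + b2       # feed the hidden activations forward
print(output)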


Patterns for ML Models in Production

Jeff Fletcher shows four patterns for productionalizing Machine Learning models, as well as some things to take care of once you’re in production:

Operational Databases
This option is sometimes considered to be real-time as the information is provided “as it’s needed,” but it is still a batch method. Using our telco example, a batch process can be run at night that will make a prediction for each customer, and an operational database is updated with the most recent prediction. The call center agent software can then fetch this prediction for the customer when they call in, and the agent can take action accordingly.

Read on for more.
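As a sketch of that operational-database pattern (the table names, columns, and connection string are all hypothetical), the nightly batch job might look something like this:

import joblib
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://scoring@db/telco")  # placeholder DSN
model = joblib.load("churn_model.pkl")   # model trained earlier, offline

# Score every customer in one nightly batch
customers = pd.read_sql(
    "SELECT customer_id, tenure, monthly_charges FROM customers", engine)
customers["churn_probability"] = model.predict_proba(
    customers[["tenure", "monthly_charges"]])[:, 1]

# Refresh the prediction table; the call-center app looks it up at call time
customers[["customer_id", "churn_probability"]].to_sql(
    "churn_scores", engine, if_exists="replace", index=False)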


Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow:

Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info.

Databricks Runtime 5.3 and 5.3 ML and above support automatic MLflow tracking for MLlib tuning in Python.

With this feature, PySpark CrossValidator and TrainValidationSplit will automatically log to MLflow, organizing runs in a hierarchy and logging hyperparameters and the evaluation metric.  For example, calling CrossValidator.fit() will log one parent run.  Under this run, CrossValidator will log one child run for each hyperparameter setting, and each of those child runs will include the hyperparameter setting and the evaluation metric.  Comparing these runs in the MLflow UI helps with visualizing the effect of tuning each hyperparameter.

Hyperparameter tuning is critical for some of the more complex algorithms like random forests, gradient boosting, and neural networks.
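For reference, a typical PySpark CrossValidator grid search of the kind being auto-logged looks roughly like this (the data and grid values are illustrative); on a supported Databricks runtime, the fit() call below is what would produce one parent run and four child runs:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()
# Tiny illustrative data set with the default "features" and "label" columns
train_df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0]), 0.0), (Vectors.dense([1.0, 0.0]), 1.0)] * 20,
    ["features", "label"])

lr = LogisticRegression()
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])        # 2 x 2 grid = 4 child runs
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
cv_model = cv.fit(train_df)   # the call that gets tracked in MLflow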


TensorFrames: Spark Plus TensorFlow

Adi Polak gives us an introduction to TensorFrames:

In all TensorFrames functionality, the DataFrame is sent together with the computation graph. The DataFrame represents the distributed data, meaning that on every machine there is a chunk of the data that will go through the graph operations/transformations. This happens on every machine with the relevant data. Tungsten binary format is the actual binary in-memory data that goes through the transformations: first to an Apache Spark Java object, and from there it is sent to the TensorFlow Java API for graph calculations. This all happens in the Spark worker process, which can spin up many tasks, meaning various calculations run at the same time over the in-memory data.

An interesting bit of turnabout here is that the Scala API is the underdeveloped one; normally for Spark, the Python API is the Johnny-Come-Lately version.
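For a flavor of the Python API, here is a sketch along the lines of the TensorFrames README (TensorFrames 0.x; treat the details as approximate): a TensorFlow graph is applied block by block to a DataFrame column.

import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    x = tfs.block(df, "x")        # placeholder bound to the DataFrame's "x" column
    z = tf.add(x, 3, name="z")    # the graph each worker applies to its chunk
    df2 = tfs.map_blocks(z, df)   # DataFrame with the computed "z" column added

df2.show()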


Using the ML.NET Model Builder

I have a post looking at the ML.NET Model Builder:

You have four options from which to choose: two-class classification, multi-class classification, regression, or Choose Your Own Adventure. Today, we’re going to create a two-class classification model. Incidentally, they’re not kidding about things changing in preview—last time I looked at this, they didn’t have multi-class classifiers available.

Once you select Sentiment Analysis (that is, two-class classification of text), you can figure out how to feed data to this trainer.

I think this is fine for developers who are looking to add a machine learning component as a small part of a bigger product. I don’t think it will beat a trained human using R or Python, but it’s an interesting avenue.


Combining Machine Learning with DevOps

Rolf Tesmer explains that machine learning and DevOps aren’t oil and water (or maybe they are and we just need to stir harder):

In talking with various development teams, customers and DevOps engineers, a lot of the potential problems of meshing ML development into an enterprise DevOps process can be boiled down to a few different areas, which this post aims to address:

– ML stack might be different from the rest of the application stack
– Testing accuracy of ML model
– ML code is not always version controlled
– Hard to reproduce models (i.e., explainability)
– Need to rewrite featurizing + scoring code into different languages
– Hard to track breaking changes
– Difficult to monitor models & determine when to retrain

So DevOps helps with this, right? Right?

Well, er, some of them yes, but not all.

DevOps is not a panacea but it can solve certain types of problems well.


MLflow 1.0 Released

Clemens Mewald and Matei Zaharia announce the release of MLflow 1.0:

Today we are excited to announce the release of MLflow 1.0. Since its launch one year ago, MLflow has been deployed at thousands of organizations to manage their production machine learning workloads, and has become generally available on services like Managed MLflow on Databricks. The MLflow community has grown to over 100 contributors, and the MLflow PyPI package download rate has reached close to 600K times a month. The 1.0 release not only marks the maturity and stability of the APIs, but also adds a number of frequently requested features and improvements.

The release is publicly available starting today. Install MLflow 1.0 using PyPI, read our documentation to get started, and provide feedback on GitHub. Below we describe just a few of the new features in MLflow 1.0. Please refer to the release notes for a full list.

And it looks like they’re going to keep pushing on it from there.
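As a quick taste, the 1.0-era tracking API looks like this; per-step metric logging is one of the 1.0 additions:

import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    for step in range(3):
        # log the same metric at successive steps; the UI plots the series
        mlflow.log_metric("loss", 1.0 / (step + 1), step=step)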
