The tool that we are going to use to make a classifier is called a convolutional neural network, or CNN. You can find a great explanation of what these are right here on wikipedia.
But we are not going to fully train one ourselves: that would take way more time than I would be willing to spend. Instead, we are going to do transfer learning, where we take a pre-trained CNN and replace only the last layer by a layer of our own. Then we only need to train that single layer, as all the other layers already have weights that are quite sensible. Here we exploit the fact that the images we are interested in have a lot of the same properties as those images that the original network was trained on. You can find a great explanation of transfer learning here.
Read on for a detailed example.
While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation).
In doing that I ran into one more avoidable but strange issue in using xgboost: when run for a small number of rounds it at first appears that xgboost doesn’t get the unconditional average or grand average right (let alone the conditional averages Nina was working with)!
It’s not something you’ll hit very often, but if you’re trying
xgboost against a small enough data set with few enough rounds, it is something to keep in mind.
Keras is a high-level neural networks API, written in Python and capable of running on top of Tensorflow, CNTK or Theano. It was developed with a focus on enabling fast experimentation. In this blog, we are going to cover one small case study for fashion mnist.
Fashion-MNIST is a dataset of Zalando’s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
The end result wasn’t that great, but Shubham was using a sequential model rather than a convolutional neural network, so you can probably take this as a starting point and improve upon it.
The sharp eyed will see that the data set is defined by SQL. So, does that suffer from injection attacks? Short answer is no. If there was more than one result set within the Python code, it’s going to error out. So you’re protected there.
This is important, because the data set query can be defined with parameters. You can pass values to those parameters, heck, you’re likely to pass values to those parameters, from the external query or procedure. So, is that an attack vector?
Another factor is that you need explicitly to grant
EXECUTE ANY EXTERNAL SCRIPT rights to non-sysadmin, non-db_owner users, meaning a non-privileged user can’t execute external scripts at all. You can also limit the executing service account
Let’s quickly review neural networks.
Neural networks are universal approximators. This means that with enough neurons and time, a neural network can model any input/output relationship, to any degree of precision.
A standard feed forward neural network receives an input (vector) and feeds it forward through hidden layers to an output. SAS PROC NNET, for example, trains a multilayer perceptron neural network. As the name “multilayer” implies, there are multiple layers. Below we see the inputs (features), one hidden layer and the output (response, target). Each neuron is simply a mathematical function.
This is a complicated topic explained well. It’s also an overview more than a tutorial.
This option is sometimes considered to be real-time as the information is provided “as its needed,” but it is still a batch method. Using our telco example, a batch process can be run at night that will make a prediction for each customer, and an operational database is updated with the most recent prediction. The call center agent software can then fetch this prediction for the customer when they call in, and the agent can take action accordingly.
Read on for more.
Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools
TrainValidationSplit. These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info.
Databricks Runtime 5.3 and 5.3 ML and above support automatic MLflow tracking for MLlib tuning in Python.
With this feature, PySpark
TrainValidationSplitwill automatically log to MLflow, organizing runs in a hierarchy and logging hyperparameters and the evaluation metric. For example, calling
CrossValidator.fit()will log one parent run. Under this run,
CrossValidatorwill log one child run for each hyperparameter setting, and each of those child runs will include the hyperparameter setting and the evaluation metric. Comparing these runs in the MLflow UI helps with visualizing the effect of tuning each hyperparameter.
Hyperparameter tuning is critical for some of the more complex algorithms like random forests, gradient boosting, and neural networks.
In all TensorFrames functionality, the DataFrame is sent together with the computations graph. The DataFrame represents the distributed data, meaning in every machine there is a chunk of the data that will go through the graph operations/ transformations. This will happen in every machine with the relevant data. Tungsten binary format is the actual binary in-memory data that goes through the transformation, first to Apache Spark Java object and from there it is sent to TensorFlow Jave API for graph calculations. This all happens in the Spark Worker process, the Spark worker process can spin many tasks which mean various calculation at the same time over the in-memory data.
An interesting bit of turnabout here is that the Scala API is the underdeveloped one; normally for Spark, the Python API is the Johnny-Come-Lately version.
You have four options from which to choose: two-class classification, multi-class classification, regression, or Choose Your Own Adventure. Today, we’re going to create a two-class classification model. Incidentally, they’re not kidding about things changing in preview—last time I looked at this, they didn’t have multi-class classifiers available.
Once you select Sentiment Analysis (that is, two-class classification of text), you can figure out how to feed data to this trainer.
I think this is fine for developers who are looking to add a machine learning component as a small part of a bigger product. I don’t think it will beat a trained human using R or Python, but it’s an interesting avenue.
Rolf Tesmer explains that machine learning and DevOps aren’t oil and water (or maybe they are and we just need to stir harder):
In talking with various development teams, customers and DevOps engineers, a lot of the potential problems of meshing ML development into an enterprise DevOps process can be boiled down to a few different areas this aims to address…
– ML stack might be different from rest of the application stack
– Testing accuracy of ML model
– ML code is not always version controlled
– Hard to reproduce models (ie explainability)
– Need to re-write featurizing + scoring code into different languages
– Hard to track breaking changes
– Difficult to monitor models & determine when to retrain
So DevOps helps with this, right? Right?
Well er, some of them yes, but not all.
DevOps is not a panacea but it can solve certain types of problems well.