The AUC can be defined as “The probability that a randomly selected case will have a higher test result than a randomly selected control”. Let’s use this definition to calculate and visualize the estimated AUC.
In the figure below, the cases are presented on the left and the controls on the right.
Since we have only 12 patients, we can easily visualize all 32 possible combinations of one case and one control. (Rcode below)
Expanding from this easy-to-follow example, Colman walks us through some of the statistical tests involved. Check it out.
The more data that is used to train the classifier, the more accurate it will become over time. So if we continue to train it with actual results in 2017, then what it predicts in 2018 will be more accurate. Also, when Bayes gives a prediction, it will attach a probability. So it may answer the above question as follows: “Based on past data, I predict with 60% confidence that it will rain today.”
So the classifier is either in training mode or predicting mode. It is in training mode when we are teaching it. In this case, we are feeding it the outcome (the category). It is in predicting mode when we are giving it the features, but asking it what the most likely outcome will be.
My contribution is a joke that I heard last night: a Bayesian statistician hears hooves clomping the ground. He turns around and sees a tiger. Therefore, he decides that it must be a zebra. First time I’d heard that joke, and as a Bayesian zebra-spotter, I enjoyed it.
How many terms are possible? There are four basic ingredients: TP, FP, TN, and FN. So if each term may or may not be included in a sum in the numerator and denominator, that’s 16 possible numerators and 16 denominators, for a total of 256 possible terms to remember. Some of these are redundant, such as one(a.k.a. ONE), given by TP/TP, FP/FP, etc. If we insist that the numerator and denominator be different, that eliminates 16 possibilities, and we’re down to a more manageable 240 definitions. And if we rule out terms that are the reciprocals of other terms, we’re down to only 120 definitions to memorize.
And of those, John points out the handful which are generally important, providing us an excellent table with definitions of commonly-used terms.
Our main requirement for this new project was to build infrastructure that would be 100 percent self-service for our Data Scientists. In other words, my teammates and I would never be directly involved in the discovery, creation, configuration and management of the event data. Self-service would fix the primary shortcoming of our legacy event delivery system: manual administration that was performed by my team whenever a new dataset was born. This manual process hindered the productivity and access to event data for our Data Scientists. Meanwhile, fulfilling the requests of the Data Scientists hindered our own ability to improve the infrastructure. This scenario is exactly what the Data Platform Team strives to avoid. Building self-service tooling is the number one tenet of the Data Platform Team at Stitch Fix, so whatever we built to replace the old event infrastructure needed to be self-service for our Data Scientists. You can learn more about our philosophy in Jeff Magnusson’s post Engineers Shouldn’t Write ETL.
This is an architectural overview and a good read.
Here I am sharing the slides for a webinar I gave for SAP about Explaining Keras Image Classification Models with LIME.
Read on for links to additional resources as well.
However, this approach is not generally recommended so we have to find something more appropriate. One option could be forecasting with the Box-Jenkins methodology. In this case, we will use the SARIMA (Seasonal Auto Regressive Integrated Moving Average) model. In this model, we have to find optimal values for seven parameters:
- Auto Regressive Component (p)
- Integration Component (d)
- Moving Average Component (q)
- Seasonal Auto Regressive Component (P)
- Seasonal Integration Component (D)
- Seasonal Moving Average Component (Q)
- Length of Season (s)
To set these parameters properly you need to have knowledge of auto-correlation functions and partial auto-correlation functions.
Read on for a nice overview of this method, as well as the importance of making sure your time series data set is stationary.
François’s code example employs this Keras network architectural choice for binary classification. It comprises of three Dense layers: one hidden layer (16 units), one input layer (16 units), and one output layer (1 unit), as show in the diagram. “A hidden unit is a dimension in the representation space of the layer,” Chollet writes, where 16 is adequate for this problem space; for complex problems, like image classification, we can always bump up the units or add hidden layers to experiment and observe its effect on accuracy and loss metrics (which we shall do in the experiments below).
While the input and hidden layers use relu as an activation function, the final output layer uses sigmoid, to squash its results into probabilities between [0, 1]. Anything closer to 1 suggests positive, while something below 0.5 can indicate negative.
With this recommended baseline architecture, we train our base model and log all the parameters, metrics, and artifacts. This snippet code, from module models_nn.py, creates a stack of dense layers as depicted in the diagram above.
The overall accuracy is pretty good—I ran through a sample of 2K reviews from the set with Naive Bayes last night for a presentation and got 81% accuracy, so the neural network getting 93% isn’t too surprising. Seeing the confusion matrix in this demo would have been a nice addition.
Perfect multicollinearity occurs when one independent variable is an exact linear combination of other variables. For example, you already have X and Y as independent variables and you add another variable, Z = a*X + b*Y, to the set of independent variables. Now, this new variable, Z, does not add any significant or different value than provided by X or Y. The model can adjust itself to set the parameters that this combination is taken care of while determining the coefficients.
Multicollinearity may arise from several factors. Inclusion or incorrect use of dummy variables in the system may lead to multicollinearity. The other reason could be the usage of derived variables, i.e., one variable is computed from other variables in the system. This is similar to the example we took at the beginning of the article. The other reason could be taking variables which are similar in nature or which provide similar information or the variables which have very high correlation among each other.
Multicollinearity can make regression analysis trickier, and it’s worth knowing about. H/T R-bloggers.
PCA looks for a new the reference system to describe your data. This new reference system is designed in such a way to maximize the variance of the data across the new axis. The first principal component accounts for as much variance as possible, as does the second and so on. PCA transforms a set of (tipically) correlated variables into a set of uncorrelated variables called principal components. By design, each principal component will account for as much variance as possible. The hope is that a fewer number of PCs can be used to summarise the whole dataset. Note that PCs are a linear combination of the original data.
The procedure simply boils down to the following steps
Scale (normalize) the data (not necessary but suggested especially when variables are not homogeneous).
Calculate the covariance matrix of the data.
Calculate eigenvectors (also, perhaps confusingly, called “loadings”) and eigenvalues of the covariance matrix.
Choose only the first N biggest eigenvalues according to one of the many criteria available in the literature.
Project your data in the new frame of reference by multipliying your data matrix by a matrix whose columns are the N eigenvectors associated with the N biggest eigenvalues.
Use the projected data (very confusingly called “scores”) as your new variables for further analysis.
I like the explanations provided, and the data set is definitely something I’m not used to seeing with PCA. H/T R-bloggers
One prominent example is that of high risk applications. Let’s say you’re building a model that helps doctors decide on the preferred treatment for patients. In this case we should not only care about the accuracy of the model, but also about how certain the model is of its prediction. If the uncertainty is too high, the doctor should to take this into account.
Self-driving cars are another interesting example. When the model is uncertain if there is a pedestrian on the road we could use this information to slow the car down or trigger an alert so the driver can take charge.
Uncertainty can also help us with out of data examples. If the model wasn’t trained using examples similar to the sample at hand it might be better if it’s able to say “sorry, I don’t know”. This could have prevented the embarrassing mistake Google photos had when they misclassified African Americans as gorillas. Mistakes like that sometimes happen due to an insufficiently diverse training set.
The last usage of uncertainty, which is the purpose of this post, is as a tool for practitioners to debug their model. We’ll dive into this in a moment, but first, let’s talk about different types of uncertainty.