Variable Screening With vtreat

John Mount explains how you can use vtreat for determining variable importance:

Part of the vtreat philosophy is to assume after the vtreat variable processing the next step is a sophisticated supervised machine learningmethod. Under this assumption we assume the machine learning methodology (be it regression, tree methods, random forests, boosting, or neural nets) will handle issues of redundant variables, joint distributions of variables, overall regularization, and joint dimension reduction.
However, an important exception is: variable screening. In practice we have seen wide data-warehouses with hundreds of columns overwhelm and defeat state of the art machine learning algorithms due to over-fitting. We have some synthetic examples of this (here and here).
The upshot is: even in 2018 you can not treat every column you find in a data warehouse as a variable. You must at least perform some basic screening.

Read on to see a couple quick functions which help with this screening.

Reviewing Word Associations With R

Julia Silge does some exploratory analysis on the Small World of Words project:

The Small World of Words project focuses on word associations. You can try it out for yourself to see how it works, but the general idea is that the participant is presented with a word (from “telephone” to “journalist” to “yoga”) and is then asked to give their immediate association with that word. The project has collected more than 15 million responses to date, and is still collecting data. You can check out some pre-built visualizations the researchers have put together to explore the dataset, or you can download the data for yourself.

It’s an interesting analysis of the data set, mixed in with some good R code.

Executing Python Code In Power BI

Brad Llewellyn shows how to build a visual based on a Python script using Power BI:

Now that we’ve seen our data, it’s a relatively simple task to convert the R script to a Python script. There are a few major differences. First, Python is a general purpose programming language, whereas R is a statistical programming language. This means that some of the functionality provided in Base R requires additional libraries in Python. Pandas is a good library for data manipulation, but is already included by default in Power BI. Scikit-learn (also known as sklearn) is a good library for build predictive models. Finally, Seaborn and Matplotlib are good libraries for creating data visualizations.

In addition, there are some scenarios where Python is a bit more verbose than R, resulting in additional coding to achieve the same result. For instance, fitting a regression line to our data using the sklearn.linear_model.LinearRegression().fit() function required much more coding than the corresponding lm() function in R. Of course, there are plenty of situations where the opposite is true and R becomes the more verbose language.

Click through for the full example.

Road Construction Incentive Contracts And R

Sebastian Kranz promotes an interesting RTutor project:

Patrick Bajari and Gregory Lewis have collected a detailed sample of 466 road construction projects in Minnesota to study this question in their very interesting article Moral Hazard, Incentive Contracts and Risk: Evidence from Procurement in the Review of Economic Studies, 2014.
They estimate a structural econometric model and find that changes in contract design could substantially reduce the duration of road blockages and largely increase total welfare at only minor increases in the risk that road construction firms face.
As part of his Master Thesis at Ulm University, Claudius Schmid has generated a nice and detailed RTutor problem set that allows you to replicate the findings in an interactive fashion. You learn a lot about the structure and outcomes of the currently used contracts, the theory behind better contract design and how the structural model to assess the quantitative effects can be estimated and simulated. At the same time, you can hone your general data science and R skills.

Click through to a couple of ways to get to this RTutor project and learn a bit about building incentive contracts to modify behavior.  H/T R-Bloggers

Analyzing Customer Churn With Keras And H2O

Shirin Glander has released code pertaining to a forthcoming book chapter:

This is code that accompanies a book chapter on customer churn that I have written for the German dpunkt Verlag. The book is in German and will probably appear in February:
The code you find below can be used to recreate all figures and analyses from this book chapter. Because the content is exclusively for the book, my descriptions around the code had to be minimal. But I’m sure, you can get the gist, even without the book. 😉

Click through for the code.  This is using the venerable AT&T customer churn data set.

The Intuition Behind Principal Component Analysis

Holger von Jouanne-Diedrich gives us an intuition behind how principal component analysis (PCA) works:

Principal component analysis (PCA) is a dimension-reduction method that can be used to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which still contain most of the information.
PCA is a concept that is traditionally hard to grasp so instead of giving you the n’th mathematical derivation I will provide you with some intuition.
Basically PCA is nothing else but a projection of some higher dimensional object into a lower dimension. What sounds complicated is really something we encounter every day: when we watch TV we see a 2D-projection of 3D-objects!

Click through for the rest of the story.

Gradient Boosting And XGBoost

Shirin Glander has another English-language transcript from a German video, this time covering gradient boosting techniques:

Let’s look at how Gradient Boosting works. Most of the magic is described in the name: “Gradient” plus “Boosting”.

Boosting builds models from individual so called “weak learners” in an iterative way. In the Random Forests part, I had already discussed the differences between Bagging and Boostingas tree ensemble methods. In boosting, the individual models are not built on completely random subsets of data and features but sequentially by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances, which are hard to predict correctly (“difficult” cases) will be focused on during learning, so that the model learns from past mistakes. When we train each ensemble on a subset of the training set, we also call this Stochastic Gradient Boosting, which can help improve generalizability of our model.

The gradient is used to minimize a loss function, similar to how Neural Nets utilize gradient descent to optimize (“learn”) weights. In each round of training, the weak learner is built and its predictions are compared to the correct outcome that we expect. The distance between prediction and truth represents the error rate of our model. These errors can now be used to calculate the gradient. The gradient is nothing fancy, it is basically the partial derivative of our loss function – so it describes the steepness of our error function. The gradient can be used to find the direction in which to change the model parameters in order to (maximally) reduce the error in the next round of training by “descending the gradient”.

Along with neural networks, gradient boosting has become one of the dominant algorithms for machine learning, and is well worth learning about.

Spark MLflow 0.8.0 Released

Aaron Davidson and Jules Damji announce MLflow 0.8.0 on the Spark platform:

Improved MLflow UI Experience

  1. Compact Display for Metrics and Parameters: To avoid clutter and an explosion of columns for each metric or parameter, now we group them together in a single tabular column by default. That way, each runs’ parameters and metrics are listed nearby. Users can still click each parameter or metric to display it in a separate column or sort by it and customize their view this way.

  2. Nesting Runs: For nested MLflow runs, which are common in hyperparameter search or multi-step workflows, the UI will display a collapsible tree underneath each parent run. This makes it much easier to organize and visualize multi-step workflows.

  3. Labeling Runs: While MLflow gives each run a UUID by default, you can also now assign each run a name through the API. These names can also be edited in the UI.

  4. UI Persistence: The MLflow UI now remembers your filters, sorting and column setup in browser local storage so you no longer need to reconfigure the view each time.

Looks like there are some nice additions here.

Using K-Means Clustering For Anomaly Detection

Anais Dotis-Georgiou gives us an interesting use case of using k-means clustering along with InfluxDB (a time-series database) to detect anomalies in EKG data:

If you read Part Two, then you know these are the steps I used for anomaly detection with K-means:

  1. Segmentation – the process of splitting your time series data into small segments with a horizontal translation.

  2. Windowing – the action of multiplying your segmented data by a windowing function to truncate the dataset before and after the window. The term windowing gets its name from its functionality: it allows you to only see the data in the window range since everything before and after (or outside the window) is multiplied by zero. Windowing allows you to seamlessly stitch your reconstructed data together.

  3. Clustering – the task of grouping similar windowed segments and finding the centroids in the clusters. A centroid is at the center of a cluster. Mathematically, it is defined by the arithmetic mean position of all the points in the cluster.

  4. Reconstruction – the process of rebuilding your time series data. Essentially, you are matching your normal time series data to the closest centroid (the predicted centroid) and stitching those centroids together to produce the reconstructed data.

  5. Normal Error – The purpose of the Reconstruction is to calculate the normal error associated with the output of your time series prediction.

  6. Anomaly Detection – Since you know what the normal error for reconstruction is, you can now use it as a threshold for anomaly detection. Any reconstruction error above that normal error can be considered an anomaly.

Read the whole thing.  This is a really cool use case of a set of technologies along with a venerable (if sometimes troublesome) algorithm.

Using Kafka To Drive ML Predictions

Kai Waehner shows us a model architecture for using Apache Kafka to generate predictions from trained models:

Kafka applications are event based, and leverage stream processing to continuously process input data. If you’re using Kafka, then you can embed an analytic model natively in a Kafka Streams or KSQLapplication. There are various examples of Kafka Streams microservices embedding models built with TensorFlow, H2O or Deeplearning4j natively.

It is not always possible or feasible to embed analytic models directly due to architectural, security or organizational reasons. You can also choose to use RPC to perform model inference from your Kafka application (bearing in mind the the pros and cons discussed above). You can visit my project for an example of gRPC integration between a Kafka Streams microservice and locally hosted TensorFlow Serving container for making predictions with a hosted TensorFlow model.

There are a couple separate and interesting patterns here.


January 2019
« Dec