Category: Data Science

Let’s look at a concrete example with the Click-Through Rate Prediction dataset of ad impressions and clicks from the data science website Kaggle. The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click.

To build our advanced analytics workflow, let’s focus on the three main steps:

ETL
Data Exploration, for example, using SQL
Advanced Analytics / Machine Learning

The Databricks blog has a couple other examples, but this was the most interesting one for me.

Comments closed

Generating Basic Features From Text Data In R With textfeatures

Published 2018-07-24 by Kevin Feasel

Abdul Majed Raja demonstrates the textfeatures package in R:

Michael Kearney, Assistant Professor in University of Missouri, well known in the R community for the modern twitter package rtweet, has come up with a new R packaged called textfeatures that basically generates a bunch of features for any text data that you supply. Before you dream of Deep Learning based Package for Automated Text Feature Engineering, This isn’t that. This uses very simple Text Analysis principles and generates features like Number of Upper Case letters, Number of Punctuations – plain simple stuff and nothing fancy but pretty useful ones.

It’s a start for text analysis, though there’s a lot more after this.

Comments closed

Using word2vec To Model User Behavior

Published 2018-07-23 by Kevin Feasel

Nishan Subedi walks us through an Etsy project to model user journeys via semantic embedding techniques:

We initially started training the embeddings as a Skip-gram model with negative sampling (NEG as outlined in the original word2vec paper) method. The Skip-gram model performs better than the Continuous Bag Of Words (CBOW) model for larger vocabularies. It models the context given a target token and attempts to maximize the average likelihood of seeing any of the context tokens given a target token. The negative sampling draws a negative token from the entire corpus with a frequency that is directly proportional to the frequency of the token appearing in the corpus.

Training a Skip-gram model on only randomly selected negatives, however, ignores implicit contextual signals that we have found to be indicative of user preference in other contexts. For example, if a user clicks on the second item for a search query, the user most likely saw, but did not like, the first item that showed up in the search results. We extend the Skip-gram loss function by appending these implicit negative signals to the Skip-gram loss directly.

Similarly, we consider the purchased item in a particular session to be a global contextual token that applies to the entire sequence of user interactions. The intuition behind this is that there are many touch points on the user’s journey that help them come to the final purchase decision, and so we want to share the purchase intent across all the different actions that they took. This is also referred to as the linear multi-touch attribution model.

This is a very interesting article, and their attempt at getting around the problem of unexpected explosive growth in demand.

Comments closed

Testing Spatial Equilibrium Concepts With tidycensus

Published 2018-07-16 by Kevin Feasel

Ignacio Sarmiento Barbieri walks us through the concept of spatial equilibrium and tests using data from the tidycensus package:

Let’s take the model to the data and reproduce figures 2.1. and 2.2 of “Cities, Agglomeration, and Spatial Equilibrium”. The focus are two cities, Chicago and Boston. These cities are chosen because both differ in how easy is to access to their city centers. Chicago is fairly easy, Boston is more complicated. Our model then implies that gradients then should reflect the differential costs to access the city centers.

So let’s begin, the first step is to get some data. To do so I’m are going to use the “tidycensus” package. This package will allow me to get data from the census website using their API. We are also going to need the help of three other packages: “sf” to handle spatial data, “dplyr” my go-to package to wrangle data, and “ggplot2” to plot my results.
require("tidycensus", quietly=TRUE)
require("sf", quietly=TRUE)
require("dplyr", quietly=TRUE)
require("ggplot2", quietly=TRUE)
In order to get access to the Census API, I need to supply a key, which can be obtained from http://api.census.gov/data/key_signup.html.

Read on for theory and a test. H/T R-bloggers

Comments closed

Interacting With SQL Server From Pandas

Published 2018-07-16 by Kevin Feasel

Tomaz Kastrun shows how to use pyodbc to interact with a SQL Server database from Pandas:

In the SQL Server Management Studio (SSMS), the ease of using external procedure sp_execute_external_script has been (and still will be) discussed many times. But the reason for this short blog post is the fact that, changing Python environments using Conda package/module management within Microsoft SQL Server (Services), is literally impossible. Scenarios, where you want to build a larger set of modules (packages) but are impossible to be compatible with your SQL Server or Conda, then you would need to set up a new virtual environment and start using Python from there.

Communicating with database to load the data into different python environment should not be a problem. Python Pandas module is an easy way to store dataset in a table-like format, called dataframe. Pandas is very powerful python package for handling data structures and doing data analysis.

Click through for examples of reading and writing data.

Comments closed

When Image Classifiers Look At Unknown Objects

Published 2018-07-13 by Kevin Feasel

Pete Warden explains that image classifiers aren’t magic:

As people, we’re used to being able to classify anything we see in the world around us, and we naturally expect machines to have the same ability. Most models are only trained to recognize a very limited set of objects though, such as the 1,000 categories of the original ImageNet competition. Crucially, the training process makes the assumption that every example the model sees is one of those objects, and the prediction must be within that set. There’s no option for the model to say “I don’t know”, and there’s no training data to help it learn that response. This is a simplification that makes sense within a research setting, but causes problems when we try to use the resulting models in the real world.

Back when I was at Jetpac, we had a lot of trouble convincing people that the ground-breaking AlexNet model was a big leap forward because every time we handed over a demo phone running the network, they would point it at their faces and it would predict something like “Oxygen mask” or “Seat belt”. This was because the ImageNet competition categories didn’t include any labels for people, but most of the photos with mask and seatbelt labels included faces along with the objects. Another embarrassing mistake came when they would point it at a plate and it would predict “Toilet seat”! This was because there were no plates in the original categories, and the closest white circular object in appearance was a toilet.

Read the whole thing.

Comments closed

Microsoft Research Open Data Sets

Published 2018-07-13 by Kevin Feasel

David Smith notes that there are several data sets that Microsoft Research has made available:

Other data sets of note include:

A collection of 38M tweets related to the 2012 US election
3-D capture data from individuals performing a variety of hand gestures
Infer.NET, a framework for running Bayesian inference in graphical models
Images for 1 million celebrities, and associated tags
MS MARCO, is a new large-scale dataset for reading comprehension and question answering

Click through for more information, and then check out the data sets.

Comments closed

Dealing With Heteroskedasticity

Published 2018-07-09 by Kevin Feasel

Bruno Rodrigues explains the notion of heteroskedasticity and shows ways of dealing with this issue in a linear regression:

This test shows that we can reject the null that the variance of the residuals is constant, thus heteroskedacity is present. To get the correct standard errors, we can use the vcovHC() function from the {sandwich} package (hence the choice for the header picture of this post):
lmfit %>% 
    vcovHC() %>% 
    diag() %>% 
    sqrt()
##       (Intercept)   regionnortheast       regionsouth        regionwest 
##      311.31088691       25.30778221       23.56106307       24.12258706 
##         residents   young_residents per_capita_income 
##        0.09184368        0.68829667        0.02999882
By default vcovHC() estimates a heteroskedasticity consistent (HC) variance covariance matrix for the parameters. There are several ways to estimate such a HC matrix, and by default vcovHC() estimates the “HC3” one. You can refer to Zeileis (2004) for more details.

We see that the standard errors are much larger than before! The intercept and regionwest variables are not statistically significant anymore.

The biggest problem with heteroskedasticity is that it can introduce bias in error terms. That’s not the end of the world, but if the level of heteroskedasticity is serious enough, we want to find ways to account for it. H/T R-Bloggers.

Comments closed

Constrained Optimization In Python: pyomo

Published 2018-07-06 by Kevin Feasel

Jeff Schecter introduces us to pyomo, a Python package for constrained optimization problems:

Constrained optimization is a tool for minimizing or maximizing some objective, subject to constraints. For example, we may want to build new warehouses that minimize the average cost of shipping to our clients, constrained by our budget for building and operating those warehouses. Or, we might want to purchase an assortment of merchandise that maximizes expected revenue, limited by a minimum number of different items to stock in each department and our manufacturers’ minimum order sizes.

Here’s the catch: all objectives and constraints must be linear or quadratic functions of the model’s fixed inputs (parameters, in the lingo) and free variables.

Constraints are limited to equalities and non-strict inequalities. (Re-writing strict inequalities in these terms can require some algebraic gymnastics.) Conventionally, all terms including free variables live on the lefthand side of the equality or inequality, leaving only constants and fixed parameters on the righthand side.

To build your model, you must first formalize your objective function and constraints. Once you’ve expressed these terms mathematically, it’s easy to turn the math into code and let pyomo find the optimal solution.

I haven’t touched it in a decade, but I did have some success with LINGO for solving the same type of problem.

Comments closed

Area Under The ROC Is Not Accuracy

Published 2018-07-05 by Kevin Feasel

Stephen Chen debunks bad journalistic summaries of a Google research paper:

Journalists latched onto Google’s NN 0.95 score vs. the comparison 0.86 (see EWS Strawman below), as the accuracy of determining mortality. However the actual metric the researchers used is AUROC (Area Under Receiver Operating Characteristic Curve) and not a measure of predictive accuracy that indexes the difference between the predicted vs. actual like RMSE (Root Mean Squared Error) or MAPE (Mean Absolute Percentage Error). Some articles even erroneously try to explain the 0.95 as the odds ratio.

Just as the concept of significance has different meanings to statisticians and laypersons, AUROC as a measure of model accuracy does not mean the probability of Google’s NN predicting mortality accurately as journalists/laypersons have taken it to mean. The ROC (see sample above) is a plot of a model’s False Positive Rate (i.e. predicting mortality where there is none) vs. the True Positive Rate (i.e. correctly predicting mortality). A larger area under the curve (AUROC) means the model produces less False Positives, not the certainty of mortality as journalists erroneously suggest.

The researchers themselves made no claim to soothsayer abilities, what they said in the paper was:

… (their) deep learning model would fire half the number of alerts of a traditional predictive model, resulting in many fewer false positives.

It’s an interesting article and a reminder of the importance of terminological precision (something I personally am not particularly good at).

Comments closed