Category: Data Science

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to design adaptive clinical trial designs, such as clinical trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the first patient in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?

Read on for John’s solution.

Comments closed

A Geometric Depiction Of Covariance

Published 2018-08-07 by Kevin Feasel

Nikolai Janakiev explains the concept of the covariance matrix using a bit of Python and some graphs:

In this article we saw the relationship of the covariance matrix with linear transformation which is an important building block for understanding and using PCA, SVD, the Bayes Classifier, the Mahalanobis distance and other topics in statistics and pattern recognition. I found the covariance matrix to be a helpful cornerstone in the understanding of the many concepts and methods in pattern recognition and statistics.

Many of the matrix identities can be found in The Matrix Cookbook. The relationship between SVD, PCA and the covariance matrix are elegantly shown in this question.

Understanding covariance is critical for a number of statistical techniques, and this is a good way of describing it.

Comments closed

Calculating Cohort Lifetime Value With Excel And R

Published 2018-08-07 by Kevin Feasel

Eleni Markou shows how to calculate the lifetime value of a group of customers using two techniques:

A lot of ink has been spilled in developing various descriptions of the LTV, the majority of which ends up with mathematical formulas that are based on margin (m), retention rate (r) and discount rate (d) like the following (here):

However, this model appears to be not that realistic as it is based on a few quite restrictive assumptions:

Retention is assumed to be constant during the lifetime of a customer, i.e. the probability r of remaining retained remains the same across all months.

An infinite time horizon is assumed when calculating the present value of future cash flows.

The unit economics are supposed to be constant throughout lifetime which leads to a constant contribution margin.

Yet when dealing with an actual company, it easily becomes evident that none of the aforementioned conditions actually hold. Especially in early-stage businesses the size of the time periods across which you would like to calculate the LTV is month – or week – sized while at the same time the retention rate across them can vary significantly as the company’s products evolve quickly.

There’s a lot packed into that article, so give it a read.

Comments closed

Exploratory Time Series Analysis

Published 2018-07-31 by Kevin Feasel

The authors at Knoyd have a post on exploratory data analysis of a time series data set:

From the plot above we can clearly see that time-series has strong seasonal and trend components. To estimate the trend component we can use a function from the pandas library called rolling_mean and plot the results. If we want to make the plot more fancy and reusable for another time-series it is a good idea to make a function. We can call this function plot_moving_average.

The second part of the series promises to use Box-Jenkins to forecast future values.

Comments closed

Implementing K Nearest Neighbors In Python

Published 2018-07-30 by Kevin Feasel

Atul Harsha gives us a demo on k nearest neighbors in Python:

In order to make any predictions, you have to calculate the distance between the new point and the existing points, as you will be needing k closest points.

In this case for calculating the distance, we will use the Euclidean distance. This is defined as the square root of the sum of the squared differences between the two arrays of numbers

Specifically, we need only first 4 attributes(features) for distance calculation as the last attribute is a class label. So for one of the approach is to limit the Euclidean distance to a fixed length, thereby ignoring the final dimension.

Check it out.

Comments closed

Explaining Text Classification Models With LIME

Published 2018-07-27 by Kevin Feasel

Shirin Glander shows us how to use LIME to explain which words help us classify whether a user liked a particular item:

Okay, not a perfect score but good enough for me – right now, I’m more interested in the explanations of the model’s predictions. For this, we need to run the lime() function and give it

the text input that was used to construct the model

the trained model

the preprocessing function
explainer <- lime(clothing_reviews_train$text, 
                  xgb_model, 
                  preprocess = get_matrix)
With this, we could right away call the interactive explainer Shiny app, where we can type any text we want into the field on the left and see the explanation on the right: words that are underlined green support the classification, red words contradict them.

I hadn’t used LIME for this before, and it looks very interesting. H/T R-Bloggers

Comments closed

Visualizing Linear Regression Results

Published 2018-07-26 by Kevin Feasel

Bernardo Lares gives us a few ways of interpreting visually a linear regression result in R:

The most obvious plot to study for a linear regression model, you guessed it, is the regression itself. If we plot the predicted values vs the real values we can see how close they are to our reference line of 45° (intercept = 0, slope = 1). If we’d had a very sparse plot where we can see no clear tendency over that line, then we have a bad regression. On the other hand, if we have all our points over the line, I bet you gave the model your wished results!

Then, the Adjusted R2 on the plot gives us an easy parameter for us to compare models and how well did it fits our reference line. The nearer this value gets to 1, the better. Without getting too technical, if you add more and more useless variables to a model, this value will decrease; but, if you add useful variables, the Adjusted R-Squared will improve.

We also get the RMSE and MAE (Root-Mean Squared Error and Mean Absolute Error) for our regression’s results. MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. On the other side we have RMSE, which is a quadratic scoring rule that also measures the average magnitude of the error. It’s the square root of the average of squared differences between prediction and actual observation. Both metrics can range from 0 to ∞ and are indifferent to the direction of errors. They are negatively-oriented scores, which means lower values are better.

I like this approach to explaining models.

Comments closed

Data Engineering Remains As Important As Ever

Published 2018-07-25 by Kevin Feasel

Prashanth Southekal has good news for ETL developers:

While many companies have embarked on data analytics initiatives, only a few have been successful. Studies have shown that over 70% of data analytics programs fail to realize their full potential and over 80% of the digital transformation initiatives fail. While there are many reasons that affect successful deployment of data analytics, one fundamental reason is lack of good quality data. However, many business enterprises realize this and invest considerable time and effort in data cleansing and remediation; technically known as data engineering. It is estimated that about 60 to 70% of the effort in data analytics is on data engineering. Given that data quality is an essential requirement for analytics, there are 5 key reasons on why data analytics is heavy on data engineering.

1.Different systems and technology mechanisms to integrate data.

Business systems are designed and implemented for a purpose; mainly for recording business transactions. The mechanisms for data capture in Business systems such as ERP is batch/discrete data while in the SCADA/IoT Field Systems it is for continuous/time-series data. This means that these business systems store diverse data types caused by the velocity, volume, and variety dimensions in the data. Hence the technology (including the database itself) to capture data is varied and complex. And when you are trying to integrate data from these diverse systems from different vendors, the metadata model varies resulting in data integration challenges.

That 60-70% on data engineering is probably a moderate underestimate.

Comments closed

Analyzing Clickstream Data With Spark

Published 2018-07-25 by Kevin Feasel

Tony Cruz and Denny Lee analyze advertising data in Spark and predict click counts given certain input features:

Let’s look at a concrete example with the Click-Through Rate Prediction dataset of ad impressions and clicks from the data science website Kaggle. The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click.

To build our advanced analytics workflow, let’s focus on the three main steps:

ETL
Data Exploration, for example, using SQL
Advanced Analytics / Machine Learning

The Databricks blog has a couple other examples, but this was the most interesting one for me.

Comments closed

Generating Basic Features From Text Data In R With textfeatures

Published 2018-07-24 by Kevin Feasel

Abdul Majed Raja demonstrates the textfeatures package in R:

Michael Kearney, Assistant Professor in University of Missouri, well known in the R community for the modern twitter package rtweet, has come up with a new R packaged called textfeatures that basically generates a bunch of features for any text data that you supply. Before you dream of Deep Learning based Package for Automated Text Feature Engineering, This isn’t that. This uses very simple Text Analysis principles and generates features like Number of Upper Case letters, Number of Punctuations – plain simple stuff and nothing fancy but pretty useful ones.

It’s a start for text analysis, though there’s a lot more after this.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31