Press "Enter" to skip to content

Category: Data Science

Time-Series Analysis With Box-Jenkins

The folks at Knoyd walk us through time series analysis using the Box-Jenkins method:

However, this approach is not generally recommended so we have to find something more appropriate. One option could be forecasting with the Box-Jenkins methodology. In this case, we will use the SARIMA (Seasonal Auto Regressive Integrated Moving Average) model. In this model, we have to find optimal values for seven parameters:

  • Auto Regressive Component (p)
  • Integration Component (d)
  • Moving Average Component (q)
  • Seasonal Auto Regressive Component (P)
  • Seasonal Integration Component (D)
  • Seasonal Moving Average Component (Q)
  • Length of Season (s)

To set these parameters properly you need to have knowledge of auto-correlation functions and partial auto-correlation functions.

Read on for a nice overview of this method, as well as the importance of making sure your time series data set is stationary.

Comments closed

Using MLFlow For Binary Classification In Keras

Jules Damji walks us through classifying movie reviews as positive or negative reviews, building a neural network via Keras on MLFlow along the way:

François’s code example employs this Keras network architectural choice for binary classification. It comprises of three Dense layers: one hidden layer (16 units), one input layer (16 units), and one output layer (1 unit), as show in the diagram. “A hidden unit is a dimension in the representation space of the layer,” Chollet writes, where 16 is adequate for this problem space; for complex problems, like image classification, we can always bump up the units or add hidden layers to experiment and observe its effect on accuracy and loss metrics (which we shall do in the experiments below).

While the input and hidden layers use relu as an activation function, the final output layer uses sigmoid, to squash its results into probabilities between [0, 1]. Anything closer to 1 suggests positive, while something below 0.5 can indicate negative.

With this recommended baseline architecture, we train our base model and log all the parameters, metrics, and artifacts. This snippet code, from module models_nn.py, creates a stack of dense layers as depicted in the diagram above.

The overall accuracy is pretty good—I ran through a sample of 2K reviews from the set with Naive Bayes last night for a presentation and got 81% accuracy, so the neural network getting 93% isn’t too surprising.  Seeing the confusion matrix in this demo would have been a nice addition.

Comments closed

Dealing With Multicollinearity With R

Chaitanya Sagar explains the concept of multicollinearity in linear regressions and how we can mitigate this issue in R:

Perfect multicollinearity occurs when one independent variable is an exact linear combination of other variables. For example, you already have X and Y as independent variables and you add another variable, Z = a*X + b*Y, to the set of independent variables. Now, this new variable, Z, does not add any significant or different value than provided by X or Y. The model can adjust itself to set the parameters that this combination is taken care of while determining the coefficients.

Multicollinearity may arise from several factors. Inclusion or incorrect use of dummy variables in the system may lead to multicollinearity. The other reason could be the usage of derived variables, i.e., one variable is computed from other variables in the system. This is similar to the example we took at the beginning of the article. The other reason could be taking variables which are similar in nature or which provide similar information or the variables which have very high correlation among each other.

Multicollinearity can make regression analysis trickier, and it’s worth knowing about.  H/T R-bloggers.

Comments closed

Principal Component Analysis With Faces

Mic at The Beginner Programmer shows us how to creepy PCA diagrams with human faces:

PCA looks for a new the reference system to describe your data. This new reference system is designed in such a way to maximize the variance of the data across the new axis. The first principal component accounts for as much variance as possible, as does the second and so on. PCA transforms a set of (tipically) correlated variables into a set of uncorrelated variables called principal components. By design, each principal component will account for as much variance as possible. The hope is that a fewer number of PCs can be used to summarise the whole dataset. Note that PCs are a linear combination of the original data.

The procedure simply boils down to the following steps

  1. Scale (normalize) the data (not necessary but suggested especially when variables are not homogeneous).

  2. Calculate the covariance matrix of the data.

  3. Calculate eigenvectors (also, perhaps confusingly, called “loadings”) and eigenvalues of the covariance matrix.

  4. Choose only the first N biggest eigenvalues according to one of the many criteria available in the literature.

  5. Project your data in the new frame of reference by multipliying your data matrix by a matrix whose columns are the N eigenvectors associated with the N biggest eigenvalues.

  6. Use the projected data (very confusingly called “scores”) as your new variables for further analysis.

I like the explanations provided, and the data set is definitely something I’m not used to seeing with PCA.  H/T R-bloggers

Comments closed

Using Uncertainty For Model Interpretation

Yoel Zeldes and Inbar Naor explain how uncertainty can help you understand your models better:

One prominent example is that of high risk applications. Let’s say you’re building a model that helps doctors decide on the preferred treatment for patients. In this case we should not only care about the accuracy of the model, but also about how certain the model is of its prediction. If the uncertainty is too high, the doctor should to take this into account.

Self-driving cars are another interesting example. When the model is uncertain if there is a pedestrian on the road we could use this information to slow the car down or trigger an alert so the driver can take charge.

Uncertainty can also help us with out of data examples. If the model wasn’t trained using examples similar to the sample at hand it might be better if it’s able to say “sorry, I don’t know”. This could have prevented the embarrassing mistake Google photos had when they misclassified African Americans as gorillas. Mistakes like that sometimes happen due to an insufficiently diverse training set.

The last usage of uncertainty, which is the purpose of this post, is as a tool for practitioners to debug their model. We’ll dive into this in a moment, but first, let’s talk about different types of uncertainty.

Interesting argument.

Comments closed

Naive Bayes In Python

Kislay Keshari explains the Naive Bayes algorithm and shows an implementation in Python:

Naive Bayes in the Industry

Now that you have an idea of what exactly Naive Bayes is and how it works, let’s see where it is used in the industry.

RSS Feeds

Our first industrial use case is News Categorization, or we can use the term ‘text classification’ to broaden the spectrum of this algorithm. News on the web is rapidly growing where each news site has its own different layout and categorization for grouping news. Companies use a web crawler to extract useful text from HTML pages of news articles to construct a Full Text RSS. The contents of each news article is tokenized (categorized). In order to achieve better classification results, we remove the less significant words, i.e. stop, from the document. We apply the naive Bayes classifier for classification of news content based on news code.

It’s a good overview of the topic and a particular implementation in Python.  Naive Bayes is a technique which you want in the bag:  there are a lot of techniques which tend to be better in specific domains, but Naive Bayes is easy to implement and usually provides acceptable performance.

Comments closed

Bayesian Approaches To The Cold Start Problem

John Cook explains what you can do with data-driven applications when you don’t yet have the data:

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to design adaptive clinical trial designs, such as clinical trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the first patient in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?

Read on for John’s solution.

Comments closed

A Geometric Depiction Of Covariance

Nikolai Janakiev explains the concept of the covariance matrix using a bit of Python and some graphs:

In this article we saw the relationship of the covariance matrix with linear transformation which is an important building block for understanding and using PCASVD, the Bayes Classifier, the Mahalanobis distance and other topics in statistics and pattern recognition. I found the covariance matrix to be a helpful cornerstone in the understanding of the many concepts and methods in pattern recognition and statistics.

Many of the matrix identities can be found in The Matrix Cookbook. The relationship between SVD, PCA and the covariance matrix are elegantly shown in this question.

Understanding covariance is critical for a number of statistical techniques, and this is a good way of describing it.

Comments closed

Calculating Cohort Lifetime Value With Excel And R

Eleni Markou shows how to calculate the lifetime value of a group of customers using two techniques:

A lot of ink has been spilled in developing various descriptions of the LTV, the majority of which ends up with mathematical formulas that are based on margin (m), retention rate (r) and discount rate (d) like the following (here):

However, this model appears to be not that realistic as it is based on a few quite restrictive assumptions:

  • Retention is assumed to be constant during the lifetime of a customer, i.e. the probability r of remaining retained remains the same across all months.
  • An infinite time horizon is assumed when calculating the present value of future cash flows.
  • The unit economics are supposed to be constant throughout lifetime which leads to a constant contribution margin.

Yet when dealing with an actual company, it easily becomes evident that none of the aforementioned conditions actually hold. Especially in early-stage businesses the size of the time periods across which you would like to calculate the LTV is month – or week – sized while at the same time the retention rate across them can vary significantly as the company’s products evolve quickly.

There’s a lot packed into that article, so give it a read.

Comments closed