Press "Enter" to skip to content

Category: Data Science

Reducing Dimensionality

Antoine Guillot explains some of the basic concepts of variable reduction in a data analysis:

Each of these people can be represented as points in a 3 Dimensional space. With a gross approximation, each people is in a 50*50*200 (cm) cube. If we use a resolution of 1cm and three color channels, then can be represented by 1,000,000 variables.
On the other hand, the shadow is only in 2 dimensions and in black and white, so each shadow only needs 50*200=10,000 variables.
The number of variables was divided by 100 ! And if your goal is to detect human vs cat, or even men vs women, the data from the shadow may be enough.

Read on for intuitive discussions of techniques like principal component analysis and linear discriminant analysis.  H/T R-Bloggers

Comments closed

Random Forests In R

Anish Sing Walia explains the basics of random forests and provides sample code in R:

Random Forests are similar to a famous Ensemble technique called Bagging but have a different tweak in it. In Random Forests the idea is to decorrelate the several trees which are generated on the different bootstrapped samples from training Data.And then we simply reduce the Variance in the Trees by averaging them.
Averaging the Trees helps us to reduce the variance and also improve the Perfomance of Decision Trees on Test Set and eventually avoid Overfitting.

The idea is to build lots of Trees in such a way to make the Correlation between the Trees smaller.

Random forests frequently give a good answer to classification problems, enough so as to make them a nice starting point.

Comments closed

Neural Networks From Scratch

Ilia Karmanov explains neural nets and shows how to build one in R:

Hence, my motivation for this post is two-fold:

  1. Understanding (by writing from scratch) the leaky abstractions behind neural-networks dramatically shifted my focus to elements whose importance I initially overlooked. If my model is not learning I have a better idea of what to address rather than blindly wasting time switching optimisers (or even frameworks).
  2. A deep-neural-network (DNN), once taken apart into lego blocks, is no longer a black-box that is inaccessible to other disciplines outside of AI. It’s a combination of many topics that are very familiar to most people with a basic knowledge of statistics. I believe they need to cover very little (just the glue that holds the blocks together) to get an insight into a whole new realm.

Starting from a linear regression we will work through the maths and the code all the way to a deep-neural-network (DNN) in the accompanying R-notebooks. Hopefully to show that very little is actually new information.

This is pretty detailed.  Karmanov mentions Andrej Karpathy, whose Hacker’s guide to Neural Networks is also a must-read on the topic.

Comments closed

Using bsts In R

Steven L. Scott explains what the bsts package does:

Time series data appear in a surprising number of applications, ranging from business, to the physical and social sciences, to health, medicine, and engineering. Forecasting (e.g. next month’s sales) is common in problems involving time series data, but explanatory models (e.g. finding drivers of sales) are also important. Time series data are having something of a moment in the tech blogs right now, with Facebook announcing their “Prophet” system for time series forecasting (Taylor and Letham 2017), and Google posting about its forecasting system in this blog (Tassone and Rohani 2017).

This post summarizes the bsts R package, a tool for fitting Bayesian structural time series models. These are a widely useful class of time series models, known in various literatures as “structural time series,” “state space models,” “Kalman filter models,” and “dynamic linear models,” among others. Though the models need not be fit using Bayesian methods, they have a Bayesian flavor and the bsts package was built to use Bayesian posterior sampling.

If you’re looking for time series models, this looks like a good one.

Comments closed

Measuring Model Accuracy

Fabio Veronesi shows several methods of testing model accuracy:

Mean Squared Deviation or Mean Squared Error

This is simply the numerator of the previous equation, but it is not used often. The issue with both the RMSE and the MSE is that since they square the residuals they tend to be more affected by large residuals. This means that even if our model explains the large majority of the variation in the data very well, with few exceptions; these exceptions will inflate the value of RMSE.

Click through for several calculations.  H/T R-bloggers

Comments closed

Regularization Prevents Overfitting

Hui Li has an explanation of what regularization is and how it works to reduce the likelihood of overfitting training data:

Assume that the red line is the regression model we learn from the training data set. It can be seen that the learned model fits the training data set perfectly, while it cannot generalize well to the data not included in the training set. There are several ways to avoid the problem of overfitting.

To remedy this problem, we could:

  • Get more training examples.
  • Use a simple predictor.
  • Select a subsample of features.

In this blog post, we focus on the second and third ways to avoid overfitting by introducing regularization on the parameters βi of the model.

Read the whole thing.

Comments closed

Loan Chargeoff Templates

Ajay Jagannathan announces a couple new Cortana Intelligence Solutions Gallery templates:

For more information, read this blog: End to End Loan ChargeOff Prediction Built Using Azure HDInsight Spark Clusters and SQL Server 2016 R Service

We have published two solution templates deployable using two technology stacks for the above chargeoff scenario:-

  1. Loan Chargeoff Prediction using SQL Server 2016 R Services – Using DSVM with SQL Server 2016 and Microsoft ML, this solution template walks through how to create and clean up a set of simulated data, use 5 different models to train, select the best performant model, perform scoring using the model and save the prediction results back to SQL Server. A PowerBI report connects to the prediction table and show interactive reports with the user on the chargeoff prediction.

  2. Loan Chargeoff Prediction using HDInsight Spark Clusters – This solution demonstrates how to develop machine learning models for predicting loan chargeoff (including data processing, feature engineering, training and evaluating models), deploy the models as a web service (on the edge node) and consume the web service remotely with Microsoft R Server on Azure HDInsight Spark clusters. The final predictions is saved to a Hive table which could be visualized in Power BI.

These tend to be nice because they show you how the different pieces of the Azure stack tie together.

Comments closed

What’s Wrong With CRISP-DM

Jen Stirrup explains the issues with CRISP-DM, a model for data mining:

The model no longer seems to be actively maintained. At the time of writing, the official site, CRISP-DM.org, is no longer being maintained. Further, the framework itself has not been updated on issues on working with new technologies, such as Big Data.

Jen then contrasts with with Microsoft’s Team Data Science Process framework; click through for that.

Comments closed

Mann-Whitney U Test in SQL

Phil Factor continues his Statistics in SQL series with the Mann-Whitney U test:

There are several ways that you can test this, but nobody is going to argue with you if you use a Mann–Whitney U test to test whether two samples come from the same distribution. It doesn’t require that the data has any particular type of distribution. It just requires that each observation is done by a different member of the population so that all the observations from both groups are independent of each other. It is really just a test of differences in mean-rank between two populations’ pooled ranking. To test this difference It has to be possible to compare any of the observations with any of the others and say which of the two are greater. Your objective is to disprove the assumption that The distributions of both populations are equal. Calculating a measure of the difference is simple, and was designed to be done easily by hand before computers. The probability that the observed difference occurred by chance is easily calculated for large samples because U then approximates to the normal distribution, but it is complex for small samples. Here, we have a small sample and are just interested in whether the two-tailed test is signifcant at the five percent level so we dodge the bullet by using a significance lookup table for the critical value of U.

Read on for Phil’s implementation of the test.

Comments closed

Neural Nets On Spark

Nisha Muktewar and Seth Hendrickson show how to use Deeplearning4j to build deep learning models on Hadoop and Spark:

Modern convolutional networks can have several hundred million parameters. One of the top-performing neural networks in the Large Scale Visual Recognition Challenge (also known as “ImageNet”), has 140 million parameters to train! These networks not only take a lot of compute and storage resources (even with a cluster of GPUs, they can take weeks to train), but also require a lot of data. With only 30000 images, it is not practical to train such a complex model on Caltech-256 as there are not enough examples to adequately learn so many parameters. Instead, it is better to employ a method called transfer learning, which involves taking a pre-trained model and repurposing it for other use cases. Transfer learning can also greatly reduce the computational burden and remove the need for large swaths of specialized compute resources like GPUs.

It is possible to repurpose these models because convolutional neural networks tend to learn very general features when trained on image datasets, and this type of feature learning is often useful on other image datasets. For example, a network trained on ImageNet is likely to have learned how to recognize shapes, facial features, patterns, text, and so on, which will no doubt be useful for the Caltech-256 dataset.

This is a longer post, but on an extremely interesting topic.

Comments closed