Press "Enter" to skip to content

Category: Data Science

The Joy of Decision Trees

Tom Jordan explains how a simple set of “if” statements forms the basis of some powerful data science algorithms:

While it is true there are techniques in machine learning that require advanced maths knowledge, some of the most widely used approaches make use of knowledge given to every child at secondary school. The line of best fit, drawn by many a student in Year 8 Chemistry, is also known by its alter ego, linear regression, and sees applications all over machine learning. Neural networks, central to some of the most cutting-edge applications, are formed of simple mathematical models consisting of some addition and multiplication.

A personal favourite technique, and the subject of this blog, is the humble decision tree, taught in schools all over the country. This blog will take a high-level look at the theory around decision trees, an extension using random forests, and the real-world applications of these techniques.
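
To see just how simple the building block is, here is a toy sketch of my own (not from Tom's post) using the rpart package:

    # A decision tree is a learned series of if/else splits; rpart is one
    # common implementation in R.
    library(rpart)

    fit <- rpart(Species ~ ., data = iris, method = "class")

    # Printing the fit shows the "if" statements the tree has learned,
    # e.g. "Petal.Length < 2.45" at the root.
    print(fit)

    # Predictions simply follow those branches down to a leaf.
    predict(fit, head(iris), type = "class")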

Read on for more.

Text Processing Tools and Methods

Ines Roldos takes us through several tools and techniques used in text processing:

Text processing is the process of analyzing and manipulating textual information. This includes extracting smaller bits of information from text (aka text extraction), assigning values or tags depending on the content (aka text classification), or performing calculations that depend on the textual information.

Since we naturally communicate in words, not numbers, companies receive a lot of raw text data via emails, chat conversations, social media, and other channels. This unstructured data is filled with insights and opinions about different topics, products, and services, but companies first need to organize, sort, and measure textual data to get access to this valuable information. One way to process text data is manually, which has been the most popular method – up until now.
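
As a concrete flavour of text extraction, here is a tiny base R sketch of my own (not from Ines's post) that pulls e-mail addresses out of raw messages:

    # Raw, unstructured text such as a company might receive.
    emails <- c(
      "Please contact support@example.com about the late order.",
      "Great service, no complaints here!"
    )

    # Extract any e-mail addresses mentioned in each message.
    pattern <- "[[:alnum:]._%-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
    regmatches(emails, gregexpr(pattern, emails))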

We’re still in the early days of text processing, but there have been some nice improvements over the past decade.

Cross-Validation Versus Regularization

Nina Zumel takes us through a pair of techniques for avoiding overfitting:

Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation?

The short answer: no, you shouldn’t. But as we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.”
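
If you want to see where cross-validation's computational cost comes from, here is a bare-bones k-fold loop in base R (my toy example, not Nina's code): the model gets re-fit once per fold.

    set.seed(42)
    n <- 100
    d <- data.frame(x = rnorm(n))
    d$y <- 2 * d$x + rnorm(n)

    k <- 5
    folds <- sample(rep(1:k, length.out = n))

    cv_err <- sapply(1:k, function(i) {
      train <- d[folds != i, ]
      test  <- d[folds == i, ]
      fit   <- lm(y ~ x, data = train)        # one fit per fold
      mean((test$y - predict(fit, test))^2)   # held-out squared error
    })

    mean(cv_err)  # cross-validated estimate of out-of-sample error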

Definitely worth the read.

Using pdqr for Statistical Uncertainty

Evgeni Chasnovski has a new CRAN package:

I am glad to announce that my latest, long-written R package ‘pdqr’ has been accepted to CRAN. It provides tools for creating, transforming and summarizing custom random variables with distribution functions (as base R ‘p*()’, ‘d*()’, ‘q*()’, and ‘r*()’ functions). You can read a brief overview in one of my previous posts.
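
If the p*/d*/q*/r*() naming is unfamiliar, this is the base R convention the quote refers to, shown here with the normal distribution (pdqr's own constructors are covered in Evgeni's overview):

    pnorm(1.96)   # p: cumulative distribution function, P(X <= 1.96)
    dnorm(0)      # d: density at a point
    qnorm(0.975)  # q: quantile function (inverse CDF)
    rnorm(3)      # r: draw random values

    # pdqr's pitch, as I read the quote, is to build this same quartet of
    # functions for your own custom random variables.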

Click through for a description of the package.

Important Assumptions with Linear Models

Sebastian Sauer takes us through two of the most important assumptions of linear models:

Additivity and linearity as the second most important assumptions in linear models
We assume that \(y\) is a linear function of the predictors. If \(y\) is not a linear function of the predictors, we cannot expect the model to deliver correct insights (predictions, causal coefficients). Let’s check an example.
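
As a quick simulated illustration of my own (not Sebastian's example), fitting a straight line to a genuinely quadratic relationship gives coefficients which are easy to compute and easy to misread:

    set.seed(1)
    x <- runif(200, -3, 3)
    y <- x^2 + rnorm(200, sd = 0.5)   # truly quadratic relationship

    fit <- lm(y ~ x)
    summary(fit)$coefficients   # slope near zero: the "insight" is misleading

    # The residuals-vs-fitted plot shows the curvature the linear model misses.
    plot(fit, which = 1)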

Read on to understand what this means, as well as the most important assumption.

Mocking Objects with R

The R-hub blog has an interesting post on creating mocks in R for unit testing:

In some of these cases, the programming concept you’re after is mocking, i.e. making a function act as if something were a certain way! In this blog post we shall offer a round-up of resources around mocking, or not mocking, when unit testing an R package.
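
The post surveys the proper tooling, but the core idea fits in a few lines of base R plus testthat. In this hand-rolled sketch of mine, fetch_from_api stands in for some hypothetical slow or external dependency:

    library(testthat)

    # Hypothetical function that would normally call a slow/external API;
    # the dependency is passed in as an argument so a test can swap in a fake.
    get_user_count <- function(fetch = fetch_from_api) {
      resp <- fetch()
      length(resp$users)
    }

    test_that("get_user_count counts users without touching the real API", {
      fake_fetch <- function() list(users = c("ada", "grace", "hedy"))
      expect_equal(get_user_count(fetch = fake_fetch), 3)
    })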

It’s interesting watching data scientists work through the same sorts of problems which traditional developers have hit, whether that be testing, deployment, or source control management. H/T R-bloggers

Re-Introducing rquery

John Mount has a new introduction to rquery:

rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of R’s base::transform(), or dplyr’s dplyr::mutate() and uses a pipe in the style popularized in R with magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.

The R/rquery version of this introduction is here, and the Python/data_algebra version of this introduction is here.
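
For readers who have not seen the style the quote describes, here is the base::transform() / dplyr::mutate() flavour of "simple data transforms in sequence" (deliberately not rquery's own syntax; see the linked introductions for that):

    d <- data.frame(x = 1:5, g = c("a", "a", "b", "b", "b"))

    # base R: one transform at a time
    transform(d, x2 = x * 2)

    # dplyr: the same idea, composed with a pipe
    library(dplyr)
    d %>%
      mutate(x2 = x * 2) %>%
      filter(g == "b")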

Check it out.

Evaluating a Classification Model with a Spam Filter

John Mount shares an extract from Mount and Nina Zumel’s Practical Data Science with R, 2nd Edition:

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.
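
A tiny sketch of my own (not an excerpt from the book) shows why evaluation can stand apart from construction: given actual and predicted labels from any model whatsoever, the evaluation step looks the same.

    actual    <- factor(c("spam", "spam", "ham", "ham", "spam", "ham"),
                        levels = c("ham", "spam"))
    predicted <- factor(c("spam", "ham",  "ham", "spam", "spam", "ham"),
                        levels = c("ham", "spam"))

    cm <- table(predicted, actual)   # confusion matrix
    cm

    precision <- cm["spam", "spam"] / sum(cm["spam", ])  # of flagged mail, how much was spam
    recall    <- cm["spam", "spam"] / sum(cm[, "spam"])  # of spam, how much we caught
    c(precision = precision, recall = recall)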

Click through for that extract. I liked the first edition of the book, so I’m looking forward to the 2nd.

Training, Validation, and Test Data Sets with SAS Viya

Beth Ebersole takes us through creating training, validation, and test data sets using SAS Viya:

Training data are used to fit each model. Training a model involves using an algorithm to determine model parameters (e.g., weights) or other logic to map inputs (independent variables) to a target (dependent variable). Model fitting can also include input variable (feature) selection. Models are trained by minimizing an error function.

For illustration purposes, let’s say we have a very simple ordinary least squares regression model with one input (independent variable, x) and one output (dependent variable, y). Perhaps our input variable is how many hours of training a dog or cat has received, and the output variable is the combined total of how many fingers or limbs we will lose in a single encounter with the animal.
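
Here is a rough base R version of the three-way split (my sketch, not Beth's SAS Viya workflow), scoring with the average squared error on held-out data:

    set.seed(2020)
    n <- 300
    d <- data.frame(x = runif(n, 0, 10))
    d$y <- 3 * d$x + rnorm(n, sd = 2)

    idx   <- sample(c("train", "valid", "test"), n, replace = TRUE,
                    prob = c(0.6, 0.2, 0.2))
    train <- d[idx == "train", ]
    valid <- d[idx == "valid", ]
    test  <- d[idx == "test", ]

    fit <- lm(y ~ x, data = train)            # fit on training data only
    mean((valid$y - predict(fit, valid))^2)   # average squared error for tuning/comparison
    mean((test$y  - predict(fit, test))^2)    # final check on the untouched test set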

Read on for some good notes, including the difference between mean squared error and average squared error.

Multiple Hypothesis Testing with R

Roland Stevenson shows how we can perform multiple hypothesis tests on data, as well as potential issues:

Both results show that evaluating two tests on the same family of data will lead to a ~10% chance that a researcher will claim a “significant” result if they look for either test to reject the null. Any claim there is a maximum 5% false positive rate would be mistaken. As an exercise, verify that doing the same on \(m=4\) tests will lead to an ~18% chance!

A bad testing platform would be one that claims a maximum 5% false positive rate when any one of multiple tests on the same family of data show significance at the 5% level. Clearly, if a researcher is going to claim that the FWER is no more than \(\alpha\), then they must control for the FWER and carefully consider how individual tests reject the null.
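
As a quick check of those figures, assuming independent tests at \(\alpha = 0.05\) (my sketch, not Roland's code):

    # Chance of at least one false positive across m independent tests.
    alpha <- 0.05
    1 - (1 - alpha)^2   # ~0.098, the ~10% figure for two tests
    1 - (1 - alpha)^4   # ~0.185, the ~18% figure for four tests

    # Or verify by simulation: run m tests on pure-noise data many times
    # and count how often any of them "rejects".
    set.seed(123)
    m <- 4
    sims <- replicate(10000, {
      p <- replicate(m, t.test(rnorm(30))$p.value)
      any(p < alpha)
    })
    mean(sims)   # should land near 0.185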

This is worth taking some time to read carefully. H/T R-Bloggers
