Press "Enter" to skip to content

Category: Data Science

Calculating Consistency of Ratings

Sebastian Sauer looks at computing reliability between raters:

Computing inter-rater reliability is a well-known, albeit maybe not very frequent task in data analysis. If there’s only one criteria and two raters, the proceeding is straigt forward; Cohen’s Kappa is the most widely used coefficient for that purpose. It is more challenging to compare multiple raters on one criterion; Fleiss’ Kappa is one way to get a coefficient. If there are multiple criteria, one way is to compute the mean of multiple Fleiss’ coefficients.

However, a different way, and the way presented in this post, consists of checking of all raters agree on one given item (and repeating that for all items). If rater A assigns two tags/criteria (tag1, tag2) to item A, then the other raters may not assign different tags (eg tag3, tag4) to that item, if a match should be scored. Note that this proceeding allows for different numbers of tags/criteria for the items (eg., item 1 with only 1 tag, but item 2 with 3 tags etc.). However, our grading should give some points, if, say, rater1 assigns tag1 and tag2, but raters 2 and 3 only assign tag1.

Read the whole thing.

Comments closed

Sampling and Estimating Rare Events

Yi Liu takes us through a process to estimate rare events:

Naturally, we get an unbiased estimate of the overall prevalence of violation if we sample the videos uniformly from the population and have them reviewed by human raters to estimate the proportion of violating videos. We also get an unbiased estimate of the violation rate in each policy vertical. But given the low probability of violation and wanting to use our rater capacity wisely, this is not an adequate solution — we typically have too few positive labels in uniform samples to achieve an accurate estimate of the prevalence, especially for those sensitive policy verticals. To obtain a relative error of no more than 20%, we need roughly 100 positive labels, and more often than not, we have zero violation videos in the uniform samples for rarer policies.

This is similar in nature to testing for rare diseases, where a random sample of N people in the population is likely to turn up 0 cases of it.

Comments closed

MAPE and Its Flaws

Jan Fischer takes us through Mean Absolute Percentage Error as a measure of forecast quality:

Particular small actual values bias the MAPE.
If any true values are very close to zero, the corresponding absolute percentage errors will be extremely high and therefore bias the informativity of the MAPE (Hyndman & Koehler 2006). The following graph clarifies this point. Although all three forecasts have the same absolute errors, the MAPE of the time series with only one extremely small value is approximately twice as high as the MAPE of the other forecasts. This issue implies that the MAPE should be used carefully if there are extremely small observations and directly motivates the last and often ignored the weakness of the MAPE.

Jan also points out a couple of things people criticize MAPE for incorrectly, but several things for which it is actually guilty. It’s not a bad measure if you can make certain data assumptions, but Jan has a few alternatives which tend to be better than MAPE.

Comments closed

Calculating AUC in R

Andrew Treadway shows how you can calculate Area Under the Curve in R:

AUC is an important metric in machine learning for classification. It is often used as a measure of a model’s performance. In effect, AUC is a measure between 0 and 1 of a model’s performance that rank-orders predictions from a model. For a detailed explanation of AUC, see this link.

Since AUC is widely used, being able to get a confidence interval around this metric is valuable to both better demonstrate a model’s performance, as well as to better compare two or more models. For example, if model A has an AUC higher than model B, but the 95% confidence interval around each AUC value overlaps, then the models may not be statistically different in performance. We can get a confidence interval around AUC using R’s pROC package, which uses bootstrapping to calculate the interval.

There are plenty of ways to calculate this useful metric, but this is definitely one of the easier methods. H/T R-bloggers

Comments closed

Python versus R (Again)

Alex Woodie looks at whether Python is dominating R in the data science space:

There is some evidence that Python’s popularity is hurting R usage. According to the TIOBE Index, Python is currently the third most popular language in the world, behind perennial heavyweights Java and C. From August 2018 to August 2019, Python usage surged by more than 3% to achieve a 10% rating (TIOBE’s proprietary metric that primarily measures search activity), easily the biggest gain among the 20 most popular languages.

R, by contrast, has not fared well lately on the TIOBE Index, where it dropped from 8th place in January 2018 to become the 20th most popular language today, behind Perl, Swift, and Go. At its peak in January 2018, R had a popularity rating of about 2.6%. But today it’s down to 0.8%, according to the TIOBE index.

I’ll say that rumors of R’s demise are premature.

Comments closed

Contrasting Logistic Regression and Decision Trees

Shital Katkar explains cases when you might use logistic regression or decision trees for classification problems:

Categorical data works well with Decision Trees, while continuous data work well with Logistic Regression.

If your data is categorical, then Logistic Regression cannot handle pure categorical data (string format). Rather, you need to convert it into numerical data.

Each algorithm has its own uses and assumptions.

Comments closed

Options with stats::density() in R

Evgeni Chasnovski takes us through what the parameters in the stats::density() R function do:

Argument bw is responsible for computing bandwith of kernel density estimation: one of the main parameters that greatly affect the output. It can be specified as either algorithm of computation or directly as number. Because actual bandwidth is computed as adjust*bw(adjust is another density() argument, which is explored in the next section), here we will see how different algorithms compute bandwidths, and the effect of changing numeric value of bandwidth will be shown in section about adjust.

There are 5 available algorithms: “nrd0”, “nrd”, “ucv”, “bcv”, “SJ”. 

Evgeni has also created animations for each of these, so it’s easy to see what they do compared to the default output.

Comments closed

Naive Bayes Predictions with Analysis Services

Dinesh Asanka shows how you can use the Naive Bayes algorithm in an Analysis Services data mining project:

Microsoft Naive Bayes is a classification supervised learning. This data set can be bi-class which means it has only two classes. Whether the patient is suffering from dengue or not or whether your customers are bike buyers or not, are an example of the bi-class data set. There can be multi-class data set as well.

Let us take the example which we discussed in the previous article, AdventureWorks bike buyer example. In this example, we will use vTargetMail database view in the AdventureWorksDW database.

During the data mining algorithm wizard, the Microsoft Naive Bayes algorithm should be selected as shown in the below image.

Of mild interest is that it’s a two-class classifier here, but it’s a multi-class classifier in the (much) later ML.NET.

Comments closed

Validating Errors in A/B Testing

Roland Stevenson shows us how to validate Type I and Type II errors when performing A/B tests in R:

In this post, we seek to develop an intuitive sense of what type I (false-positive) and type II (false-negative) errors represent when comparing metrics in A/B tests, in order to gain an appreciation for “peeking”, one of the major problems plaguing the analysis of A/B test today.

To better understand what “peeking” is, it helps to first understand how to properly run a test. We will focus on the case of testing whether there is a difference between the conversion rates cr_a and cr_b for groups A and B. We define conversion rate as the total number of conversions in a group divided by the total number of subjects. The basic idea is that we create two experiences, A and B, and give half of the randomly-selected subjects experience A and half B. Then, after some number of users have gone through our test, we measure how many conversions happened in each group. The important question is: how many users do we need to have in groups A and B in order to measure a difference in conversion rates of a particular size?

Read the whole thing. H/T R-Bloggers

Comments closed