Category: R

Explaining the ROC Plot

Nina Zumel takes us through what each element of a ROC curve means:

In our data science teaching, we present the ROC plot (and the area under the curve of the plot, or AUC) as a useful tool for evaluating score-based classifier models, as well as for comparing multiple such models. The ROC is informative and useful, but it’s also perhaps overly concise for a beginner. This leads to a lot of questions from the students: what does the ROC tell us about a model? Why is a bigger AUC better? What does it all mean?

Read on for the answer.
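
One way to see why a bigger AUC is better: the AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. Here is a minimal Python sketch of that pairwise view (the linked post works in R). The function name `auc_by_pairs` and the toy scores are my own, made up for illustration, and are not taken from Nina's post.

```python
from itertools import product

def auc_by_pairs(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores and true labels, invented for this example.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auc_by_pairs(scores, labels))
```

A model that ranks every positive above every negative scores 1.0 here, and one that ranks at random hovers around 0.5.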

Fun with Benford’s Law

Nagdev Amruthnath covers a topic which brings me joy:

Benford’s Law is one of the most underrated yet widely used techniques, with applications in many fields. The United States IRS neither confirms nor denies its use of Benford’s law to detect manipulation in income tax filings. Across the Atlantic, the EU is very open and proudly claims its use of Benford’s law. Today, it is widely used in accounting to detect fraud. Nigrini, a professor at the University of Cape Town, also used this law to identify financial discrepancies in Enron’s financial statements. In another case, Jennifer Golbeck, a professor at the University of Maryland, was able to identify bot accounts on Twitter using Benford’s law. Xiaoyu Wang from the University of Winnipeg even published a report on how to use Benford’s law on images. In the rest of this article, we will talk about Benford’s law and how it can be applied using R.

The applications to images and music were new to me. Very cool. H/T R-Bloggers
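
For anyone who has not seen the law itself: Benford's law says the leading digit d shows up with probability log10(1 + 1/d), so about 30% of values start with a 1. Below is a small Python sketch (the linked post uses R) that compares that expectation against observed leading digits. The helper names and the choice of powers of 2 as test data are my assumptions, not something from Nagdev's article.

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit d under Benford's law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_shares(values):
    """Observed share of each leading digit among positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Assumed test data: the first 1,000 powers of 2, a sequence known to follow Benford's law.
values = [2 ** k for k in range(1, 1001)]
expected = benford_expected()
observed = leading_digit_shares(values)

# Print digit, expected share, observed share.
for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

In a fraud-detection setting, the interesting part is a large gap between the two columns, which would then call for a closer look rather than serve as proof of anything.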

Covariance and Multicollinearity

Mattan Ben-Shachar gives us an intuitive understanding of multicollinearity and how it can affect an analysis:

The common and almost default approach is to fix age to a constant. This is really what our model does in the first place: the coefficient of height represents the expected change in weight while age is fixed and not allowed to vary. What constant? A natural candidate (and indeed emmeans’ default) is the mean. In our case, the mean age is 14.9 years. So the expected values produced above are for three 14.9 year olds with different heights. But is this data plausible? If I told you I saw a person who was 120cm tall, would you also assume they were 14.9 years old?

No, you would not. And that is exactly what covariance and multicollinearity mean – that some combinations of predictors are more likely than others.

I liked the explanation Mattan provides us. Also be sure to read the warnings near the end of the post around other things to try. H/T R-bloggers

Web-Optimized ggplot2 Themes

Petr Baranovskiy shares a few new themes:

This will be a very short post compared to the detailed stuff I usually write. Just what it says on the tin – I made some tweaks to my three favorite {ggplot2} themes – theme_bw(), theme_classic(), and theme_void() – to make the graphics more readable and generally look better when posted online, particularly in blog posts. Please feel free to borrow and use.

Also, I will be frequently using these themes in subsequent posts, and I’d like to be able to point readers here with a hyperlink instead of repeatedly posting the whole theme_web_…() code every time I am writing a post.

Click through for the definition of each theme. H/T R-Bloggers

EXTPTR_PTR Error with Rcpp

Rick Pack walks us through an error in R:

I experienced a need to update Rcpp when I attempted to install the readxlsb R package, which promised to enable me to read .xlsb files in R.

What happened next has been forgotten: Did the attempted update of Rcpp appear to succeed or fail? I did record that my attempted installation of readxlsb still failed and I now experienced an unfamiliar error when I opened and closed RStudio:

“The procedure entry point EXTPTR_PTR could not be located in the dynamic link library”

Read on to see how Rick solved this problem.

Credential and Secrets Management in R

Bernardo Lares walks us through some good practices around managing credentials and secrets in R:

I have several functions that live in my public lares library that use get_creds() to fetch my secrets. Some of them are used as credentials to query databases, send emails with API services such as Mailgun, ping notifications using Slack’s webhook, interact with Google Sheets programmatically, fetch Facebook and Twitter’s API stuff, Typeform, Github, Hubspot… I even have a portfolio performance report for my personal investments. If you check the code underneath, you won’t find credentials written anywhere but the code will actually work (for me and for anyone that uses the library). So, how can we accomplish this?

Read on to learn how.

Choroplethr 3.6.4 on CRAN

Ari Lamstein announces that Choroplethr version 3.6.4 is now on CRAN:

Choroplethr v3.6.4 is now on CRAN. This is the first update to the package in two years, and was necessary because of a recent change to the tigris package, which choroplethr uses to make Census Tract maps. I also took this opportunity to add new example demographic data for Census Tracts.

Read on for a listing of the updates, examples, and a request from Ari to help keep the project up to date by finding a suitable sponsor. H/T R-Bloggers

Optimizing a Poisson Survival Model

Joshua Entrop shows off optimx() in R to perform a survival analysis:

In this blog post, we will fit a Poisson regression model by maximising its likelihood function using optimx() in R. As an example we will use the lung cancer data set included in the {survival} package. The data set includes information on 228 lung cancer patients from the North Central Cancer Treatment Group (NCCTG). Specifically, we will estimate the survival of lung cancer patients by sex and age using a simple Poisson regression model. You can download the code that I will use throughout the post here.

Read the whole thing. H/T R-bloggers
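
To give a feel for what "maximising the likelihood" involves, here is a rough Python sketch of a Poisson rate model with a log link and follow-up time as exposure. It is not Joshua's code: he uses the general-purpose optimiser optimx() in R on the lung data, while this sketch swaps in a hand-rolled Newton-Raphson loop, a single binary covariate, and a handful of made-up records. The function name `fit_poisson_rate` is hypothetical.

```python
import math

def fit_poisson_rate(times, events, x, iters=25):
    """Maximise the Poisson log-likelihood
        sum(d * (b0 + b1 * x + log(t)) - t * exp(b0 + b1 * x))
    over (b0, b1) using Newton-Raphson steps.
    """
    b0 = math.log(sum(events) / sum(times))  # start at the overall event rate
    b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for t, d, xi in zip(times, events, x):
            mu = t * math.exp(b0 + b1 * xi)  # expected events for this record
            g0 += d - mu                     # gradient with respect to b0
            g1 += (d - mu) * xi              # gradient with respect to b1
            h00 += mu                        # entries of the information matrix
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up records: follow-up time in days, event indicator, binary group.
times = [100, 200, 150, 300, 250, 120]
events = [1, 1, 0, 1, 0, 1]
group = [0, 0, 0, 1, 1, 1]

b0, b1 = fit_poisson_rate(times, events, group)
print("baseline rate per day:", math.exp(b0))
print("rate ratio, group 1 vs group 0:", math.exp(b1))
```

With a single binary covariate the answer also has a closed form (events divided by person-time in each group), which makes a handy sanity check on the optimiser.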

The Basics of Randomized Response

Holger von Jouanne-Diedrich explains how randomized response can protect any single person’s opinion from a pollster while providing insight on the whole population:

So, is there a method to find the respective proportion of people without putting them on the spot? Actually, there is! If you want to learn about randomized response (and how to create flowcharts in R along the way) read on!

The question is how can you get a truthful result overall without being able to attribute a certain answer to any single individual. As it turns out, there is a very elegant and ingenious method, called randomized response. The big idea is to, as the name suggests, add noise to every answer without compromising the overall proportion too much, i.e. add noise to every answer so that it cancels out overall!

Click through for the process. It’s definitely a clever idea.
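
If you want to see the noise cancel out before clicking through, here is a short Python simulation of one common randomized response variant, which may differ from the exact scheme in Holger's post. In this assumed version, each respondent privately flips a coin: heads means answer truthfully, tails means flip again and report yes or no based on that second flip. The share of yes answers is then 0.5 * p + 0.25, which we can invert to recover p. All names and numbers below are invented for illustration.

```python
import random

def simulate_survey(true_share, n, rng):
    """Simulate n respondents under the assumed two-coin scheme."""
    answers = []
    for _ in range(n):
        truth = rng.random() < true_share      # the respondent's real answer
        if rng.random() < 0.5:                 # first coin: heads, answer truthfully
            answers.append(truth)
        else:                                  # tails: report a second coin flip instead
            answers.append(rng.random() < 0.5)
    return answers

def estimate_share(answers):
    """Invert P(yes) = 0.5 * p + 0.25 to estimate the true share p."""
    yes_share = sum(answers) / len(answers)
    return 2 * (yes_share - 0.25)

rng = random.Random(42)                        # fixed seed so the run is repeatable
answers = simulate_survey(true_share=0.30, n=100_000, rng=rng)
estimate = estimate_share(answers)
print(round(estimate, 3))
```

No single yes tells the pollster anything definite about that person, since it may have come from the second coin, yet the estimate for the whole population lands close to the true share.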

Sparklyr 1.3 Released

Yitao Li announces sparklyr 1.3:

sparklyr 1.3 is now available on CRAN, with the following major new features:

- Higher-order Functions to easily manipulate arrays and structs
- Support for Apache Avro, a row-oriented data serialization framework
- Custom Serialization using R functions to read and write any data format
- Other Improvements such as compatibility with EMR 6.0 & Spark 3.0, and initial support for Flint time series library

Between this and the work from the Spark side, we are seeing some nice quality of life improvements for Spark and R.
