Press "Enter" to skip to content

Category: R

Using OAuth 2 in R Packages

Maelle Salmon explains how OAuth 2 works and also how you can use it in R packages:

When writing an R package wrapping an API using OAuth 2.0 you’ll need the user to grant access to an “app”, which will allow to create an access token and a refresh token. The access token will then often be passed to the API in a header when making requests, whilst the refresh token would be posted in a query string when the access token needs to be renewed.

Your problem is: how do I imitate a third-party app? Thankfully for you, in most cases the complexity can be handled by the httr package. For other cases, or if you want to e.g. only use curl, you will have to get creative. 

Read on for more detail.

Comments closed

AzureCosmosR

Hong Ooi takes us through an R library for working with Cosmos DB:

Among other features, Azure Cosmos DB is notable in that it supports multiple data models and APIs. When you create a new Cosmos DB account, you specify which API you want to use: SQL/core API, which lets you use a dialect of T-SQL to query and manage tables and documents; MongoDB; Azure table storage; Cassandra; or Gremlin (graph). AzureCosmosR provides a comprehensive interface to the SQL API, as well as bridges to the MongoDB and table storage APIs. On the Resource Manager side, AzureCosmosR extends the AzureRMR class framework to allow creating and managing Cosmos DB accounts.

AzureCosmosR is now available on CRAN. You can also install the development version from GitHub, with devtools::install_github("Azure/AzureCosmosR").

Hong provides examples for us using three of the Cosmos DB APIs, so check it out.

Comments closed

Gradient Descent in R

Holger von Jouanne-Diedrich lays out the basics of gradient descent:

Gradient Descent is a mathematical algorithm to optimize functions, i.e. finding their minima or maxima. In Machine Learning it is used to minimize the cost function of many learning algorithms, e.g. artificial neural networks a.k.a. deep learning. The cost function simply is the function that measures how good a set of predictions is compared to the actual values (e.g. in regression problems).

The gradient (technically the negative gradient) is the direction of steepest descent. Just imagine a skier standing on top of a hill: the direction which points into the direction of steepest descent is the gradient!

Click through for an example in R.

Comments closed

Updates to AzureR

Hong Ooi has some updates for us:

This is an update on what’s been happening with the AzureR suite of packages. First, you may have noticed that just before the holiday season, the packages were updated on CRAN to change their maintainer email to a non-Microsoft address. This is because I’ve left Microsoft for a role at Westpac bank here in Australia; while I’m sad to be leaving, I do intend to continue maintaining and updating the packages.

To that end, here are the changes that have recently been submitted to CRAN, or will be shortly:

Read on for the changes. This includes a new package to work with Cosmos DB from R.

Comments closed

Countdown Number Puzzle

Tomaz Kastrun has a fun puzzle for us:

So the game is (was) known as a TV show where then host would give a random 3-digit number and the contestants would draw 6 random numbers from stack of numbers. Given the time limit, the winner was the one who would create a formula matching the result or being closest.

Many ways, tips, tricks and optimisations were already considered, maybe the most famous was the Reverse Polish notation where operators follow their operands and is a great fit for the game.

With useless functionality, I have decided to use permuteGeneral function from RcppAlgos or same functionality could be achieved with combn function.

Click through to see it in action.

Comments closed

Smoothing and its Inherent Risks

John Mount would like you to take care when using smoothers:

Here is a quick data-scientist / data-analyst question: what is the overall trend or shape in the following noisy data? For our specific example: How do we relate value as a noisy function (or relation) of m? This example arose in producing our tutorial “The Nature of Overfitting”.

One would think this would be safe and easy to asses in R using ggplot2::geom_smooth(), but now we are not so sure.

Here’s a quick summary of my general philosophy: the data are more interesting than a smoothed line. I’m okay putting in a smoothed line to help a reader make sense of a trend, but I wouldn’t want to have a plot with just the smoothed line. Read the whole thing from John to get well beyond my rule of thumb.

Comments closed

The Nature of Overfitting

John Mount has a nice essay on overfitting:

What is meant by “overfitting” is: the estimated f() will tend to show off or over perform on the data used to fit, train, or construct it. I have some notes on this sort of selection bias here: https://win-vector.com/2020/12/10/overfit-and-reversion-to-mediocrity-the-bane-of-data-science/.

Selecting a model that “looks good” is enough to bias the model’s evaluation with respect to the data set we said it “looked good” on. So even when using unbiased methods, the data scientist can introduce bias by choosing to use one model (say the one fit by logistic regression) over another (say using using an observed prevalence everywhere as a probability prediction).

The way I talk about overfitting is to say that we’ve trained a model which latches onto the particulars of the training data set. To the extent that the particulars of the training data set are matched by the broader world, that’s “fitting.” To the extent that the particulars of the training data set are unique to that data set and are not generally applicable, that’s “overfitting.” Generally, I don’t have any more time to get into what this means, but John dives into the topic in an accessible way.

Comments closed

Basic Theory on Correlation Analysis, Using R

Petr Baranovskiy wants to take us through the key concepts of correlation analysis, starting with basic theory:

When I was learning statistics, I was surprised by how few learning materials I personally found to be clear and accessible. This might be just me, but I suspect I am not the only one who feels this way. Also, everyone’s brain works differently, and different people would prefer different explanations. So I hope that this will be useful for people like myself – social scientists and economists – who may need a simpler and more hands-on approach.

These series are based on my notes and summaries of what I personally consider some the best textbooks and articles on basic stats, combined with the R code to illustrate the concepts and to give practical examples. Likely there are people out there whose cognitive processes are similar to mine, and who will hopefully find this series useful.

This is clear and well-written, so check it out even if you feel like you have a solid understanding of the topic.

Comments closed

Naive Bayes and Continuous Predictor Variables

Akhila takes us through the intuition of how Naive Bayes works:

Usually we use the e1071 package to build a Naive Bayes classifier in R. And then using this classifier, we make some predictions on the training data.

So probability for these predictions can be directly calculated based on frequency of occurrences if the features are categorical.
But what if, there are features with continuous values? What the Naive Bayes classifier is actually doing behind the scenes to predict the probabilities of continuous data?

Click through for the answer. Also, Naive Bayes isn’t Bayesian, but that’s not important.

Comments closed