R – Page 61 – Curated SQL

Reporting on Correlation Analysis in R

Published 2021-02-02 by Kevin Feasel

Petr Baranovskiy continues a series on correlation analysis using R:

This is the second part of the Correlation Analysis in R series. In this post, I will provide an overview of some of the packages and functions used to perform correlation analysis in R, and will then address reporting and visualizing correlations as text, tables, and correlation matrices in online and print publications.

Read the whole thing.

Comments closed

Model Post-Processing with insight

Published 2021-02-01 by Kevin Feasel

The easystats team talks about the insight package in R:

We are talking about the insight package. It is what allows other packages, like easystats (parameters, effectsize, performance, report, …) or ggstatsplot, sjstats or modelsummary to be as powerful as they are, supporting tons of different R models. So why make you life hard when you can be like them, and rely on insight?
It is made for developers (and users) that do some postprocessing of different models (e.g., extracting stuff like parameters, values, data, names, specifications, predictions, priors, etc.), whether it is to nicely display their results or to do further computation.

Click through for an example of what it does and how it works. H/T R-bloggers

Comments closed

Determining a Good Test Set Size

Published 2021-02-01 by Kevin Feasel

John Mount thinks about test set size:

In this note we will answer “what is a good test set size?” three ways.
– The usual practical answer.
– A decision theory answer.
– A novel variational answer.
Each of these answers is a bit different, as they are solved in slightly different assumed contexts and optimizing different objectives. Knowing all 3 solutions gives us some perspective on the problem.

My rule of thumb is that I want it to be as small as possible while containing the highest likelihood of hitting all real-world scenarios enough times to provide a valid comparison. This conversely maximizes the size of the training data set, giving us the best chance of seeing the widest variety of scenarios we can during the formative phase.

And as usual, John goes way deeper than my rules of thumb. I like this post a lot.

Comments closed

Using OAuth 2 in R Packages

Published 2021-01-25 by Kevin Feasel

Maelle Salmon explains how OAuth 2 works and also how you can use it in R packages:

When writing an R package wrapping an API using OAuth 2.0 you’ll need the user to grant access to an “app”, which will allow to create an access token and a refresh token. The access token will then often be passed to the API in a header when making requests, whilst the refresh token would be posted in a query string when the access token needs to be renewed.
Your problem is: how do I imitate a third-party app? Thankfully for you, in most cases the complexity can be handled by the httr package. For other cases, or if you want to e.g. only use curl, you will have to get creative.

Read on for more detail.

Comments closed

AzureCosmosR

Published 2021-01-22 by Kevin Feasel

Hong Ooi takes us through an R library for working with Cosmos DB:

Among other features, Azure Cosmos DB is notable in that it supports multiple data models and APIs. When you create a new Cosmos DB account, you specify which API you want to use: SQL/core API, which lets you use a dialect of T-SQL to query and manage tables and documents; MongoDB; Azure table storage; Cassandra; or Gremlin (graph). AzureCosmosR provides a comprehensive interface to the SQL API, as well as bridges to the MongoDB and table storage APIs. On the Resource Manager side, AzureCosmosR extends the AzureRMR class framework to allow creating and managing Cosmos DB accounts.
AzureCosmosR is now available on CRAN. You can also install the development version from GitHub, with devtools::install_github("Azure/AzureCosmosR").

Hong provides examples for us using three of the Cosmos DB APIs, so check it out.

Comments closed

Gradient Descent in R

Published 2021-01-20 by Kevin Feasel

Holger von Jouanne-Diedrich lays out the basics of gradient descent:

Gradient Descent is a mathematical algorithm to optimize functions, i.e. finding their minima or maxima. In Machine Learning it is used to minimize the cost function of many learning algorithms, e.g. artificial neural networks a.k.a. deep learning. The cost function simply is the function that measures how good a set of predictions is compared to the actual values (e.g. in regression problems).
The gradient (technically the negative gradient) is the direction of steepest descent. Just imagine a skier standing on top of a hill: the direction which points into the direction of steepest descent is the gradient!

Click through for an example in R.

Comments closed

Updates to AzureR

Published 2021-01-14 by Kevin Feasel

Hong Ooi has some updates for us:

This is an update on what’s been happening with the AzureR suite of packages. First, you may have noticed that just before the holiday season, the packages were updated on CRAN to change their maintainer email to a non-Microsoft address. This is because I’ve left Microsoft for a role at Westpac bank here in Australia; while I’m sad to be leaving, I do intend to continue maintaining and updating the packages.
To that end, here are the changes that have recently been submitted to CRAN, or will be shortly:

Read on for the changes. This includes a new package to work with Cosmos DB from R.

Comments closed

Countdown Number Puzzle

Published 2021-01-13 by Kevin Feasel

Tomaz Kastrun has a fun puzzle for us:

So the game is (was) known as a TV show where then host would give a random 3-digit number and the contestants would draw 6 random numbers from stack of numbers. Given the time limit, the winner was the one who would create a formula matching the result or being closest.
Many ways, tips, tricks and optimisations were already considered, maybe the most famous was the Reverse Polish notation where operators follow their operands and is a great fit for the game.
With useless functionality, I have decided to use permuteGeneral function from RcppAlgos or same functionality could be achieved with combn function.

Click through to see it in action.

Comments closed

Smoothing and its Inherent Risks

Published 2021-01-11 by Kevin Feasel

John Mount would like you to take care when using smoothers:

Here is a quick data-scientist / data-analyst question: what is the overall trend or shape in the following noisy data? For our specific example: How do we relate value as a noisy function (or relation) of m? This example arose in producing our tutorial “The Nature of Overfitting”.
One would think this would be safe and easy to asses in R using ggplot2::geom_smooth(), but now we are not so sure.

Here’s a quick summary of my general philosophy: the data are more interesting than a smoothed line. I’m okay putting in a smoothed line to help a reader make sense of a trend, but I wouldn’t want to have a plot with just the smoothed line. Read the whole thing from John to get well beyond my rule of thumb.

Comments closed

The Nature of Overfitting

Published 2021-01-08 by Kevin Feasel

John Mount has a nice essay on overfitting:

What is meant by “overfitting” is: the estimated f() will tend to show off or over perform on the data used to fit, train, or construct it. I have some notes on this sort of selection bias here: https://win-vector.com/2020/12/10/overfit-and-reversion-to-mediocrity-the-bane-of-data-science/.
Selecting a model that “looks good” is enough to bias the model’s evaluation with respect to the data set we said it “looked good” on. So even when using unbiased methods, the data scientist can introduce bias by choosing to use one model (say the one fit by logistic regression) over another (say using using an observed prevalence everywhere as a probability prediction).

The way I talk about overfitting is to say that we’ve trained a model which latches onto the particulars of the training data set. To the extent that the particulars of the training data set are matched by the broader world, that’s “fitting.” To the extent that the particulars of the training data set are unique to that data set and are not generally applicable, that’s “overfitting.” Generally, I don’t have any more time to get into what this means, but John dives into the topic in an accessible way.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

Category: R