Press "Enter" to skip to content

Category: R

Repeated Cross-Validation in R

Ludvig Olsen walks us through a couple of nice R packages:

The benefits of using groupdata2 to create the folds are 1) that it allows us to balance the ratios of our output classes (or simply a categorical column, if we are working with linear regression instead of classification), and 2) that it allows us to keep all observations with a specific ID (e.g. participant/user ID) in the same fold to avoid leakage between the folds.

The benefit of cvms is that it trains all the models and outputs a tibble (data frame) with results, predictions, model coefficients, and other sweet stuff, which is easy to add to a report or do further analyses on. It even allows us to cross-validate multiple model formulas at once to quickly compare them and select the best model.

Ludvig also gives us some examples of how both packages can help you out. H/T R-Bloggers

Comments closed

Microsoft R Open 3.5.2 and 3.5.3

David Smith announces Microsoft R Open 3.5.2 and reveals when 3.5.3 comes out:

It’s taken a little bit longer than usual, but Microsoft R Open 3.5.2 (MRO) is now available for download for Windows and Linux. This update is based on R 3.5.2, and accordingly fixes a few minor bugs compared to MRO 3.5.1. The main change you will note is that new CRAN packages released since R 3.5.1 can now be used with this version of MRO.

David also lets us know that they’re working on 3.6.0’s release.

Comments closed

Exploratory Data Analysis on Categorical Variables

Giorgio Garziano continues digging into earthquake data:

To understand relationship or dependencies among categorical variables, we take advantage of various types of tables and graphical methods. Also stratifying variables can be encompassed in order to highlight if the relationship between two primary variables is the same or different for all levels of the stratifying variable under consideration.

The contingency table are said to be of one-way flavor when involving just one categorical variable. They are said two-way when involving two categorical variables, and so on (N-way).

Read on for various techniques for data analysis against categorical variables.

Comments closed

UpSet Plots for Set Analysis

Laura Ellis digs into the UpSetR package:

UpSet plots have a very cool parameter called queries. Queries can be used to define a subset of the data that you would like to highlight in your graph. The queries property takes in a list of query lists which means that you can pass multiple queries into the same graph. Each query list allows you to set a number of properties about how the query should function.

In this example we are viewing the Cycle and Walk set intersection (query and params). We want the query to be highlighted in a nice pink (color). We want to display the query as a highlighted overlap (active) and we will give it a name that we add to the chart legend (query.name)

I’ve not seen an UpSet plot before but it dumps a lot of information into a relatively small space. I’ll have to spend some time learning more about these plots.

Comments closed

AzureGraph: Microsoft Graph in R

Hong Ooi takes us through AzureGraph:

Microsoft Graph is a comprehensive framework for accessing data in various online Microsoft services, including Azure Active Directory (AAD), Office 365, OneDrive, Teams, and more. AzureGraph is an R package that provides a simple R6-based interface to the Graph REST API, and is the companion package to AzureRMR and AzureAuth.

Currently, AzureGraph aims to provide an R interface only to the AAD part, with a view to supporting R interoperability with Azure: registered apps and service principals, users and groups. Like AzureRMR, it could potentially be extended to support other services.

Just to clarify, this is like Facebook Graph API for Azure components, not a graph database that you can store your own data in.

Comments closed

Data Layout in R with cdata

John Mount takes us through a few sample problems and how to reshape data with cdata:

This may seem like a lot of steps, but it is only because we are taking the problems very slowly. The important point is that we want to minimize additional problem solving when applying the cdata methodology. Usually when you need to transform data you are in the middle of some other more important task, so you want to delegate the details of how the layout transform is implemented. With cdata the user is not asked to perform additional puzzle solving to guess a sequence of operators that may implement the desired data layout transform. The cdata solution pattern is always the same, which can help in mastering it.

With cdata, record layout transforms are simple R objects with detailed print() methods- so they are convenient to alter, save, and re-use later. The record layout transform also documents the expected columns and constants of the incoming data.

Check it out.

Comments closed

Exploratory Analysis of Earthquake Data

Giorgio Garziano walks us through an earthquake data set:

Boxplots for each quantitative variables are shown. We take advantage of the quantitative variable names (quantitative_vars) determined before to apply a ggplot2 package based boxplot function. The Y axis labeling and title are determined by the variable to be plot. Further, legend is not displayed and we adopt the coordinate flip option for improved readability.

Check it out to get an idea of how to do exploratory data analysis.

Comments closed

Naive Bays in R

Zulaikha Lateef takes us through the Naive Bayes algorithm and implementations in R:

Naive Bayes is a Supervised Machine Learning algorithm based on the Bayes Theorem that is used to solve classification problems by following a probabilistic approach. It is based on the idea that the predictor variables in a Machine Learning model are independent of each other. Meaning that the outcome of a model depends on a set of independent variables that have nothing to do with each other. 

Naive Bayes is one of the simplest algorithms available and yet it works pretty well most of the time. It’s almost never the best solution but it’s typically good enough to give you an idea of whether you can get a job done.

Comments closed