Category: R

Reporting On Unit Tests In R With covrpage

Maelle Salmon recaps Locke Data’s involvement with the covrpage package:

To read more about getting started with covrpage in your own package in only a few lines of code, we recommend checking out the “get started” vignette. It explains in more detail how to set up the Travis deploy, mentions which functions power the covrpage report, and gives more motivation for using covrpage.
And to learn how the information provided by covrpage should be read, read the “How to read the covrpage report” vignette.

Check it out.

The Intuition Behind Principal Component Analysis

Holger von Jouanne-Diedrich gives us an intuition behind how principal component analysis (PCA) works:


Principal component analysis (PCA) is a dimension-reduction method that can be used to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which still contain most of the information.
PCA is a concept that is traditionally hard to grasp so instead of giving you the n’th mathematical derivation I will provide you with some intuition.
Basically PCA is nothing else but a projection of some higher dimensional object into a lower dimension. What sounds complicated is really something we encounter every day: when we watch TV we see a 2D-projection of 3D-objects!

Click through for the rest of the story.
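To make the projection idea concrete, here is a minimal sketch using base R's `prcomp()` on the built-in `iris` data. The choice of data set and of keeping two components is an assumption for illustration, not something taken from the original post.

```r
# Minimal PCA sketch with base R (stats::prcomp) on the built-in iris data.
# The data set and the number of components kept are illustrative choices.
measurements <- iris[, 1:4]   # four correlated numeric columns

# Center and scale so that each variable contributes on the same footing
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each principal component
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(variance_explained, 3))

# The 2D "shadow" of the 4D data: scores on the first two components
projection <- pca$x[, 1:2]
print(head(projection))
```

The columns of `pca$x` are uncorrelated with one another, which is the sense in which the principal components form a smaller set of uncorrelated variables.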

Working With Missing Values In R

Anisa Dhana has a few examples of ways we can work with data containing missing values in R:

Imputation is a complex process that requires a good knowledge of your data. For example, it is crucial to know whether the data are missing at random or not before you impute them. I have read a nice tutorial which visualizes the missing data and helps to understand the type of missingness, and another post showing how to impute the data with the MICE package.

In this short post, I will focus on management of the missing data using the tidyverse package. Specifically, I will show how to manage missing values in the long data format (i.e., more than one observation per id).

Anisa shows a few different techniques, depending upon what you need to do with the data. I’d caution against using the mean in the second example and typically prefer the median instead: the median is less sensitive to outliers and skew, so the filled-in values won’t be dragged toward a few extreme observations the way they can be with the mean.
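As a rough illustration of the median-versus-mean point, here is a small base R sketch. The data frame and the helper `impute_by_group()` are hypothetical and not taken from Anisa's post, which uses the tidyverse; base R is used here only to keep the example dependency-free.

```r
# Hypothetical toy data in long format: several observations per id.
# id 1 contains one extreme value (100) to show how mean and median differ.
scores <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2),
  value = c(1, 2, NA, 100, 3, NA, 5)
)

# Hypothetical helper: fill each NA with a statistic computed within its group
impute_by_group <- function(x, groups, stat = median) {
  # ave() returns, for every row, the statistic of that row's group
  fill <- ave(x, groups, FUN = function(v) stat(v, na.rm = TRUE))
  ifelse(is.na(x), fill, x)
}

scores$median_filled <- impute_by_group(scores$value, scores$id, median)
scores$mean_filled   <- impute_by_group(scores$value, scores$id, mean)
print(scores)
```

For id 1, the median-based fill stays close to the typical values, while the mean-based fill is pulled upward by the single extreme observation.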

Gradient Boosting And XGBoost

Shirin Glander has another English-language transcript from a German video, this time covering gradient boosting techniques:

Let’s look at how Gradient Boosting works. Most of the magic is described in the name: “Gradient” plus “Boosting”.

Boosting builds models from individual so-called “weak learners” in an iterative way. In the Random Forests part, I had already discussed the differences between Bagging and Boosting as tree ensemble methods. In boosting, the individual models are not built on completely random subsets of data and features but sequentially by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances that are hard to predict correctly (“difficult” cases) will be focused on during learning, so that the model learns from past mistakes. When we train each ensemble on a subset of the training set, we also call this Stochastic Gradient Boosting, which can help improve generalizability of our model.

The gradient is used to minimize a loss function, similar to how Neural Nets utilize gradient descent to optimize (“learn”) weights. In each round of training, the weak learner is built and its predictions are compared to the correct outcome that we expect. The distance between prediction and truth represents the error rate of our model. These errors can now be used to calculate the gradient. The gradient is nothing fancy, it is basically the partial derivative of our loss function – so it describes the steepness of our error function. The gradient can be used to find the direction in which to change the model parameters in order to (maximally) reduce the error in the next round of training by “descending the gradient”.

Along with neural networks, gradient boosting has become one of the dominant algorithms for machine learning, and is well worth learning about.
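The loop described above can be sketched in a few lines of base R. This is a toy version for regression with squared-error loss, using one-split "stumps" as weak learners; the data, the helper names `fit_stump()` and `predict_stump()`, and the learning rate are all assumptions for illustration, and it is not how XGBoost itself is implemented. For squared-error loss, the negative gradient with respect to the current predictions is simply the residual, which is why each round fits the residuals.

```r
# Toy gradient boosting for regression with squared-error loss.
# Weak learners are one-split "stumps"; everything here is illustrative.
set.seed(42)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.1)

# Hypothetical helper: find the split on x that best fits the target r
fit_stump <- function(x, r) {
  candidates <- head(sort(unique(x)), -1)   # keeps both sides non-empty
  sse <- vapply(candidates, function(s) {
    left  <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  s <- candidates[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

# Hypothetical helper: piecewise-constant prediction from a fitted stump
predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

learning_rate <- 0.1
n_rounds <- 100
prediction <- rep(mean(y), length(y))   # start from a constant model
initial_mse <- mean((y - prediction)^2)

for (i in seq_len(n_rounds)) {
  # Negative gradient of 1/2 * (y - prediction)^2 is the residual
  residual <- y - prediction
  stump <- fit_stump(x, residual)
  # Take a small step in the direction the weak learner suggests
  prediction <- prediction + learning_rate * predict_stump(stump, x)
}

final_mse <- mean((y - prediction)^2)
cat("Training MSE before boosting:", initial_mse, "\n")
cat("Training MSE after boosting: ", final_mse, "\n")
```

The small learning rate means each stump only nudges the prediction, so many weak learners together build up the fit. The numbers printed are training errors and say nothing about generalization.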

Visualizing Traditional Japanese Color Palettes

Chisato den Engelsen looks at 465 traditional color palettes used in Japan:

Since each of the colours had a name, I was also curious whether some characters are used more often than others. Colour names were written in two ways on this website: one in Kanji and the other in Hiragana.

I love wordcloud2 for visualizing word clouds, so I can see which characters appear more often than others.

It’s an interesting exercise and all of the R code is included. Be sure to check out the list of colors with a character representing “rat” or “mouse” in the name. H/T R-Bloggers

Azure SQL Database Supports R Integration

David Smith notes that Azure SQL Database now has (in preview) support for R:

Azure SQL Database, the database-as-a-service based on Microsoft SQL Server, now offers R integration. (The service is currently in preview; details on how to sign up for the preview are provided in that link.) While you’ve been able to run R in SQL Server in the cloud since the release of SQL Server 2016 by running a virtual machine, Azure SQL Database is a fully-managed instance that doesn’t require you to set up and maintain the underlying infrastructure. You just choose the size and scale of the database you want to manage, and then connect to it like any other SQL Server instance. (If you want to learn how to set up an Azure SQL database, this Microsoft Learn module is a good place to start.)

Python and Java are not yet supported, but I’d imagine that they’ll be on the way too.

Using R In Power BI For More Than Displaying Visuals

Patrick Mahoney shows us that you can do more with the R Visual component in Power BI than display visuals:

If you really like a certain R visual, you can also package it as a pbiviz file to share with others. Once you set up the foundation to create the first pbiviz, it is easy to crank out many more just by replacing the R code and repackaging it (into a different pbiviz file). See instruction here.

But this post isn’t about making charts. It turns out you can hijack the R visual to do lots of other things too. Below are a few examples:

Note: I am no R expert. The examples below are relatively simple and cobbled together from similar things online. They may be a little clunky, but worth it, in my opinion, to be able to dynamically leverage many more of the R capabilities through Power BI.

Read on for some interesting examples.

Building A Gantt Chart With Plotly

Ellen Talbot shows us how to embrace our inner micromanagers:

Something a little different today for a quick chat about my latest project and why I’m finding the plotly package so helpful!

Are you like me and physically can’t function unless you’ve got a to-do list in front of you? Well even if you’re not, imagine my pain while I’m wearing my non-Locke Data hat and trying to plan out the final year of my PhD thesis!

I needed something that updated easily, something visual and something to keep my supervisors in the know. I’ve previously made Gantt charts using LaTeX but found it ridiculously clunky to get working and decided there had to be a better way. And if I could include interactivity then all the better, which is how I discovered plotly.

Admittedly, I like Gantt charts more than almost any developer I’ve ever met. They always look so pretty and are wonderful depictions of a world which will never be.

Working With Strings In Base R

Jozef Hajnala shows us that you don’t need stringr to do cool things with strings in R:

This post aims to serve as an overview of functionality provided by base R to work with strings. Note that the term “string” is used somewhat loosely and refers to character vectors and character strings. In R documentation, references to character strings refer to character vectors of length 1.

Also since this is an overview, we will not examine the details of the functions, but rather list examples with simple, intuitive explanations trading off technical precision.

As much as I like the tidyverse for how approachable it makes R for data platform professionals, it is good to know the base libraries (and other alternatives) as well. H/T R-Bloggers
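For a quick taste of what base R offers, here is a short sketch of some common string operations. The `fruits` vector is made up for illustration and the selection of functions is mine, not necessarily the set covered in Jozef's post.

```r
# A few base R string functions on a made-up character vector.
fruits <- c("apple", "Banana", "cherry pie")

lengths_chr <- nchar(fruits)                 # characters per element
upper       <- toupper(fruits)               # upper-case conversion
prefixes    <- substr(fruits, 1, 3)          # first three characters
joined      <- paste(fruits, collapse = ", ")  # one string from many
has_an      <- grepl("an", fruits, fixed = TRUE)  # literal substring test
first_a     <- sub("a", "A", fruits)         # replace the first match only
every_a     <- gsub("a", "A", fruits)        # replace every match
words       <- strsplit(fruits[3], " ", fixed = TRUE)[[1]]  # split on space
labels      <- sprintf("%s has %d characters", fruits, lengths_chr)

print(lengths_chr)
print(prefixes)
print(joined)
print(every_a)
print(words)
print(labels)
```

All of these functions are vectorized over the character vector, so there is no need for an explicit loop; `strsplit()` returns a list, which is why the example extracts the first element with `[[1]]`.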
