Category: R

Gradient Boosting And XGBoost

Shirin Glander has another English-language transcript from a German video, this time covering gradient boosting techniques:

Let’s look at how Gradient Boosting works. Most of the magic is described in the name: “Gradient” plus “Boosting”.

Boosting builds models from individual so-called “weak learners” in an iterative way. In the Random Forests part, I had already discussed the differences between Bagging and Boosting as tree ensemble methods. In boosting, the individual models are not built on completely random subsets of data and features but sequentially by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances which are hard to predict correctly (“difficult” cases) will be focused on during learning, so that the model learns from past mistakes. When we train each ensemble on a subset of the training set, we also call this Stochastic Gradient Boosting, which can help improve generalizability of our model.

The gradient is used to minimize a loss function, similar to how Neural Nets utilize gradient descent to optimize (“learn”) weights. In each round of training, the weak learner is built and its predictions are compared to the correct outcome that we expect. The distance between prediction and truth represents the error rate of our model. These errors can now be used to calculate the gradient. The gradient is nothing fancy, it is basically the partial derivative of our loss function – so it describes the steepness of our error function. The gradient can be used to find the direction in which to change the model parameters in order to (maximally) reduce the error in the next round of training by “descending the gradient”.

Along with neural networks, gradient boosting has become one of the dominant algorithms for machine learning, and is well worth learning about.
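To make the "fit a weak learner, measure the error, step down the gradient" loop concrete, here is a minimal base R sketch. It is not the code from the video or from XGBoost: the toy data, the helper names `fit_stump` and `predict_stump`, and the settings `n_rounds` and `learning_rate` are all assumptions chosen for illustration. It uses squared-error loss, for which the negative gradient with respect to the current predictions is simply the residual.

```r
# Toy gradient boosting with squared-error loss and one-split "stumps" as weak learners.
# All names and settings here are illustrative assumptions, not a library API.
set.seed(42)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# Fit a regression stump: pick the split on x that minimizes the squared error of r.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbouring x values
  sse <- vapply(splits, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  s <- splits[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

n_rounds <- 100
learning_rate <- 0.1
pred <- rep(mean(y), length(y))  # start from a constant model
mse <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {
  residual <- y - pred                 # negative gradient of 0.5 * (y - pred)^2
  stump <- fit_stump(x, residual)      # weak learner fitted to the gradient
  pred <- pred + learning_rate * predict_stump(stump, x)
  mse[m] <- mean((y - pred)^2)
}

initial_mse <- mean((y - mean(y))^2)
final_mse <- mse[n_rounds]
```

Comparing `initial_mse` with `final_mse`, or plotting `mse`, shows the training error shrinking round by round. A production library adds regularization, subsampling, and deeper trees on top of this basic loop.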

Visualizing Traditional Japanese Color Palettes

Chisato den Engelsen looks at 465 traditional color palettes used in Japan:

Since each of the colours had a name, I was also curious whether some characters are used more often than others. The colour names were written in two ways on this website: one in Kanji and the other in Hiragana.

I love wordcloud2 for visualizing word clouds, so I can see which characters appear more often than the others.

It’s an interesting exercise and all of the R code is included.  Be sure to check out the list of colors with a character representing “rat” or “mouse” in the name.  H/T R-Bloggers
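As a rough illustration of the character-counting step (not the code from the post), the base R sketch below splits a handful of colour names into single characters and tallies them. The five names are stand-ins picked for the example, written as Unicode escapes so the script does not depend on file encoding, and `freq_df` is a hypothetical name. As far as I recall, wordcloud2 accepts a data frame with a word column and a frequency column, so the result is shaped that way.

```r
# Stand-in colour names (assumed for illustration), written as Unicode escapes:
# sakura-iro, nezumi-iro, ai-nezumi, sakura-nezumi, wakakusa-iro
colour_names <- c(
  "\u685c\u8272",
  "\u9f20\u8272",
  "\u85cd\u9f20",
  "\u685c\u9f20",
  "\u82e5\u8349\u8272"
)

# Split every name into individual characters and count how often each occurs.
chars <- unlist(strsplit(colour_names, split = ""))
char_freq <- sort(table(chars), decreasing = TRUE)

# Two-column data frame: the character and its frequency.
freq_df <- data.frame(
  word = names(char_freq),
  freq = as.integer(char_freq),
  stringsAsFactors = FALSE
)
```

In this small sample, the characters for "colour" and for "rat/mouse" each appear three times, which is the kind of pattern the word cloud makes visible at a glance.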

Azure SQL Database Supports R Integration

David Smith notes that Azure SQL Database now has (in preview) support for R:

Azure SQL Database, the database-as-a-service based on Microsoft SQL Server, now offers R integration. (The service is currently in preview; details on how to sign up for the preview are provided in that link.) While you’ve been able to run R in SQL Server in the cloud since the release of SQL Server 2016 by running a virtual machine, Azure SQL Database is a fully-managed instance that doesn’t require you to set up and maintain the underlying infrastructure. You just choose the size and scale of the database you want to manage, and then connect to it like any other SQL Server instance. (If you want to learn how to set up an Azure SQL database, this Microsoft Learn module is a good place to start.)

Python and Java are not yet supported, but I’d imagine that they’ll be on the way too.

Using R In Power BI For More Than Displaying Visuals

Patrick Mahoney shows us that you can do more with the R Visual component in Power BI than display visuals:

If you really like a certain R visual, you can also package it as a pbiviz file to share with others. Once you set up the foundation to create the first pbiviz, it is easy to crank out many more just by replacing the R code and repackaging it (into a different pbiviz file). See instruction here.

But this post isn’t about making charts. It turns out you can hijack the R visual to do lots of other things too. Below are a few examples:

Note: I am no R expert. The examples below are relatively simple and cobbled together from similar things online.  They may be a little clunky, but worth it, in my opinion, to be able to dynamically leverage many more of the R capabilities through Power BI.

Read on for some interesting examples.

Building A Gantt Chart With Plotly

Ellen Talbot shows us how to embrace our inner micromanagers:

Something a little different today for a quick chat about my latest project and why I’m finding the plotly package so helpful!

Are you like me and physically can’t function unless you’ve got a to do list in front of you? Well even if you’re not, imagine my pain while I’m wearing my non-Locke Data hat and trying to plan out the final year of my PhD thesis!

I needed something that updated easily, something visual and something to keep my supervisors in the know. I’ve previously made gantt charts using LaTeX but found it ridiculously clunky to get working and decided there had to be a better way. And if I could include interactivity then all the better, which is how I discovered plotly.

Admittedly, I like gantt charts more than almost any developer I’ve ever met.  They always look so pretty and are wonderful depictions of a world which will never be.

Working With Strings In Base R

Jozef Hajnala shows us that you don’t need stringr to do cool things with strings in R:

This post is aimed to serve as an overview of functionality provided by base R to work with strings. Note that the term “string” is used somewhat loosely and refers to character vectors and character strings. In R documentation, references to “character string” refer to character vectors of length 1.

Also since this is an overview, we will not examine the details of the functions, but rather list examples with simple, intuitive explanations trading off technical precision.

As much as I like the tidyverse for its data platform professional-friendly approach to R, it is good to know the base libraries (and other alternatives) as well.  H/T R-Bloggers
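For a flavor of what base R offers here, the sketch below runs a few common string functions over a small character vector. The sample vector and variable names are made up for illustration and are not taken from the post.

```r
# A small character vector to work with (sample values assumed for illustration).
x <- c("Gradient Boosting", "Random Forests", "apply functions")

lengths_chr <- nchar(x)                       # number of characters per element
upper <- toupper(x)                           # upper-case conversion
first8 <- substr(x, 1, 8)                     # characters 1 through 8
first_o <- sub("o", "0", x)                   # replace only the first match
all_o <- gsub("o", "0", x)                    # replace every match
has_forest <- grepl("Forest", x, fixed = TRUE)  # literal substring test
words <- strsplit(x, " ", fixed = TRUE)       # list with one character vector per element
joined <- paste(x, collapse = "; ")           # collapse into a single string
report <- sprintf("%s has %d characters", x, nchar(x))
trimmed <- trimws("  padded  ")               # strip leading and trailing whitespace
```

All of these are vectorized over `x`, which is why the "string" in question is really a character vector: each call returns one result per element, except `paste(..., collapse = )`, which folds the vector into a single string.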

Quick Geospatial Data Plots In R And Python

Harry McLellan shows us how we can use R and Python to generate quick-and-dirty plots of geospatial data:

Now R has some useful packages like ggmap, mapdata and ggplot2 which allow you to source your map satellite images directly from Google Maps, but this does require a free Google API key to source from the cloud. These packages can also plot the map around the data as I am currently trimming the map to fit the data. But for a fair test I also used a simplistic pre-built map in R. This was from the package rworldmap, which allows plotting at a country level with defined borders. Axes can be scaled to act like a zoom function but without a higher-resolution map or raster satellite image map it is pointless to go past a country level.

There’s a lot more you can do with both languages, but when you just want a plot in a few lines of code, both are up to the task.

Dealing With Zero-Value Rows In dplyr

Kieran Healy shows an oddity in dplyr when dealing with zero-value records:

That looks fine. You can see in each panel the 2015 column is 100% Men. If we were working on this a bit longer we’d polish up the x-axis so that the dates were centered under the columns. But as an exploratory plot it’s fine.

But let’s say that, instead of a column plot, you looked at a line plot. This would be a natural thing to do given that time is on the x-axis and so you’re looking at a trend, albeit one over a small number of years.

This is behavior I hadn’t run into, and it does seem a bit odd.  On a totally unrelated note, Healy’s Data Visualization: A Practical Introduction is one of the best books on the topic.

Running R Scripts In Power BI’s Query Editor

Brad Lewellyn walks us through the process of executing an R script against a table in Power Query:

If you aren’t able to open the R Script Editor, check out our previous post, Getting Started with R Scripts.  While it’s possible to develop and test code using the built-in R Script Editor, it’s not great.  Unfortunately, there doesn’t seem to be a way to develop this script using an external IDE like RStudio.  So, we typically export files to csv for development in RStudio.  This is obviously not optimal and should be done with caution when data is extremely large or sensitive in some way.  Fortunately, the write.csv() function is pretty easy to use.  You can read more about it here.

It’s not a perfect experience, but Brad does show us how to get it done.
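The export step can be sketched in a few lines of R. This is an assumption-laden illustration rather than Brad's script: the `dataset` data frame below is a stand-in for the table Power BI passes to the R script (as far as I know it uses that name by default), and the output path uses a temporary directory so the example runs anywhere.

```r
# Stand-in for the data frame Power BI hands to the R script (values are made up).
dataset <- data.frame(id = 1:3, amount = c(10.5, 20, 7.25))

# Hypothetical output location; in practice you would point this at a folder you control.
out_path <- file.path(tempdir(), "dataset_export.csv")

# row.names = FALSE keeps R's row numbers out of the file.
write.csv(dataset, file = out_path, row.names = FALSE)

# Read the file back, as you would in RStudio, to develop against the same data.
round_trip <- read.csv(out_path)
```

As the quoted passage warns, writing data to disk this way deserves caution with very large or sensitive tables.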

The Lesser-Known Apply Functions In R

Andrew Treadway covers a few of the lesser-known apply functions in R:


Let’s start with rapply. This function has a couple of different purposes. One is to recursively apply a function to a list. We’ll get to that in a moment. The other use of rapply is to apply a function to only those elements in a list (or columns in a data frame) that belong to a specified class. For example, let’s say we have a data frame with a mix of categorical and numeric variables, but we want to evaluate a function only on the numeric variables.

Click through for some examples of rapply as well as vapply and eapply.  I’ve used rapply to get cardinality of each feature in a data frame but the other two are new to me.  H/T R-bloggers
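Both uses of rapply described above fit in a short base R sketch. The data frame, the nested list, and the variable names are made up for illustration and are not Andrew's examples.

```r
# A data frame mixing a character column with numeric columns (sample data assumed).
df <- data.frame(
  name = c("a", "b", "c"),
  height = c(1.5, 2.5, 3.5),
  weight = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Apply mean() only to columns of class "numeric"; other columns are dropped.
numeric_means <- rapply(df, mean, classes = "numeric", how = "unlist")

# Cardinality (number of distinct values) of every column, whatever its class.
cardinality <- rapply(df, function(col) length(unique(col)), how = "unlist")

# Recursive use: sum every integer vector in a nested list, skipping the character element.
nested <- list(a = 1:3, b = list(c = 4:6, d = "x"))
nested_sums <- rapply(nested, sum, classes = "integer", how = "unlist")
```

With `how = "unlist"` and the default `deflt`, elements that do not match `classes` simply disappear from the result, which is what makes the class filter convenient on mixed data frames.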

