Category: R

Using xplain To Interpret Model Results

Joachim Zuckarelli walks us through the xplain package in R:

The above XML produces the following output (don’t worry too much about the call of xplain(), we will discuss later on in more detail how to work with the xplain() function):

library(car)
library(xplain)
xplain(call="lm(education ~ young + income + urban, data=Anscombe)", 
xml="http://www.zuckarelli.de/xplain/example_lm_foreach.xml")

##
## Call:
## lm(formula = education ~ young + income + urban, data = Anscombe)
##
## Coefficients:
## (Intercept)        young       income        urban
##  -286.83876      0.81734      0.08065     -0.10581
##
##
## Interpreting the coefficients
## -----------------------------
## Your coefficient ‘(Intercept)’ is smaller than zero.
##
## Your coefficient ‘young’ is larger than zero. This means that the
## value of your dependent variable ‘education’ changes by 0.82 for
## any increase of 1 in your independent variable ‘young’.
##
## Your coefficient ‘income’ is larger than zero. This means that the
## value of your dependent variable ‘education’ changes by 0.081 for
## any increase of 1 in your independent variable ‘income’.
##
## Your coefficient ‘urban’ is smaller than zero. This means that the
## value of your dependent variable ‘education’ changes by -0.11 for
## any increase of 1 in your independent variable ‘urban’.

I’ll be interested in looking at this in more detail, though my first impression is that it’ll be most useful in large shops where different teams create and use models.
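
To make the idea concrete, here is a minimal base-R sketch of the kind of text xplain attaches to a model. It does not use xplain or its XML templates: the helper explain_coefs() is hypothetical, and it uses the built-in mtcars data rather than Anscombe so that it runs without the car package.

```r
# Hypothetical helper (not part of xplain): turn lm coefficients into sentences.
explain_coefs <- function(model) {
  coefs <- coef(model)
  response <- all.vars(formula(model))[1]  # name of the dependent variable
  vapply(names(coefs), function(term) {
    # Simplification: a coefficient of exactly zero is reported as "larger".
    direction <- if (coefs[[term]] < 0) "smaller" else "larger"
    msg <- sprintf("Your coefficient '%s' is %s than zero.", term, direction)
    if (term != "(Intercept)") {
      msg <- paste0(msg, sprintf(
        " The fitted value of '%s' changes by %s for an increase of 1 in '%s'.",
        response, signif(coefs[[term]], 2), term))
    }
    msg
  }, character(1))
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
explanations <- explain_coefs(fit)
writeLines(explanations)
```

The appeal of xplain is that this text lives in an XML file next to the model code rather than in a helper like this one, so the explanations can be maintained separately from the analysis.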

Comments closed

Sentiment Analysis Of Hotel California

Sara Locatelli analyzes the lyrics to Hotel California using tidytext:

Sentiment analysis is a method of natural language processing that involves classifying words in a document based on whether a word is positive or negative, or whether it is related to a set of basic human emotions; the exact results differ based on the sentiment analysis method selected. The tidytext R package has 4 different sentiment analysis methods:

  • “AFINN” for Finn Årup Nielsen – which classifies words from -5 to +5 in terms of negative or positive valence
  • “bing” for Bing Liu and colleagues – which classifies words as either positive or negative
  • “loughran” for Loughran-McDonald – mostly for financial and nonfiction works, which classifies as positive or negative, as well as topics of uncertainty, litigious, modal, and constraining
  • “nrc” for the NRC lexicon – which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment

Sentiment analysis works on unigrams – single words – but you can aggregate across multiple words to look at sentiment across a text.

To demonstrate sentiment analysis, I’ll use one of my favorite songs: “Hotel California” by the Eagles.

Read the whole thing, though you can’t check out afterward.
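
As a rough illustration of unigram scoring and aggregation, here is a small base-R sketch. It does not use tidytext: the six-word lexicon and the two sentences are invented for the example, and score_line() is a hypothetical helper.

```r
# Toy lexicon, invented for illustration; real lexicons such as "bing" are far larger.
lexicon <- c(lovely = "positive", sweet = "positive", welcome = "positive",
             dark = "negative", prisoner = "negative", alone = "negative")

# Made-up sentences standing in for lines of a document.
text_lines <- c("The lobby was lovely and the staff gave a sweet welcome",
                "The hallway was dark and the guest felt alone")

# Score one line: split into single words, then count lexicon matches.
score_line <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  hits <- lexicon[words[words %in% names(lexicon)]]
  c(positive = sum(hits == "positive"), negative = sum(hits == "negative"))
}

# Aggregate across lines: one row per line, plus a net score.
scores <- t(vapply(text_lines, score_line, integer(2), USE.NAMES = FALSE))
scores <- cbind(scores, net = scores[, "positive"] - scores[, "negative"])
print(scores)
```

Each word is classified on its own, so negation ("not lovely") is invisible to this approach; the per-line totals are what give a picture of sentiment across a text.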


Digging Into The SQL Compute Context With R Services

Niels Berglund dives into how the SQL Compute Context works with R Services:

In the code above we use the RxInSqlServer() function to indicate we want to execute in a SQL context. The connectionString property defines where we execute, and the numTasks property sets the number of tasks (processes) to run for each computation. In Code Snippet 4 it is set to 1, which from a processing perspective should match what we do in Code Snippet 3. Before we execute the code in Code Snippet 4, we do what we did before we ran the code in Code Snippet 3:

  • Run Process Explorer as admin.
  • Navigate to the devenv.exe process in Process Explorer.
  • In addition, also look at the Launchpad.exe process in Process Explorer.

When we execute we see that the BxlServer.exe processes under the Microsoft.R.Host.exe processes are idling, but when we look at the Launchpad.exe process we see this:

This is a bit deep but interesting reading.


Creating Choropleths With ggcounty

Sebastian Sauer has a quick example of using ggcounty to plot data on a map of US counties:

This post shows how easy it can be to build a visually pleasing plot. We will use hrbrmstr’s ggcounty, which is an R package at this GitHub repo. The graphics engine is, as in most of my plots, Hadley Wickham’s ggplot. All built on R. Standing on shoulders…

Disclaimer: This example draws heavily on hrbrmstr’s example on this page. All credit is due to Rudy, and those on whose work he built upon.

In just a few lines of code, you can have a pretty nice map.


Using The Map Function In R

Nicolas Attalides on using purrr:

The best place to start when exploring the purrr package is the map function. The reader will notice that these functions are utilised in a very similar way to the apply family of functions. The subtle difference is that the purrr functions are consistent and the user can be assured of the output – as opposed to some cases when using for example sapply as I demonstrate later on.

My considered belief is, Always Be Purrring.  H/T R-bloggers
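
Here is a short sketch of the inconsistency in question. The base-R part runs anywhere; the purrr calls are guarded so they only run if the package is installed. The function square() is just a placeholder for illustration.

```r
square <- function(i) i^2

# sapply() picks its return type from the data it happens to see.
from_values <- sapply(1:3, square)         # simplified to a numeric vector
from_empty  <- sapply(integer(0), square)  # nothing to simplify, so a list
print(class(from_values))
print(class(from_empty))

# vapply() is the base-R way to pin the output type down.
typed_empty <- vapply(integer(0), square, numeric(1))
print(class(typed_empty))

# purrr's typed variants give the same guarantee (only run if purrr is installed).
if (requireNamespace("purrr", quietly = TRUE)) {
  purrr_values <- purrr::map_dbl(1:3, square)
  purrr_empty  <- purrr::map_dbl(integer(0), square)
  str(purrr_values)
  str(purrr_empty)
}
```

Code downstream of sapply() has to cope with a vector on some inputs and a list on others, whereas map_dbl() either returns a double vector or raises an error.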


Tuning xgboost Models In R

Gabriel Vasconcelos has a new series on tuning xgboost models:

My favourite Boosting package is xgboost, which will be used in all examples below. Before going to the data let’s talk about some of the parameters I believe to be the most important. These parameters are mostly used to control how closely the model may fit the data. We would like to have a fit that captures the structure of the data but only the real structure. In other words, we do not want the model to fit noise because this will translate into poor out-of-sample performance.

  • eta: Learning (or shrinkage) parameter. It controls how much information from a new tree will be used in the Boosting. This parameter must be bigger than 0 and limited to 1. If it is close to zero we will use only a small piece of information from each new tree. If we set eta to 1 we will use all information from the new tree. Big values of eta result in a faster convergence and more over-fitting problems. Small values may need too many trees to converge.

  • colsample_bylevel: Just like Random Forests, sometimes it is good to look only at a few variables to grow each new node in a tree. If we look at all variables the algorithm needs fewer trees to converge, but looking at, for example, 2/3 of the variables may result in models more robust to over-fitting. There is a similar parameter called colsample_bytree that re-samples the variables in each new tree instead of each new node.

Read the whole thing.  H/T R-bloggers
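
To see what the eta parameter does, here is a toy boosting loop in base R. This is not xgboost and makes no claim to match its internals: it fits one-split regression stumps to residuals on simulated data, and fit_stump(), predict_stump() and boost() are hypothetical helpers written for this sketch.

```r
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# Fit a one-split regression stump to the residuals r.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbours
  sse <- vapply(splits, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  s <- splits[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

# Boosting: each round adds eta times a new stump fitted to current residuals.
boost <- function(x, y, eta, nrounds) {
  pred <- rep(mean(y), length(y))
  for (i in seq_len(nrounds)) {
    stump <- fit_stump(x, y - pred)
    pred <- pred + eta * predict_stump(stump, x)
  }
  pred
}

mse <- function(pred) mean((y - pred)^2)
mse_baseline <- mse(rep(mean(y), length(y)))
mse_slow <- mse(boost(x, y, eta = 0.1, nrounds = 10))
mse_fast <- mse(boost(x, y, eta = 1, nrounds = 10))
print(c(baseline = mse_baseline, eta_0.1 = mse_slow, eta_1 = mse_fast))
```

With the same number of rounds, the larger eta drives the training error down faster. That speed is what makes it more prone to fitting noise, which this sketch does not measure because it only looks at training error.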


Converting Between Time Series Classes In R

Christoph Sax announces a new R library:

tsbox, now freshly on CRAN, provides a set of tools that are agnostic towards existing time series classes. It is built around a set of converters, which convert time series stored as ts, xts, data.frame, data.table, tibble, zoo, tsibble or timeSeries to each other.

If you have to work with time series data, this will be a useful library.  H/T R-Bloggers
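
A minimal sketch of a round trip between two of those classes, assuming tsbox is installed (the calls are guarded so the script still runs without it). It uses the AirPassengers series that ships with R.

```r
x <- AirPassengers  # a base-R "ts" object shipped with R

if (requireNamespace("tsbox", quietly = TRUE)) {
  df <- tsbox::ts_df(x)     # ts -> data.frame
  back <- tsbox::ts_ts(df)  # data.frame -> ts
  str(df)
  print(class(back))
}
```

The converters follow a ts_ plus target class naming pattern, so the same series can be handed to whichever package expects xts, zoo or a tibble.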


Power BI Custom Visuals In Excel

David Smith notes that Excel is getting a bit of an upgrade:

This week at the BUILD conference, Microsoft announced that Power BI custom visuals will soon be available as charts with Excel. You’ll be able to choose a range of data within an Excel workbook, and pass those data to one of the built-in Power BI custom visuals, or one you’ve created yourself using the API.

David’s point is that you can bring in R charts, but it extends to more than that.


Building Flow Charts In R

Alan Haynes shows how to build flow charts in R using the grid and Gmisc packages:

Flow charts are an important part of a clinical trial report. Making them can be a pain though. One good way to do it seems to be with the grid and Gmisc packages in R. X and Y coordinates can be designated based on the center of the boxes in normalized device coordinates (proportions of the device space – 0.5 is the middle) which saves a lot of messing around with corners of boxes and arrows.

A very basic flow chart, based very roughly on the CONSORT version, can be generated as follows…

Click through for sample code and a resulting image.  H/T R-bloggers
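
The idea of placing boxes by their centers in normalized device coordinates can be sketched with the grid package alone, which ships with R. This is not the Gmisc approach from the post: draw_box(), draw_arrow() and draw_flow_chart() are hypothetical helpers, and the box labels are placeholders.

```r
library(grid)

# When run as a script, draw to a null device instead of writing Rplots.pdf.
if (!interactive()) pdf(NULL)

# Draw a labelled box centred at (x, y), given in normalized device coordinates.
draw_box <- function(label, x, y, width = 0.3, height = 0.12) {
  grid.rect(x = unit(x, "npc"), y = unit(y, "npc"),
            width = unit(width, "npc"), height = unit(height, "npc"))
  grid.text(label, x = unit(x, "npc"), y = unit(y, "npc"))
  # Return the edges an arrow needs, so callers never compute corners.
  invisible(list(x = x, top = y + height / 2, bottom = y - height / 2))
}

# Draw a straight arrow from the bottom edge of one box to the top of the next.
draw_arrow <- function(from, to) {
  grid.lines(x = unit(c(from$x, to$x), "npc"),
             y = unit(c(from$bottom, to$top), "npc"),
             arrow = arrow(length = unit(3, "mm")))
}

draw_flow_chart <- function() {
  grid.newpage()
  assessed   <- draw_box("Assessed for eligibility", 0.5, 0.85)
  randomised <- draw_box("Randomised", 0.5, 0.55)
  analysed   <- draw_box("Analysed", 0.5, 0.25)
  draw_arrow(assessed, randomised)
  draw_arrow(randomised, analysed)
  invisible(list(assessed, randomised, analysed))
}

boxes <- draw_flow_chart()
```

Because every box is positioned by its center, moving a box means changing one pair of numbers, and the arrows follow.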


Building Palettes From Pictures In R

Andrea Cirillo takes inspiration from the great works to build palettes:

If you see this painting you will find a profusion of colours with a great equilibrium between different hues, the bold usage of complementary colours and the ability expressed in the “chiaroscuro” technique. While I was looking at the painting I started wondering how we moved from this wisdom to the ugly charts you can easily find within today’s corporate reports (find a great sample on the WTF visualization website).

This is where paletter comes from: to bring Renaissance wisdom and beauty to the plots we produce every day.

Introducing paletter

PaletteR is a lean R package which lets you draw from any custom image an optimized palette of colours. The package extracts a custom number of representative colours from the image. Let’s try to apply it on the “Vergine con il Bambino, angeli e Santi” before looking into its functional specification.

It’s an interesting package.  I’ll have to play around with it.
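
One plausible way to pull representative colours out of an image is to cluster its pixels, sketched below in base R. This is an assumption for illustration, not a description of how paletter works internally: the pixel data is simulated rather than read from a painting, and extract_palette() is a hypothetical helper.

```r
set.seed(42)

# Stand-in for image data: 300 synthetic pixels scattered around three base colours.
# A real image would be read with an image package and reshaped to one row per pixel.
base_colours <- rbind(c(0.8, 0.1, 0.1), c(0.1, 0.3, 0.7), c(0.9, 0.8, 0.3))
pixels <- base_colours[sample(1:3, 300, replace = TRUE), ] +
  matrix(rnorm(900, sd = 0.03), ncol = 3)
pixels <- pmin(pmax(pixels, 0), 1)  # keep every channel inside [0, 1]
colnames(pixels) <- c("r", "g", "b")

# Hypothetical helper: cluster the pixels and return the centres as hex colours.
extract_palette <- function(pixels, n_colours) {
  km <- kmeans(pixels, centers = n_colours, nstart = 10)
  rgb(km$centers[, 1], km$centers[, 2], km$centers[, 3])
}

palette_hex <- extract_palette(pixels, 3)
print(palette_hex)
```

The resulting hex codes can be passed to any plotting function that accepts a vector of colours, such as the col argument of barplot().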
