Press "Enter" to skip to content

Category: R

Merging Datasets in R with the Tidyverse

Anisa Dhana shows off several tidyverse methods for combining data sets together:

semi_join
The semi_join function is different than the previous examples of joins. A semi join creates a new dataset in which there are all rows from the data1 where there is a corresponding matching value in data2. Still, instead of the final dataset merging both the first (data1) and second (data2) datasets, it only contains the variables from the first one (data1).

Most of this looks like standard SQL joins, but read through to the end for a bonus which doesn’t typically appear in relational database products.

Comments closed

Mocking Objects with R

The R-hub blog has an interesting post on creating mocks in R for unit testing:

In some of these cases, the programming concept you’re after is mocking, i.e. making a function act as if something were a certain way! In this blog post we shall offer a round-up of resources around mocking, or not mocking, when unit testing an R package.

It’s interesting watching data scientists work through the same sorts of problems which traditional developers have hit, whether that be testing, deployment, or source control management. H/T R-bloggers

Comments closed

Plotting Three-Dimensional Linear Models

Sebastian Sauer shows a few techniques for visualizing linear models with two predictors:

Linear models are a standard way of predicting or explaining some data. Visualizing data is not only of didactical value but provides heuristical value too, as demonstrated by Anscombe’s Quartet.

Visualizing linear models in 2D is straightforward, but visualizing linear models with more than one predictor is much less so. The aim of this post is to demonstrate some ways do visualize linear models with more than one predictor, using popular R packages. We will focus on 3D examples, that is, two predictors.

I have a strong bias against 3D visuals because they tend to be so difficult to see clearly. There are times when they’re necessary, though.

Comments closed

Re-Introducing rquery

John Mount has a new introduction to rquery:

rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of R’s base::transform(), or dplyr’s dplyr::mutate() and uses a pipe in the style popularized in R with magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.

The R/rquery version of this introduction is here, and the Python/data_algebra version of this introduction is here.

Check it out.

Comments closed

rBokeh Tips for Missing Arguments

Matthias Nistler walks through troubleshooting rBokeh missing argument errors:

This approach is my go-to solution to change a rBokeh plot for which there is an argument missing in rBokeh that is available in python.
– Create the plot.
– Inspect the structure (str(plot)) of the rBokeh object.
– Search for the python’s argument name.
– Overwrite the value with the desired option as derived from python’s bokeh.

Given how nice the bokeh package looks, I really want rBokeh to work well. Hopefully this experience improves over time.

Comments closed

Using Schemas with DBI and SQL Server

Thomas Roh takes us through an oddity in R’s DBI library:

I ran into an issue the other day where I was tring to write a new table to a SQL Server Database with a non-default schema. I did end up spending a bit of time debugging and researching so I wanted to share for anyone else that runs into the issue. Using the DBI::Id function, allows you to specify the schema when you are trying to write a table to a SQL Server database.

Click through for the end result. I will say that the more I work with DBI, the more I’m tempted to keep using rodbc, at least when working with SQL Server. H/T R-Bloggers.

Comments closed

Building Custom R Packages

Brad Lindblad takes us through building a custom package in R:

Don’t repeat yourself (DRY) is a well-known maxim in software development, and most R programmers follow this rule and build functions to avoid duplicating code. But how often do you:
– Reference the same dataset in different analyses
– Create the same ODBC connection to a database
– Tinker with the same colors and themes in ggplot
– Produce markdown docs from the same template

and so on? Notice a pattern? The word “same” is sprinkled in each bullet point. I smell an opportunity to apply DRY!

This is a good point: packages don’t have to go out to the broader world. They’re useful even if they just help you (or your team) out. H/T R-bloggers

Comments closed

Evaluating a Classification Model with a Spam Filter

John Mount shares an extract from Mount and Nina Zumel’s Practical Data Science with R, 2nd Edition:

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.

Click through for that extract. I liked the first edition of the book, so I’m looking forward to the 2nd.

Comments closed

Fun with [ as a Function in R

John Mount definitely quotes Dr. Ian Malcolm correctly in this one:

How about defining a new [-based function call notation? The ideas is: we could write sin[5] in place of sin(5), thus unifying the notations for function call and array access. Some languages do in fact have unified function call and array access (though often using “(” for both). Examples languages include Fortran and Matlab.

Let’s add R to the list of such languages.

I love the flexibility in the language, almost as much as I would enjoy taking away production rights from the person who ships this in my code base…

Comments closed