Press "Enter" to skip to content

Category: R

Lists In R

Dave Mason continues his process of learning about R data structures with a survey of lists:

In previous lessons, we’ve noted vectors and matrices consist of data elements of the same class. R will coerce data elements to a single class if we attempt to create a vector or matrix with data elements of differing classes. Lists, on the other hand, can hold data elements of different classes, such as the integer, character, or logical class. In fact, a list can hold most anything in R, including vectors, matrices, and many more! None to my surprise, lists can be created with the list() function:

And if you want to work with lists, purrr is a great package to learn.

Comments closed

Values Belong In Columns

John Mount argues that to reduce ambiguity, ensure that your values are columns on appropriate data frames:

Here is an (artificial) example.

chamber_sizes <- mtcars$disp/mtcars$cyl
form <- hp ~ chamber_sizes
model <- lm(form, data = mtcars)
print(model)
# Call:
# lm(formula = form, data = mtcars)
#
# Coefficients:
#   (Intercept)  chamber_sizes  
#         2.937          4.104  

Notice: one of the variables came from a vector in the environment, not from the primary data.framechamber_sizes was first looked for in the data.frame, and then in the environment the formula was defined (which happens to be the global environment), and (if that hadn’t worked) in the executing environment (which is again the global environment).

Our advice is: do not do that. Place all of your values in columns. Make it unambiguous all variables are names of columns in your data.frame of interest. This allows you to write simple code that works over explicit data. The style we recommend looks like the following.

Read the whole thing.

Comments closed

Using R With Excel

David Smith walks us through various ways to integrate R and Excel:

If you’re familiar with analyzing data in Excel and want to learn how to work with the same data in R, Alyssa Columbus has put together a very useful guide: How To Use R With Excel. In addition to providing you with a guide for installing and setting up R and the RStudio IDE, it provide a wealth of useful tips for working with Excel data in R, including:

  • To import Excel data into R, use the readxl package

  • To export Excel data from R, use the openxlsx package

  • How to remove symbols like “$” and “%” from currency and percentage columns in Excel, and convert them to numeric variables suitable for analysis in R

  • How to do computations on variables in R, and a list of common Excel functions (like RAND and VLOOKUP) with their R equivalents

  • How to emulate common Excel chart types (like histograms and line plots) using R plotting functions

David also shows how to run R within Excel.  One of the big benefits of readxl is that it doesn’t require Java; most other Excel readers do.

Comments closed

Factors In R

Dave Mason continues his look at R, this time covering the concept of factors:

Factor data can be nominal or ordinal. In our examples so far, it is nominal. “C”, “G”, and “F” (and “Center”, “Guard”, and “Forward” for that matter) are names that have no comparative order to each other. It’s not meaningful to say a Center is greater than a Forward or a Forward is less than a Guard (keep in mind these are position names–don’t let height cloud your thinking). If we try making a comparison, we get a warning message:

> position_factor[1] > position_factor[2]
[1] NA
Warning message:
In Ops.factor(position_factor[1], position_factor[2]) :
  ‘>’ not meaningful for factors

Ordinal data, on the other hand, can be compared to each other in some ranked fashion–it has order. Take bed sizes, for instance. A “Twin” bed is smaller than a “Full”, which is smaller than a “Queen”, which is smaller than a “King”. To create a factor with ordered (ranked) levels, use the ordered parameter, which is a logical flag to indicate if the levels should be regarded as ordered (in the order given).

Check it out.

Comments closed

Examples Of Charts In Different Languages

David Smith points out a great repository of information on generating different types of charts in different libraries:

The visualization tools include applications like Excel, Power BI and Tableau; languages and libraries including R, Stata, and Python’s matplotlib); and frameworks like D3. The data visualizations range from the standard to the esoteric, and follow the taxonomy of the book Data Visualisation (also by Andy Kirk). The chart categories are color coded by row: categorical (including bar charts, dot plots); hierarchical (donut charts, treemaps); relational (scatterplots, sankey diagrams); temporal (line charts, stream graphs) and spatial (choropleths, cartograms).

Check out the Chartmaker Directory.

Comments closed

More On Radix Sorting In R

Inaki Ucar explains some of the nuance behind sorting in R:

The latest R tip in Win-Vector Blog encourages you to Use Radix Sort based on a simple benchmark showing a x35 speedup compared to the default method, but with no further explanation. In my opinion, though, the complete tip would be, instead, use radix sort… if you know what you are doing, because a quick benchmark shouldn’t spare you the effort of actually reading the docs. And here is a spoiler: you are already using it.

One may wonder why R’s default sorting algorithm is so bad, and why was even chosen. The thing is that there is a trick here, and to understand it, first we must understand the benchmark’s data and then read the docs.

Read the whole thing.

Comments closed

Your R Code Should Be In Source Control Too

Lindsay Carr explains the importance of storing your R code in source control:

But wait, I would need to learn an additional tool?

Yes, but don’t panic! Git is a tool with various commands that you can use to help track your changes. Luckily, you don’t need to know too many commands in Git to use the basic functionality. As an added bonus, using Git with RStudio takes away some of the burden of knowing Git commands by including buttons for common actions.

As with any tool that you pick up to help your scientific workflows, there is some upfront work before you can start seeing the benefits. Don’t let that deter you. Git can be very easy once you get the gist. Think about the benefits of being able to track changes: you can make some changes, have a record of that change and who made it, and you can tie that change to a specific problem that was reported or feature request that was noted.

It’s still code, and you gain a lot by keeping code in source control.

Comments closed

Using The glue Package In R

Evgeni Chasnovski shows the glue package and also works around some trickiness with NULL:

Recently, fate lead me to try using {glue} in a package. I was very pleased to how it makes code more readable, which I believe is a very important during package development. However, I stumbled upon this pretty unexpected behavior:

y <- NULL
paste("I have", x, "apples and", y, "oranges.")
## [1] "I have 10 apples and  oranges."
str(glue("I have {x} apples and {y} oranges."))
## Classes 'glue', 'character'  chr(0)

If one of the expressions is evaluated into NULL then the output becomes empty string.

glue reminds me of string formatting in .NET languages.  On the whole, that’s a good thing.

Comments closed

Solving Linear Optimization Problems In R

Mic walks us through a linear optimization problem and solves it with the lpSolve package:

I’m going to implement in R an example of linear optimization that I found in the book “Modeling and Solving Linear Programming with R” by Jose M. Sallan, Oriol Lordan and Vincenc Fernandez.  The example is named “Production of two models of chairs” and can be found at page 57, section 3.5. I’m going to solve only the first point.

The problem text is the following

A company produces two models of chairs: 4P and 3P. The model 4P needs 4 legs, 1 seat and 1 back. On the other hand, the model 3P needs 3 legs and 1 seat. The company has a initial stock of 200 legs, 500 seats and 100 backs. If the company needs more legs, seats and backs, it can buy standard wood blocks, whose cost is 80 euro per block. The company can produce 10 seats, 20 legs and 2 backs from a standard wood block. The cost of producing the model 4P is 30 euro/chair, meanwhile the cost of the model 3P is 40 euro/chair. Finally, the company informs that the minimum number of chairs to produce is 1000 units per month. Define a linear programming model, which minimizes the total cost (the production costs of the two chairs, plus the buying of new wood blocks).

I remember solving this exact problem (down to the four legs versus three legs bit) in grad school.  We used LINGO to do this, though I haven’t seen that language since.  H/T R-Bloggers

Comments closed