Category: R

Dealing With Heteroskedasticity

Bruno Rodrigues explains the notion of heteroskedasticity and shows ways of dealing with this issue in a linear regression:

This test shows that we can reject the null that the variance of the residuals is constant, thus heteroskedasticity is present. To get the correct standard errors, we can use the vcovHC() function from the {sandwich} package (hence the choice for the header picture of this post):

lmfit %>% vcovHC() %>% diag() %>% sqrt()
## (Intercept) regionnortheast regionsouth regionwest
## 311.31088691 25.30778221 23.56106307 24.12258706
## residents young_residents per_capita_income
## 0.09184368 0.68829667 0.02999882

By default vcovHC() estimates a heteroskedasticity-consistent (HC) variance-covariance matrix for the parameters. There are several ways to estimate such an HC matrix, and by default vcovHC() estimates the “HC3” one. You can refer to Zeileis (2004) for more details.

We see that the standard errors are much larger than before! The intercept and regionwest variables are not statistically significant anymore.

The biggest problem with heteroskedasticity is that it biases the estimated standard errors, even though the coefficient estimates themselves stay unbiased. That’s not the end of the world, but if the level of heteroskedasticity is serious enough, we want to find ways to account for it. H/T R-Bloggers.
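To make the HC3 idea a bit more concrete, here is a minimal base R sketch of the sandwich calculation. It is not Bruno's code: the built-in mtcars data and the model formula are stand-ins chosen only for illustration, and the matrix algebra is my reading of the HC3 formula, so treat it as a rough guide rather than a replacement for vcovHC().

```r
# Illustrative model on a built-in dataset (an assumption; the original post
# uses a different dataset and a different set of regressors).
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)   # design matrix
e <- residuals(fit)      # OLS residuals
h <- hatvalues(fit)      # leverage of each observation

# Sandwich estimator: bread %*% meat %*% bread
bread <- solve(crossprod(X))            # (X'X)^-1
meat  <- crossprod(X * (e / (1 - h)))   # X' diag(e^2 / (1 - h)^2) X  (HC3 weights)
vcov_hc3 <- bread %*% meat %*% bread

# Compare the classical standard errors with the robust ones
se_classic <- sqrt(diag(vcov(fit)))
se_hc3     <- sqrt(diag(vcov_hc3))
print(cbind(classic = se_classic, HC3 = se_hc3))
```

If the two columns differ a lot, that is a hint the constant-variance assumption is doing real work in the classical standard errors.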

Building A Gantt Chart With ggplot2

Sebastian Sauer shows us how to build a Gantt chart in R:

Of importance are only TaskPrevious_Evnet and Duration. In addition, we need an overall start date (“2019-03-01” in this case). Each subsequent task is assumed to follow neatly its predecessing event.

Our job is to compute the start date and end date of each task given that we know the initial start date and the durations. As said, this procedure is based on the assumption that there is a frictionless and gapless sequence of tasks.

Read on for a code-heavy example. I’ve always had a soft spot in my heart for Gantt charts.
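The date arithmetic described above can be sketched in a few lines of base R. This is not Sebastian's code: the task names and durations below are made up for illustration, and only the overall start date comes from the post. The gapless assumption means each task starts exactly when the previous one ends.

```r
# Hypothetical task list; durations are in days
tasks <- data.frame(
  Task     = c("Plan", "Build", "Test"),
  Duration = c(5, 10, 3)
)

# Overall start date, as given in the post
project_start <- as.Date("2019-03-01")

# With no gaps, each end date is the project start plus the running total
# of durations, and each start date is that end date minus the task's duration.
tasks$End   <- project_start + cumsum(tasks$Duration)
tasks$Start <- tasks$End - tasks$Duration

print(tasks)
```

From there, the Start and End columns are what a plotting layer would need to draw one horizontal bar per task.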

RStudio Integration With Databricks

Brian Dirking, et al., announce integration between RStudio and the Databricks platform:

With Databricks RStudio Integration, both popular R packages for interacting with Apache Spark, SparkR or sparklyr, can be used inside the RStudio IDE on Databricks. When multiple users use a cluster, each creates a separate SparkR Context or sparklyr connection, but they are all talking to a single Databricks managed Spark application allowing unique opportunities for collaboration between users. Together, RStudio can take advantage of Databricks’ cluster management and Apache Spark to perform tasks such as a massive model selection as noted in the figure below.

I like seeing this level of integration, especially from a language like R, which has historically been limited to operating on a single machine’s memory.

Gathering Punctuation With tidytext

Julia Silge uses the tidytext package to compare works of literature in terms of punctuation usage:

Commas are the PUNCTUATION WINNER, except in Anne of Green Gables and Ulysses, where periods win out. These two novels are dramatically different from each other in other ways, though, and Ulysses is an outlier overall with almost no spoken dialogue via quotation marks and an unusual use of colons to semicolons. Exclamation marks are used relatively more in Wuthering Heights and Alice in Wonderland!

Exploring text in these kinds of ways is so fun, and tools for this type of text mining are developing so fast. You can incorporate information like this into modeling or statistical analysis; Mike Kearney has a package called textfeatures that lets you directly extract info such as the number of commas or number of exclamation marks from text. Let me know if you have any questions!

Yet more proof that Ulysses was an awful book.
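For a feel of what counting punctuation involves, here is a minimal base R sketch. It does not use tidytext and is not Julia's approach; the two sample sentences are invented for illustration, and it simply splits text into single characters and tabulates the ones that match the punctuation class.

```r
# Invented sample text (an assumption; the post works with full novels)
text <- c("Hello, world! How are you?", "Fine; thanks, and you?")

# Split into individual characters and keep only punctuation
chars <- unlist(strsplit(text, ""))
punct <- chars[grepl("[[:punct:]]", chars)]

# Tabulate and sort, most frequent mark first
counts <- sort(table(punct), decreasing = TRUE)
print(counts)
```

On a real corpus you would want to do this per book and normalize by total characters or words before comparing authors.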

Converting XML To R Dataframes

Joachim Zuckarelli announces a new package:

The new R package flatxml provides functions to easily deal with XML files. When parsing an XML document fxml_importXMLFlat produces a special dataframe that is ‘flat’ by its very nature but contains all necessary information about the hierarchical structure of the underlying XML document (for details on the dataframe see the reference for the fxml_importXMLFlat function). flatxml offers a set of functions to work with this dataframe.

Apart from representing the XML document in a dataframe structure, there is yet another way in which flatxml relates to dataframes: the fxml_toDataFrame function can be used to extract data from an XML document into a dataframe, e.g. to work on the data with statistical functions. Because in this case there is no need to represent the XML document structure as such (it’s all about the data contained in the document), there is no representation of the hierarchical structure of the document any more, it’s just a normal dataframe.

Very interesting. I’ve struggled a bit more with the xml2 package than I’d care to admit, so I might give this one a try. H/T R-Bloggers.

Using LIME To Explain Keras Models

Shirin Glander shows us how to use the LIME package to explain image recognition models built from Keras:

The segmentation of an image into superpixels is an important step in generating explanations for image models. It is both important that the segmentation is correct and follows meaningful patterns in the picture, but also that the size/number of superpixels is appropriate. If the important features in the image are chopped into too many segments the permutations will probably damage the picture beyond recognition in almost all cases leading to a poor or failing explanation model. As the size of the object of interest is varying it is impossible to set up hard rules for the number of superpixels to segment into – the larger the object is relative to the size of the image, the fewer superpixels should be generated. Using plot_superpixels it is possible to evaluate the superpixel parameters before starting the time-consuming explanation function.

Fun stuff. I’m glad that there’s a lot of work going into explaining neural networks rather than hand-waving them off as magic.

Learning R Or Python?

David Smith tackles the age-old question:

If your interests lean more towards traditional statistical analysis and inference as used within industries like manufacturing, finance, and the life sciences, I’d lean towards R. If you’re more interested in machine learning and artificial intelligence applications, I’d lean towards Python. But even that’s not a hard-and-fast rule: R has excellent support for machine learning and deep learning frameworks, and Python is often used for traditional data science applications.

One thing I am quite sure of though: neither Python nor R is inherently better than the other, and arguments on that front are ultimately futile. (Trust me, I’ve been there.) Which is better for any given person depends on a wide variety of factors, and for some, it may even be worthwhile to learn both. Brian Ray recently posted a good overview of the factors that may lead you towards R or Python for data science: their history, the community, performance, third-party support, use cases, and even how to use them together. It’s great food for thought if you’re trying to decide which community to invest in.

Embrace the power of “and.”  The whole R versus Python bit is fun for purposes of arguing with people, but they’re both powerful languages and we’re seeing more and more overlap—for example, the Keras package David mentions runs Python’s TensorFlow under the covers.

Graphics In R

David Smith is following the kerfuffle that Edward Tufte unleashed on Twitter recently:

While graphics guru Edward Tufte recently claimed that “R coders and users just can’t do words on graphics and typography” and need additional tools to make graphics that aren’t “clunky”, data journalists at major publications beg to differ. The BBC has been creating graphics “purely in R” for some time, with a typography style matching that of the BBC website. Senior BBC Data Journalist Christine Jeavans offers several examples, including this chart of life expectancy differences between men and women:

I think Tufte’s off base here.

Counting Arguments In R

Neil Saunders shares methods for interrogating argument lists in R:

“Some R functions have an awful lot of arguments”, you think to yourself. “I wonder which has the most?”

It’s not an original thought: the same question as applied to the R base package is an exercise in the Functions chapter of the excellent Advanced R. Much of the information in this post came from there.

There are lots of R packages. We’ll limit ourselves to those packages which ship with R, and which load on startup. Which ones are they?

It’s a fun exercise and helpful for learning a bit more about how to work with arguments when metaprogramming in R.
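As a taste of the technique, here is a minimal sketch restricted to the base package, along the lines of the Advanced R exercise mentioned above. It is not Neil's code, and it only looks at base rather than every package that loads on startup.

```r
# Grab every object exported from the base package
objs <- mget(ls("package:base"), envir = baseenv())

# Keep only the functions
funs <- Filter(is.function, objs)

# Count formal arguments; primitives have no formals(), so they count as 0
n_args <- sapply(funs, function(f) length(formals(f)))

# The five functions with the longest argument lists
print(head(sort(n_args, decreasing = TRUE), 5))
```

Swapping "package:base" for another entry from search() would, under the same assumptions, extend the count to the other packages attached at startup.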
