Multivariate Analysis In R

Kevin Feasel



Mala Mahadevan looks at using R to describe data sets with two explanatory variables:

From the plot we can see that type 3 trees have the smallest circumference while type 4 have the largest, with type 2 close to type 4. We can also see that type 1 trees have the thinnest dispersion of circumference while type 4 has the highest, closely followed by type 2. We can also see that there are no significant outliers in this data.

Understanding whether variables are categorical or continuous is vital to understanding what you can and should do with them.
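
As a rough illustration of that split, here is a minimal sketch of a box plot of a continuous variable grouped by a categorical one. It assumes that R's built-in Orange data set is a reasonable stand-in for the data in the quoted post; the excerpt does not name its data source, and the axis labels and title are my own.

```r
# Assumption: the built-in Orange data set stands in for the data
# discussed in the quoted post.
data(Orange)

# Tree is categorical (a factor); circumference is continuous (numeric).
print(class(Orange$Tree))
print(class(Orange$circumference))

# Summarise the continuous variable within each level of the categorical one.
circ_by_tree <- tapply(Orange$circumference, Orange$Tree, median)
spread_by_tree <- tapply(Orange$circumference, Orange$Tree, IQR)
print(circ_by_tree)
print(spread_by_tree)

# Box plot: one box per tree type, showing centre, spread, and any outliers.
boxplot(circumference ~ Tree, data = Orange,
        xlab = "Tree type", ylab = "Circumference (mm)",
        main = "Orange tree circumference by tree type")
```

The formula `circumference ~ Tree` only makes sense because the left side is numeric and the right side is a factor; with two continuous variables a scatter plot would be the more natural choice.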

Related Posts

R Data Frames And stringsAsFactors

John Mount recommends setting stringsAsFactors = FALSE for data frames in R: R often uses a concept of factors to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string. Tibbles have this set by default. For an explanation as to why it defaults to TRUE for data frames, Roger […]
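
A small sketch of what the setting changes, using made-up data (the vector and column names are hypothetical). Both values are passed explicitly so the result does not depend on the default, which changed to FALSE in R 4.0.0.

```r
# Hypothetical example data, invented for illustration only.
species <- c("oak", "elm", "oak")

# Strings re-encoded as a factor versus kept as plain character data.
df_factor <- data.frame(species = species, stringsAsFactors = TRUE)
df_string <- data.frame(species = species, stringsAsFactors = FALSE)

print(class(df_factor$species))  # "factor"
print(class(df_string$species))  # "character"
```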



John Mount explains the vtreat package that he and Nina Zumel have put together: When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework: Missing, invalid, or out of range values. Categorical variables with large sets of possible levels. Novel categorical levels discovered during test, cross-validation, or […]



December 2016