Category: R

Rolling Calculations In R

Published 2018-03-09 by Kevin Feasel

Steph Locke explains some tricky behavior with window functions in R:

So looking at the code I wrote, you may have expectedc2 to hold NA, 3, 5, ... where it’s taking the current value and the prior value to make a window of width 2. Another reasonable alternative is that you may have expected c2 to hold NA, NA, 3, ... where it’s summing up the prior two values. But hey, it’s kinda working like cumsum() right so that’s ok! But wait, check out c3. I gave c3 a window of width 3 and it gave me NA, 6, 9, ... which looks like it’s summing the prior value, the current value, and the next value. …. That’s weird right?

It turns out the default behaviour for these rolling calculations is to center align the window, which means the window sits over the current value and tries it’s best to fit over the prior and next values equally. In the case of us giving it an even number it decided to put the window over the next values more than the prior values.

Knowing your window is critical when using a window function, and knowing that some functions have different default windows than others helps you be prepared.

Comments closed

Disagreement On Outliers

Published 2018-03-09 by Kevin Feasel

Antony Unwin reviews how various packages track outliers using the Overview of Outliers plot in R:

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Interesting reading.

Comments closed

Data Modeling And Neural Networks

Published 2018-03-09 by Kevin Feasel

I have two new posts in my launching a data science project series. The first one covers data modeling theory:

Wait, isn’t self-supervised learning just a subset of supervised learning? Sure, but it’s pretty useful to look at on its own. Here, we use heuristics to guesstimate labels and train the model based on those guesstimates. For example, let’s say that we want to train a neural network or Markov chain generator to read the works of Shakespeare and generate beautiful prose for us. The way the recursive model would work is to take what words have already been written and then predict the most likely next word or punctuation character.

We don’t have “labeled” data within the works of Shakespeare, though; instead, our training data’s “label” is the next word in the play or sonnet. So we train our model based on the chains of words, treating the problem as interdependent rather than a bunch of independent words just hanging around.

Then, we implement a data model using a neural network:

At this point, I want to build the Keras model. I’m creating a build_model function in case I want to run this over and over. In a real-life scenario, I would perform various optimizations, do cross-validation, etc. In this scenario, however, I am just going to run one time against the full training data set, and then evaluate it against the test data set.

Inside the function, we start by declaring a Keras model. Then, I add three layers to the model. The first layer is a dense (fully-connected) layer which accepts the training data as inputs and uses the Rectified Linear Unit (ReLU) activation mechanism. This is a decent first guess for activation mechanisms. We then have a dropout layer, which reduces the risk of overfitting on the training data. Finally, I have a dense layer for my output, which will give me the salary.

I compile the model using the RMSProp optimizer. This is a good default optimizer for neural networks, although you might try Adagrad, Adam, or AdaMax as well. Our loss function is Mean Squared Error, which is good for dealing with finding the error in a regression. Finally, I’m interested in the Mean Absolute Error–that is, the dollar amount difference between our function’s prediction and the actual salary. The closer to $0 this is, the better.

Click through for those two posts, including seeing how close I get to a reasonable model with my neural network.

Comments closed

Be Wary Of Colliders When Analyzing Data

Published 2018-03-07 by Kevin Feasel

Keith Goldfeld has an interesting demonstration of a collider variable and how it can lead us to incorrect conclusions during analysis:

In this (admittedly thoroughly made-up though not entirely implausible) network diagram, the test score outcome is a collider, influenced by a test preparation class and socio-economic status (SES). In particular, both the test prep course and high SES are related to the probability of having a high test score. One might expect an arrow of some sort to connect SES and the test prep class; in this case, participation in test prep is randomized so there is no causal link (and I am assuming that everyone randomized to the class actually takes it, a compliance issue I addressed in a series of posts starting with this one.)

The researcher who carried out the randomization had a hypothesis that test prep actually is detrimental to college success down the road, because it de-emphasizes deep thinking in favor of wrote memorization. In reality, it turns out that the course and subsequent college success are not related, indicated by an absence of a connection between the course and the long term outcome.

Read the whole thing. H/T R-Bloggers

Comments closed

XGBoost In R

Published 2018-03-07 by Kevin Feasel

Fisseha Berhane explains how to implement Extreme Gradient Boosting in R:

What makes it so popular are its speed and performance. It gives among the best performances in many machine learning applications. It is optimized gradient-boosting machine learning library. The core algorithm is parallelizable and hence it can use all the processing power of your machine and the machines in your cluster. In R, according to the package documentation, since the package can automatically do parallel computation on a single machine, it could be more than 10 times faster than existing gradient boosting packages.

xgboost shines when we have lots of training data where the features are numeric or a mixture of numeric and categorical fields. It is also important to note that xgboost is not the best algorithm out there when all the features are categorical or when the number of rows is less than the number of fields (columns).

xgboost is a nice complement to neural networks, as they tend to be great at different things.

Comments closed

Data Cleansing With R

Published 2018-03-07 by Kevin Feasel

I continue my series on launching a data science project:

Now that we’ve performed some basic analysis, we will clean up the data set. I’m doing most of the cleanup in a single operation, but I do have some comment notes here, particularly around the oddities with SalaryUSD. The SalaryUSD column has a few problems:

Some people put in pennies, which aren’t really that important at the level we’re discussing. I want to strip them out.

Some people put in delimiters like commas or decimal points (which act as commas in countries like Germany). I want to strip them out, particularly because the decimal point might interfere with my analysis, turning 100.000 to $100 instead of $100K.

Some people included the dollar sign, so remove that, as well as any spaces.

It’s not a perfect regex, but it did seem to fix the problems in this data set at least.

Something I’ve liked about the data professionals survey is that there are a few places with room for data cleansing, but not everything is awful. It’s neither artificially clean nor beyond repair, so it’s good for use as an example.

Comments closed

Avoding Direct View() Calls In R

Published 2018-03-06 by Kevin Feasel

John Mount notes that you should not assume that the View() function in R will work:

R tip: get out of the habit of calling View() directly.

View() only works correctly in interactive environments, not currently in RMarkdown contexts. It is better to call something else that safely dispatches to View(), or to something else depending if you are in an interactive or non-interactive session.

Click through for a script which is safe to run whether you’re in R Studio or using knitr to build a document.

Comments closed

Executing R Scripts In SSRS

Published 2018-03-06 by Kevin Feasel

Tomaz Kastrun shows how to include R scripts (and visuals) in SQL Server Reporting Services:

Using the privileges of R language to enrich your data, your statistical analysis or visualization is a simple task to get more out of your reports.

The best practice to embed R code into SSRS report is to create stored procedure and output the results to report. To demonstrate this, we will create two reports; one that will take two input parameters and the second one to demonstrate the usage of R visualization.

It’s nice to be able to use R to create nice visuals and then import them in your SSRS report, and Tomaz shows how.

Comments closed

Investigating London Crime Data

Published 2018-03-02 by Kevin Feasel

Carl Goodwin digs into London crime data by borough and sees if he can predict crime rates:

Optimal predictions sit close to, or on, the dashed line in the graphic below, i.e. where the prediction for each observation equals the actual. The Root Mean Squared Error (RMSE) measures the average differences, so should be as small as possible. And R-squared measures the correlation between prediction and actual, where 0 reflects no correlation, and 1 perfect positive correlation.

Our supervised machine learning outcomes from the CART and GLMmodels have weaker RMSEs, and visually exhibit some dispersion in the predictions at higher counts. Stochastic Gradient Boosting, Cubist and Random Forest have handled the higher counts better as we see from the visually tighter clustering.

It was Random Forest that produced marginally the smallest prediction error. And it was a parameter unique to the Random Forest model which almost tripped me up as discussed in the supporting documentation.

Also be sure to read his notebook to get the full story. H/T R-Bloggers

Comments closed

Using drop = FALSE On Data Frames

Published 2018-03-01 by Kevin Feasel

John Mount explains why you might want to add drop = FALSE to your data.frame operations:

We were merely trying to re-order the rows and the result was converted to a vector. This happened because the rules for [ , ] change if there is only one result column. This happens even if the there had been only one input column. Another example is: d[,] is also vector in this case.

The issue is: if we are writing re-usable code we are often programming before we know complete contents of a variable or argument. For a data.frame named “g” supplied as an argument: g[vec, ] can be a data.frame or a vector (or even possibly a list). However we do know if g is a data.frame then g[vec, , drop = FALSE] is also a data.frame(assuming vec is a vector of valid row indices or a logical vector, note: NA induces some special cases).

We care as vectors and data.frames have different semantics, so are not fully substitutable in later code.

Definitely read the comments on this one as well, as John extends his explanation and others chime in with very useful notes.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31