Investigating London Crime Data

Carl Goodwin digs into London crime data by borough and sees if he can predict crime rates:

Optimal predictions sit close to, or on, the dashed line in the graphic below, i.e. where the prediction for each observation equals the actual. The Root Mean Squared Error (RMSE) measures the average differences, so should be as small as possible. And R-squared measures the correlation between prediction and actual, where 0 reflects no correlation, and 1 perfect positive correlation.

Our supervised machine learning outcomes from the CART and GLMmodels have weaker RMSEs, and visually exhibit some dispersion in the predictions at higher counts. Stochastic Gradient Boosting, Cubist and Random Forest have handled the higher counts better as we see from the visually tighter clustering.

It was Random Forest that produced marginally the smallest prediction error. And it was a parameter unique to the Random Forest model which almost tripped me up as discussed in the supporting documentation.

Also be sure to read his notebook to get the full story.  H/T R-Bloggers

Related Posts

Data Science And Data Engineering In HDP 3.0

Saumitra Buragohain, et al, show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers: We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different […]

Read More

Using wrapr For A Consistent Pipe With ggplot2

John Mount shows how you can use the wrapr pipe to perform data processing and building a ggplot2 visual: Now we can run a single pipeline that combines data processing steps and ggplot plot construction. data.frame(x = 1:20) %.>% mutate(., y = cos(3*x)) %.>% ggplot(., aes(x = x, y = y)) %.>% geom_point() %.>% geom_line() %.>% ggtitle("piped ggplot2") Check […]

Read More


March 2018
« Feb Apr »