Category: R

Optimal predictions sit close to, or on, the dashed line in the graphic below, i.e. where the prediction for each observation equals the actual. The Root Mean Squared Error (RMSE) measures the average differences, so should be as small as possible. And R-squared measures the correlation between prediction and actual, where 0 reflects no correlation, and 1 perfect positive correlation.

Our supervised machine learning outcomes from the CART and GLMmodels have weaker RMSEs, and visually exhibit some dispersion in the predictions at higher counts. Stochastic Gradient Boosting, Cubist and Random Forest have handled the higher counts better as we see from the visually tighter clustering.

It was Random Forest that produced marginally the smallest prediction error. And it was a parameter unique to the Random Forest model which almost tripped me up as discussed in the supporting documentation.

Also be sure to read his notebook to get the full story. H/T R-Bloggers

Comments closed

Using drop = FALSE On Data Frames

Published 2018-03-01 by Kevin Feasel

John Mount explains why you might want to add drop = FALSE to your data.frame operations:

We were merely trying to re-order the rows and the result was converted to a vector. This happened because the rules for [ , ] change if there is only one result column. This happens even if the there had been only one input column. Another example is: d[,] is also vector in this case.

The issue is: if we are writing re-usable code we are often programming before we know complete contents of a variable or argument. For a data.frame named “g” supplied as an argument: g[vec, ] can be a data.frame or a vector (or even possibly a list). However we do know if g is a data.frame then g[vec, , drop = FALSE] is also a data.frame(assuming vec is a vector of valid row indices or a logical vector, note: NA induces some special cases).

We care as vectors and data.frames have different semantics, so are not fully substitutable in later code.

Definitely read the comments on this one as well, as John extends his explanation and others chime in with very useful notes.

Comments closed

Microsoft ML Server 9.3 Released

Published 2018-02-28 by Kevin Feasel

Nagesh Pabbisetty announces Microsoft Machine Learning Server 9.3:

In ML Server 9.3, we have added support for SQL compute context in ML Server and in R Client running on Linux platforms, so data scientists who work on Linux workstations can directly use in-database analytics with SQL Server compute context. Additionally, the SQLRUtils package can now be used to package the R scripts into T-SQL stored procedures and run them from R environment on Linux clients.

An interesting scenario enabled by the addition of SQL Server Compute context in ML Server running on Linux is that organizations can now provide a browser-based interface for accessing SQL Server compute context with R Studio Server and ML Server running on a Linux machine connecting to SQL Server.

Since introducing revoscalepy library in the last release of ML Server and SQL Server 2017, we have shipped several additions and improvements in the Python APIs as part of CU releases of SQL Server 2017. We have added APIs like rx_create_col_info, rx_get_var_info etc. that make it easier to get column information, esp. with large number of columns. We added rx_serialize_model for easy model serialization. We have also improved performance when working with string data in different scenarios.

This also gets you up to R 3.4.3. H/T David Smith

Comments closed

Looping In Python And R

Published 2018-02-28 by Kevin Feasel

Dmitry Kisler has a quick comparison of looping speed in Python and R:

This post is about R versus Python in terms of the time they require to loop and generate pseudo-random numbers. To accomplish the task, the following steps were performed in Python and R (1) loop 100k times ( $i$ is the loop index) (2) generate a random integer number out of the array of integers from 1 to the current loop index $i$ ( $i$ +1 for Python) (3) output elapsed time at the probe loop steps: $i$ ( $i$ +1 for Python) in [10, 100, 1000, 5000, 10000, 25000, 50000, 75000, 100000]

The findings were mostly unsurprising to me, though there was one unexpected twist.

Comments closed

Launching A Sparklyr Cluster

Published 2018-02-27 by Kevin Feasel

David Smith shows how to launch a sparklyr cluster in Azure:

When you’re finished, shut down your cluster using the aztk spark cluster delete command. (While you can delete the nodes from the Pools view in the Azure portal, the command does some additional cleanup for you.) You’ll be charged for each node in the cluster at the usual VM rates for as long as the cluster is provisioned. (One cost-saving option is to use low-priority VMs for the nodes, for savings of up to 90% compared to the usual rates.)

That’s it! Once you get used to it, it’s all quick and easy — the longest part is waiting for the cluster to spin up in Step 5. This is just a summary, but the full details see the guide SparklyR on Azure with AZTK.

It’ll take a bit more than five minutes to get started, but it is a good sight easier than building the servers yourself.

Comments closed

Web Analytics With R

Published 2018-02-22 by Kevin Feasel

Maelle Salmon performs some analysis on the Locke Data blog:

Often, the URL of a blog post can be guessed based on its title, e.g. this one can be read here. But even if the transition from the Markdown file information to an URL is logical, it was best to get URLs from the in situ blog posts, and then join them to the blog post information collected previously, since some special characters got special treatment that I could not fully understand by looking at blogdown source code.

I first extracted all posts URLs from the website map.

Check it out.

Comments closed

Using stringr To Remove HTML

Published 2018-02-22 by Kevin Feasel

I have a quick post on removing HTML markup with stringr:

This is a quick post today on removing HTML tags using the stringr package in R.

My purpose here is in taking some raw data, which can include HTML markup, and preparing it for a vectorizer. I don’t need the resulting output to look pretty; I just want to get rid of the HTML characters.

Click through for the script. If you need to do something nice with the text afterward, my technique is probably too much sledgehammer for niceties, but it does the trick for pre-processing before vectorization.

Comments closed

Loops Versus Apply: Speed Comparison

Published 2018-02-19 by Kevin Feasel

Mike Spencer compares lapply (single core and its multi-core version) versus a for loop in R:

But how fast were they? Can we get faster? Thankfully R provides `system.time()` for timing code execution. In order to get faster, it makes sense to use all the processing power our machines have. The ‘parallel’ library has some great tools to help us run our jobs in parallel and take advantage of multicore processing. My favourite is `mclapply()`, because it is very very easy to take an `lapply` and make it multicore. Note that mclapply doesn’t work on Windows. The following script runs the `read_clean_write()` function in a for loop (boo, hiss), lapply and mclapply. I’ve run these as list elements to make life easier later on.

It’s interesting reading, particularly because I had expected lapply to do a little bit better. Also interesting is the relative overhead cost of mclapply in this scenario: going from 1 core to 4 cut the time to approximately 1/3, not 1/4.

Comments closed

Quoted Concatenation In R

Published 2018-02-19 by Kevin Feasel

John Mount has a quick tip for R users:

Here is an R tip. Need to quote a lot of names at once? Use qc().

This function is part of wrapr.

Comments closed

The Basics Of PCA In R

Published 2018-02-16 by Kevin Feasel

Prashant Shekhar gives us an overview of Principal Component Analysis using R:

PCA changes the axis towards the direction of maximum variance and then takes projection on this new axis. The direction of maximum variance is represented by Principal Components (PC1). There are multiple principal components depending on the number of dimensions (features) in the dataset and they are orthogonal to each other. The maximum number of principal component is same as a number of dimension of data. For example, in the above figure, for two-dimension data, there will be max of two principal components (PC1 & PC2). The first principal component defines the most of the variance, followed by second principal component, third principal component and so on. Dimension reduction comes from the fact that it is possible to discard last few principal components as they will not capture much variance in the data.

PCA is a useful technique for reducing dimensionality and removing covariance.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31