Category: R

Let’s install the package if it hasn’t been installed yet. The easiest way to do that is to run RGUI.exe that came with your SQL Server 2017 In-Database Machine Learning installation. You can find it here:

C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES\bin\x64

Take note that you need to run the executable as Administrator. Also, if you’ve installed the R engine prior to your SQL Server 2017 In-Database Machine Learning with R, you have to explicitly tell the R package installer where you want your package installed.

> install.packages("ggplot2", lib="C:\\Program Files\\Microsoft SQL Server\\MSSQL14.MSSQLSERVER\\R_SERVICES\\library", dep = TRUE)

dep = TRUE tells the installer to install dependencies. ggplot2 depends on a lot of other packages. You can check dependencies using MiniCRAN.
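
The "install it if it hasn't been installed yet" step can also be scripted. This is a minimal sketch rather than the original author's method: is_installed() and install_if_missing() are hypothetical helper names, and the library path is the one from the command above, so it only exists on a machine with SQL Server 2017 R Services installed under the default instance name.

```r
# Hypothetical helper: is `pkg` present in library `lib`?
# lib = NULL means "search the default .libPaths()".
is_installed <- function(pkg, lib = NULL) {
  pkg %in% rownames(installed.packages(lib.loc = lib))
}

# Hypothetical helper: install `pkg` (with dependencies) only when it is missing.
install_if_missing <- function(pkg, lib = NULL) {
  if (is_installed(pkg, lib)) {
    return(invisible(FALSE))  # already there, nothing to do
  }
  if (is.null(lib)) lib <- .libPaths()[1]
  install.packages(pkg, lib = lib, dependencies = TRUE)
  invisible(TRUE)
}

# Path taken from the command above; adjust it for your instance name.
sql_r_lib <- "C:\\Program Files\\Microsoft SQL Server\\MSSQL14.MSSQLSERVER\\R_SERVICES\\library"

# Uncomment to run the install (needs an elevated R session):
# install_if_missing("ggplot2", lib = sql_r_lib)
```

The install call is left commented out so that sourcing the script has no side effects until you opt in.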

Another option for installation is to bootstrap install via T-SQL: you can execute external scripts which run install.packages() directly rather than using RGUI, if that makes more sense with your deployment process.

Comments closed

Visualizing Geo-Spatial Data In R

Published 2018-04-03 by Kevin Feasel

Carson Sievert shows off the plotly library:

You might be wondering, “What can plotly offer over other interactive mapping packages such as leaflet, mapview, mapedit, etc?”. One big feature is the linked brushing framework, which works best when linking plotly together with other plotly graphs (i.e., only a subset of brushing features are supported when linking to other crosstalk-compatible htmlwidgets). Another is the ability to leverage the plotly.js API to make efficient updates in shiny apps via plotlyProxy(). Speaking of efficiency, plotly.js keeps on improving the performance of their WebGL-based rendering, so I recommend trying plot_ly() (with toWebGL()) and/or plot_mapbox() if you have lots of graphical elements to render. Also, by having a consistent interface between these various mapping approaches, it’s much quicker and easier to switch from one approach to another when you need to leverage a different set of strengths and weaknesses.

Plotly’s on my list of things I’ll eventually get to one of these days. H/T R-Bloggers

Using Python Within R

Published 2018-04-02 by Kevin Feasel

David Smith points out the new reticulate package:

With reticulate, you can:

Import objects from Python, automatically converted into their equivalent R types. (For example, Pandas data frames become R data.frame objects, and NumPy arrays become R matrix objects.)
Import Python modules, and call their functions from R
Source Python scripts from R
Interactively run Python commands from the R command line
Combine R code and Python code (and output) in R Markdown documents, as shown in the snippet below

The first thing that came to mind when reading this was the implementation of the keras package in R and how it calls out to TensorFlow (written in Python). The ability to make R vs Python an “and” instead of an “or” proposition is quite powerful.
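
To make a couple of those bullet points concrete, here is a small sketch. It assumes the reticulate package and a working Python installation are available; the guard skips everything when they are not. The variable names (py_sqrt, py_math, py_env) are mine, not from David's post.

```r
py_sqrt <- NA_real_  # stays NA when reticulate or Python is unavailable

if (requireNamespace("reticulate", quietly = TRUE) &&
    reticulate::py_available(initialize = TRUE)) {
  # Import a Python module and call one of its functions from R.
  py_math <- reticulate::import("math")
  py_sqrt <- py_math$sqrt(16)

  # Run Python code from a string and read a variable back into R.
  py_env <- reticulate::py_run_string("squares = [i * i for i in range(4)]")
  print(py_env$squares)
}

py_sqrt
```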

Working With forcats

Published 2018-03-30 by Kevin Feasel

S. Richter-Walsh demonstrates what the forcats R package can do:

Synonymous factor levels

Sometimes a categorical variable may have two or more factor levels that refer to the same group. There may be subtle differences in syntax such as upper case leading letter versus lower case leading letter (GroupA vs. groupA), for example. In this situation, one can use forcats::fct_collapse() to collapse the synonymous levels into one. In our test data, let’s assume that Web and Online refer to the same sales channel and we want to combine both into a factor level called Online….
df$sales <- fct_collapse(df$sales, Online = c("Online", "Web"))

I don’t use forcats that often, but when I do, I definitely appreciate it being here. H/T R-Bloggers
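
Here is a self-contained version of that fct_collapse() call. The data frame is hypothetical test data of my own, and the base R branch is a fallback I added for machines without forcats, not something from the original post.

```r
# Hypothetical test data: `sales` holds the sales channel as a factor.
df <- data.frame(
  sales  = factor(c("Online", "Web", "Store", "Web", "Online", "Store")),
  amount = c(10, 20, 30, 40, 50, 60)
)

if (requireNamespace("forcats", quietly = TRUE)) {
  # Collapse the synonymous levels "Online" and "Web" into "Online".
  df$sales <- forcats::fct_collapse(df$sales, Online = c("Online", "Web"))
} else {
  # Base R fallback: renaming a level to an existing name merges the two.
  levels(df$sales)[levels(df$sales) == "Web"] <- "Online"
}

levels(df$sales)
table(df$sales)
```

Either branch leaves two levels, with the former Web rows counted under Online.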

Plotting In R Using ggplot2

Published 2018-03-30 by Kevin Feasel

The folks at Sharp Sight Labs have another nice demo of ggplot2:

You’ve heard me say it a thousand times: to master data science, you need to practice.

You need to “practice small” by practicing individual techniques and functions. But you also need to “practice big” by working on larger projects.

To get some practice, my recommendation is to find reasonably sized datasets online and plot them.

Wikipedia is a nearly-endless source of good datasets. The great thing about Wikipedia is that many of the datasets are small and well contained. They are also fairly clean, with just enough messiness to make them a bit of a challenge.

As a quick example, this week, we’ll plot some economic data.

The code is deceptively easy considering the scope of the problem.

Why Does Empirical Variance Use n-1 Instead Of n?

Published 2018-03-28 by Kevin Feasel

Sebastian Sauer gives us a simulation showing why we use n-1 instead of n as the denominator when calculating the variance of a sample:

Our results show that the variance of the sample is smaller than the empirical variance; however, even the empirical variance is a little too small compared with the population variance (which is 1). Note that sample size was $n = 10$ in each draw of the simulation. With sample size increasing, both should get closer to the “real” (population) variance (although the bias is negligible for the empirical variance). Let’s check that.

This is an R-heavy post and does a great job of showing that it’s necessary, and ends with recommended reading if you want to understand the why.
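
If you want to see the effect without leaving your console, the following is a rough sketch of that kind of simulation. It is not Sebastian's code: the variable names, the seed, and the 10,000 draws are my own choices, while the standard normal population (variance 1) and n = 10 come from the quote.

```r
set.seed(42)

n       <- 10      # sample size per draw, as in the quoted post
n_draws <- 10000   # number of simulated samples (arbitrary choice)

# Each column is one sample of size n from a standard normal population.
samples <- replicate(n_draws, rnorm(n))

# Variance with n in the denominator.
var_n <- apply(samples, 2, function(x) mean((x - mean(x))^2))

# Variance with n - 1 in the denominator (what var() computes).
var_n1 <- apply(samples, 2, var)

mean(var_n)   # tends toward (n - 1) / n = 0.9, so biased low
mean(var_n1)  # tends toward the population variance of 1
```

The two estimators differ by exactly the factor (n - 1) / n in every sample, which is why the n-denominator version comes out systematically too small on average.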

ggplot2 Geoms And Aesthetics

Published 2018-03-22 by Kevin Feasel

Tyler Rinker digs into ggplot2’s geoms and aesthetics:

I thought it might be fun to use the geoms’ aesthetics to see if we could cluster aesthetically similar geoms closer together. The heatmap below uses cosine similarity and hierarchical clustering to reorder the matrix that will allow for like geoms to be found closer to one another (note that today I learned from “R for Data Science” about the seriation package [https://cran.r-project.org/web/packages/seriation/index.html] that may make this matrix reordering task much easier).

It’s an interesting analysis of what’s available within ggplot2 and a detailed look at how different geoms fit together with respect to aesthetic options.
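
The reordering idea can be sketched in base R. This is not Tyler's code, and the geom-by-aesthetic matrix below is a small hypothetical one I made up for illustration, not ggplot2's real aesthetic table.

```r
# Hypothetical indicator matrix: 1 = the geom accepts the aesthetic.
aes_matrix <- matrix(
  c(1, 1, 1, 1, 1, 0, 0,
    1, 1, 1, 0, 1, 0, 1,
    1, 1, 1, 1, 1, 0, 1,
    1, 1, 1, 0, 1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("geom_point", "geom_line", "geom_bar", "geom_text"),
    c("x", "y", "colour", "fill", "size", "label", "linetype")
  )
)

# Cosine similarity between rows: dot product divided by the row norms.
row_norms <- sqrt(rowSums(aes_matrix^2))
cos_sim   <- (aes_matrix %*% t(aes_matrix)) / outer(row_norms, row_norms)

# Hierarchical clustering on the cosine distance (1 - similarity).
hc <- hclust(as.dist(1 - cos_sim), method = "average")

# Reorder rows and columns so similar geoms sit next to each other.
reordered <- cos_sim[hc$order, hc$order]
round(reordered, 2)

# To draw it: heatmap(reordered, Rowv = NA, Colv = NA, scale = "none")
```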

Legible Function Chaining In R

Published 2018-03-22 by Kevin Feasel

John Mount shows a few techniques for legible function chaining with R:

The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. Like all conventions: it is just a matter of teaching, learning, and repetition to make this seem natural, familiar and legible.

My preference is to use dplyr + magrittr because I really do like that pipe operator. John’s point is well-taken, however: you don’t need to use the tidyverse to write clean R code, and there can be value in using the base functionality.
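
As I understand the dot intermediate convention, each step reassigns a variable named "." so that the steps read top to bottom like a pipeline, using only base R. The sketch below uses a hypothetical data frame of my own and is not taken from John's post.

```r
# Hypothetical input data.
d <- data.frame(
  group = c("a", "b", "a", "b", "a", "b"),
  value = c(3, 10, 5, 2, 7, 20)
)

# Each step takes the previous result from `.` and writes the next one back.
. <- d
. <- .[.$value > 2, , drop = FALSE]                        # keep rows with value > 2
. <- aggregate(x = .["value"], by = .["group"], FUN = sum)  # total per group
. <- .[order(-.$value), , drop = FALSE]                    # largest total first
result <- .

result
```

The dplyr + magrittr equivalent would chain filter(), group_by(), summarize(), and arrange() with the pipe operator; the shape of the code is the same, only the plumbing differs.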

R Data Frames And stringsAsFactors

Published 2018-03-20 by Kevin Feasel

John Mount recommends setting stringsAsFactors = FALSE for data frames in R:

R often uses a concept of factors to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string.

Tibbles have this set by default. For an explanation as to why it defaults to TRUE for data frames, Roger Peng has the story.
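
A quick illustration of the difference, using made-up data. Passing the argument explicitly keeps the behavior the same across R versions; note that since R 4.0.0, data.frame() itself defaults to stringsAsFactors = FALSE.

```r
# Strings kept as plain character vectors.
d_chr <- data.frame(x = c("b", "a", "b"), stringsAsFactors = FALSE)

# Strings re-encoded as a factor (integer codes plus a levels lookup).
d_fct <- data.frame(x = c("b", "a", "b"), stringsAsFactors = TRUE)

class(d_chr$x)       # "character"
class(d_fct$x)       # "factor"
levels(d_fct$x)      # "a" "b"
as.integer(d_fct$x)  # 2 1 2, the underlying codes rather than the strings
```

That last line is the classic surprise: code that expects strings silently gets integer codes instead.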

vtreat

Published 2018-03-16 by Kevin Feasel

John Mount explains the vtreat package that he and Nina Zumel have put together:

When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:

Missing, invalid, or out of range values.

Categorical variables with large sets of possible levels.

Novel categorical levels discovered during test, cross-validation, or model application/deployment.

Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).

Nested model bias poisoning results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, which are often ignored leading to degraded performance in production.

vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.

That’s immediately going onto my learn-more list.
