
Category: R

R Or M?

Ryan Wade gives a few scenarios in which R might be a better language choice than M for Power BI integration:

When referring to what can be done in iOS, Apple often say that there is an “app” for that. Likewise, when R developers refer to what can be done in R, we often say that there is a “package” for that. For instance:

· If one needs to scrape data from the web, there are packages for that (rvest, RCurl, and others)

· If one needs to make complicated transformations to their data, there are packages for that (dplyr, tidyr, lubridate, stringr, and others)

I like the F#-ness of M, but I admit that I’m happy there’s some fairly close R integration within Power BI, as that means there’s one fewer language I need to learn right now…


Analytic Tool Usage

Alex Woodie notes the increased popularity of Python for data analysis:

According to the results of the 2016 survey, R is the preferred tool for 42% of analytics professionals, followed by SAS at 39% and Python at 20%. While Python’s placing may at first appear to relegate the language to Bronze Medal status, it’s the delta here that really matters.

It’s interesting to see the breakdowns of who uses which language, comparing across industry, education, work experience, and geographic lines.


Missing Libraries With SQL Server R Services

Tomaz Kastrun has a script to check and install missing packages in SQL Server R Services code:

The result in this case will be successful, with correct R results, and sp_execute_external_script will not return an error for missing libraries.

I added a “fake” library called test123 to test whether all the libraries would be installed successfully.

At the end, the script generates an xp_cmdshell command (in one line).

This is a rather clever solution to a problem which I’d rather not exist. There really ought to be a better way for authorized users to install packages programmatically.


Using Focus() On Correlations

Simon Jackson explains how to use the focus() function in R to narrow down a data frame of correlation coefficients based on a subset of variables:

focus() works similarly to select() from the dplyr package (which is loaded along with the corrr package). You add the names of the columns you wish to keep in your correlation data frame. Extending select(), focus() will then remove the remaining column variables from the rows. This is why mpg does not appear in the rows above. Here’s another example with two variables:

Click through for the entire article.


Understanding Bookmakers’ Odds Using R

Andrew Collier looks at odds, vigs, and other bookmaking concepts through the lens of the R programming language:

The house edge is 2.70%. On average a gambler would lose 2.7% of his stake per game. Of course, on any one game he would either win or lose, but this is the long term expectation. Another way of looking at this is to say that the Return To Player (RTP) is 97.3%, which means that on average a gambler would get back 97.3% of his stake on every game.
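The 2.70% figure can be reproduced with a few lines of arithmetic. Here is a minimal sketch in Python (not the article's R code), assuming the game is an even-money bet on a single-zero wheel with pockets 0 through 36, where zero counts as a loss. The function name and constants are mine, for illustration only.

```python
from fractions import Fraction

# Assumption: single-zero (European) wheel, pockets 0..36.
POCKETS = 37
WINNING_POCKETS = 18  # the even numbers 2..36; zero is a loss


def house_edge(winning: int, total: int, payout: int = 1) -> Fraction:
    """Expected loss per unit staked for a bet that pays `payout` to 1."""
    p_win = Fraction(winning, total)
    expected_return = p_win * payout - (1 - p_win)
    return -expected_return


edge = house_edge(WINNING_POCKETS, POCKETS)
rtp = 1 - edge  # Return To Player

print(f"house edge: {float(edge):.2%}")        # house edge: 2.70%
print(f"return to player: {float(rtp):.2%}")   # return to player: 97.30%
```

The edge works out to exactly 1/37, which is where the 2.70% and 97.3% figures come from.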

Below are the results of a simulation of 100 gamblers betting on even numbers. Each starts with an initial capital of 100. The red line represents the average for the cohort. After 1000 games two gamblers have lost all of their money. Of the remaining 98 players, only 24 have made money while the rest have lost some portion of their initial capital.
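A simulation along these lines is easy to sketch. The Python version below is my own rough approximation, not the article's code: it assumes a single-zero wheel, a fixed stake of 1 unit per game, and an arbitrary random seed, so its counts will not match the numbers quoted above.

```python
import random


def simulate_gambler(capital: int, games: int, rng: random.Random) -> int:
    """Bet 1 unit on even each game; stop early once the capital is gone."""
    for _ in range(games):
        if capital == 0:
            break
        pocket = rng.randrange(37)             # pockets 0..36
        wins = pocket != 0 and pocket % 2 == 0  # zero is a loss
        capital += 1 if wins else -1
    return capital


rng = random.Random(2016)  # arbitrary seed, chosen only for repeatability
finals = [simulate_gambler(100, 1000, rng) for _ in range(100)]

ruined = sum(1 for c in finals if c == 0)
ahead = sum(1 for c in finals if c > 100)
print(f"mean final capital: {sum(finals) / len(finals):.1f}")
print(f"ruined: {ruined}, ahead: {ahead}")
```

Averaging `finals` gives the end point of the cohort's average line; tracking capital after every game rather than only at the end would give the full trajectories.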

This is a very interesting article if you’re interested in basic statistics.  13-year-old Onion article of note.


Microsoft R Server On Spark

Max Kaznady, et al., discuss using Microsoft R Server on Spark to perform rapid prototyping against the NYC Taxi dataset:

Once the cluster is created, you can connect to the edge node where MRS is already pre-installed by SSHing to r-server.YOURCLUSTERNAME-ssh.azurehdinsight.net with the credentials which you supplied during the cluster creation process. In order to do this in MobaXterm, you can go to Sessions, then New Sessions and then SSH.

The default installation of HDI Spark on Linux cluster does not come with RStudio Server installed on the edge node. RStudio Server is a popular open source integrated development environment (IDE) available for R that provides a browser-based IDE for use by remote clients. This tool allows you to benefit from all the power of R, Spark and Microsoft HDInsight cluster through your browser. In order to install RStudio you can follow the steps detailed in the guide, which reduces to running a script on the edge node.

If you’ve been meaning to get further into Spark & R, this is a great article to follow along with on your own.


Quality Graphics With R

David Smith discusses building high-quality visuals with R:

Note the use of an attractive colour palette, style-compatible fonts, and even the official Olympic icons for the sports. I just took a screenshot here, but if you click through to the actual site you’ll notice that these graphics are also scale-independent (you can zoom in on your browser and they’ll look better, not worse) and even interactive (pop-ups appear with country-specific data when you hover over a bar).

Duc-Quang has been generous enough to provide the R code behind these charts if you’d like to try your hand at something similar. The data themselves were scraped from the official Rio 2016 site. The bar charts were created using a standard geom_bar plot using ggplot2, with a custom theme to set the font to OpenSans Condensed. The interactive elements were added using the ggiraph package and the geom_bar_interactive function. The chart titles (including the icons) were created as HTML headers directly, which was then exported along with the interactive charts using the save_html function.

I’m impressed that this all comes from R.  There’s a good bit of work involved in getting this going, but you can get professional-grade graphics quality with R, and that’s pretty cool.


Markov Chains

Sergey Bryl has an introductory-level post on what Markov chains are and how they work:

Using Markov chains allows us to switch from heuristic models to probabilistic ones. We can represent every customer journey (sequence of channels/touchpoints) as a chain in a directed Markov graph where each vertex is a possible state (channel/touchpoint) and the edges represent the probability of transition between the states (including conversion). By computing the model and estimating transition probabilities we can attribute every channel/touchpoint.

Let’s start with a simple example of the first-order or “memory-free” Markov graph to better understand the concept. It is called “memory-free” because the probability of reaching one state depends only on the previous state visited.
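To make the “memory-free” idea concrete, here is a minimal Python sketch (not the post's R code) that estimates first-order transition probabilities by counting which state follows which. The journeys and the state names (start, C1, C2, conversion, null) are hypothetical, made up for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(journeys):
    """Estimate first-order transition probabilities from observed paths."""
    counts = defaultdict(Counter)
    for path in journeys:
        # Only adjacent pairs matter: the next state depends on the current one.
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Hypothetical customer journeys; "null" marks a journey with no conversion.
journeys = [
    ["start", "C1", "C2", "conversion"],
    ["start", "C1", "null"],
    ["start", "C2", "C1", "conversion"],
    ["start", "C2", "null"],
]

probs = transition_probabilities(journeys)
for state, followers in probs.items():
    print(state, followers)
```

Each row of `probs` sums to 1, and nothing earlier than the current state enters the estimate, which is exactly the first-order assumption.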

Markov chains are great for behavior prediction and sentence formation.  This is part one of a series I will eagerly anticipate.  H/T R Bloggers.


Installing R Packages In SQL Server

Tomaz Kastrun shows how to install packages in SQL Server R Services:

Julie Koesmarno made a great post on installing R packages. Please follow this post. Also Microsoft suggests the following way to install R packages on MSDN.

Since I wanted to be able to have packages installed directly from SQL Server Management Studio (SSMS), here is yet another way to do it. I have used xp_cmdshell to install any additional package for my R (optionally you can set EXECUTE AS USER).

This is a bit of a backdoor method, but it does work.


Understanding ROC Curves

Bob Horton explains ROC curves and shows how to create them in R:

ROC curves are commonly used to characterize the sensitivity/specificity tradeoffs for a binary classifier. Most machine learning classifiers produce real-valued scores that correspond with the strength of the prediction that a given case is positive. Turning these real-valued scores into yes or no predictions requires setting a threshold; cases with scores above the threshold are classified as positive, and cases with scores below the threshold are predicted to be negative. Different threshold values give different levels of sensitivity and specificity. A high threshold is more conservative about labelling a case as positive; this makes it less likely to produce false positive results but more likely to miss cases that are in fact positive (lower rate of true positives). A low threshold produces positive labels more liberally, so it is less specific (more false positives) but also more sensitive (more true positives). The ROC curve plots true positive rate against false positive rate, giving a picture of the whole spectrum of such tradeoffs.
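The threshold sweep described above fits in a few lines. This is a minimal Python sketch rather than the article's R code, with made-up labels and scores; it treats a score at or above the threshold as positive, which is a small assumption on my part.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, strictest threshold first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing labelled positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical data: 1 = positive case, 0 = negative case.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(f"AUC: {auc(curve):.3f}")  # AUC: 0.889
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1): every step trades more true positives for more false positives.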

ROC curves are one of the primary techniques for figuring out if a binary classifier “works.”
