Press "Enter" to skip to content

Category: R

Imbalanced Data In R

Rathnadevi Manivannan explains how to deal with imbalanced data using R:

Imbalanced data refers to classification problems where one class outnumbers other class by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extreme imbalanced data can be seen in banking or financial data where majority credit card uses are acceptable and very few credit card uses are fraudulent.

With an imbalanced dataset, the information required to make an accurate prediction about the minority class cannot be obtained using an algorithm. So, it is recommended to use balanced classification dataset.

Rathnadevi uses fraudulent transactions for his sample, but medical diagnoses is also a good example:  suppose 1 person in 10,000 has a particular disease.  You’re 99.99% right if you just say nobody has the disease, but that’s a rather unhelpful model.

Comments closed

Sentiment Analysis In R

Rachel Tatman has a great tutorial introducing sentiment analysis in R:

By the end of this tutorial you will:

  • Understand what sentiment analysis is and how it works
  • Read text from a dataset & tokenize it
  • Use a sentiment lexicon to analyze the sentiment of texts
  • Visualize the sentiment of text

If you’re the hands-on type, you might want to head directly to the notebook for this tutorial. You can fork it and have your very own version of the code to run, modify and experiment with as we go along.

Check it out.  There’s a lot more to sentiment analysis—cleaning and tokenizing words, getting context right, etc.—but this is a very nice introduction.

Comments closed

Sparklines In R

Robert Sheldon shows how to use SQL Server R Services to display sparklines for categories:

In this article, we continue our discussion on visualizations, but switch the focus to sparklines and other spark graphs. As with many aspects of the R language, there are multiple options for generating spark graphs. For this article, we’ll focus on using the sparkTable package, which allows us to create spark graphs and build tables that incorporate those graphs directly, a common use case when working with spark images.

In the examples to follow, we’ll import the sparkTable package and generate several graphs, based on data retrieved from the AdventureWorks2014 sample database. We’ll also build a table that incorporates the SQL Server data along with the spark graphs. Note, however, that this article focuses specifically on working with the sparkTable package. If you are not familiar with how to build R scripts that incorporate SQL Server data, refer to the previous articles in this series. You should understand how to use the sp_execute_external_script stored procedure to retrieve SQL Server data and run R scripts before diving into this article.

Sparklines and associated visuals have their place in the world.  Read on to see how you can build a report displaying them.

Comments closed

Errors Using Native Prediction In SQL Server

Sacha Tomey walks us through a few potential issues when converting code which uses SQL Server Machine Learning Services’s sp_execute_external_script procedure to native PREDICT calls:

Stumble One:

Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80004001.
Model type is unsupported.

Reason:

Not all models are supported. At the time of writing, only the following models are supported:

  • rxLinMod
  • rxLogit
  • rxBTrees
  • rxDtree
  • rxdForest

sp_rxPredict supports additional models including those available in the MicrosoftML package for R (I was using attempting to use rxFastTrees). I presume this limitation will reduce over time. The list of supported models is referenced in the PREDICT function (Documentation).

sp_rxPredict does require CLR, but it’s a viable alternative if you need to use a model not currently supported—like rxNeuralNet.

Comments closed

Custom Snippets In RStudio

Mara Averick shows how to create a custom code snippet in RStudio:

The RStudio Support Code Snippets post is a great step-by-step for adding snippets of your own. The gist of it for a markdown snippet is as follows:

  1. Open RStudio Preferences

  2. Go to the Code section

  3. Click the Edit Snippets button3

  4. Select Markdown

  5. Add your snippet and Save

Click through for a demo of how to embed tweets into your RMarkdown documentation.

Comments closed

Regular Expressions With R

Dave Mason looks at using SQL Server R Services to execute regular expressions against a T-SQL data set:

Have you ever had the need to use Regular Expressions directly in SQL Server? I sometimes hear or see others refer to using RegEx in TSQL. But I always assume they’re talking about the TSQL LIKE operator, because RegEx isn’t natively supported. In TSQL’s defence, you can get a lot of mileage out of LIKE and some clever pattern matching strings, even though it’s not authentic RegEx. You can leverage RegEx libraries in the .NET Framework via a CLR stored procedure. You should also be able to do something similar with an old-school extended stored procedure.

I discussed all of this during a recent interview. It was a day or two afterwards (of course) when it dawned on me that there’s another way to leverage RegEx from TSQL: the R language. Prior to this mini-revelation, I had always thought of R (and Python) as strictly a means to an end for Data Science and related disciplines. Now I am thinking I’ve been looking at R and Python through too narrow of a lens and I should take a larger view.

I think I’d prefer CLR for this because there’s additional overhead to making R Services calls, but it’s a clever use of R Services.

Comments closed

SQL Server R Services Troubleshooting

Ginger Grant walks us through a few troubleshooting tactics with SQL Server 2016 R Services:

Much to my surprise after this I received an error

Msg 39019, Level 16, State 1, Line 1
An external script error occurred:
Unable to launch the runtime. ErrorCode 0x80070490: 1168(Element not found.).
Msg 11536, Level 16, State 1, Line 1
EXECUTE statement failed because its WITH RESULT SETS clause specified 1 result set(s), but the statement only sent 0 result set(s) at run time.

I looked in the log files and didn’t find any errors.  I checked the configuration manager to ensure that I had some user ids configured in the configuration manager.  Nothing seemed to make any difference.  Looking online, the only error that I saw which might possibly be close was a different error message about 8.3 naming and the working directory.

This service is somewhat finicky to set up in my experience, though once you have it configured, it tends to be pretty stable.

Comments closed

Recognizing Wood Knot Images

Bob Horton and Vanja Paunic walk through a lumber grading scenario with Microsoft R Server:

Here we use the rxFeaturize function from Microsoft R Server, which allows us to perform a number of transformations on the knot images in order to produce numerical features. We first resize the images to fit the dimensions required by the pre-trained deep neural model we will use, then extract the pixels to form a numerical data set, then run that data set through a DNN pre-trained model. The result of the image featurization is a numeric vector (“feature vector”) that represents key characteristics of that image.

Image featurization here is accomplished by using a deep neural network (DNN) model that has already been pre-trained by using millions of images. Currently, MRS supports four types of DNNs – three ResNet models (18, 50, 101)[1] and AlexNet [8].

This is a practical example of how to use image recognition to facilitate machine learning.

Comments closed

A Cheat Sheet For Purrr

Mara Averick has a list of resources for learning more about purrr:

Purrr royal decree (ok, I’ll stop with the 🐱  puns now), the purrr 📦  now has its very own official RStudio cheat sheetApply Functions Cheat Sheet

The purrr package makes it easy to work with lists and functions. This cheatsheet will remind you how to manipulate lists with purrr as well as how to apply functions iteratively to each element of a list or vector. The back of the cheatsheet explains how to work with list-columns. With list columns, you can use a simple data frame to organize any collection of objects in R.

So, I thought we’d celebrate with a bit of a purrr 🐦  tweet roundup:

Purrr is one of those libraries I know I need to learn more about, and this looks like a good starting point.

Comments closed

Visualizing Networks With R

Arthur Charpentier shows off some of the functionality of igraph:

The good thing is that a lot of functions are available. For instance, we can get shortest paths between two specific nodes. And we can give appropriate colors to the nodes that we’ll cross:

> AP=all_shortest_paths(iflo,
+ from=”Peruzzi”,
+ to=”Ginori”)
> L=AP$res[[1]]
> V(iflo)$color=”yellow”
> V(iflo)$color[L[2:4]]=”light blue”
> V(iflo)$color[L[c(1,5)]]=”blue”
> plot(iflo)

Click through for a demo-heavy example.

Comments closed