If you have a database of credit-card transactions with a small percentage tagged as fraudulent, how can you create a process that automatically flags likely fraudulent transactions in the future? That’s the premise behind the latest Data Science Deep Dive on MSDN. This tutorial provides a step-by-step guide to using the R language and the big-data statistical models of the RevoScaleR package of SQL Server 2016 R Services to build and use a predictive model to detect fraud.
This looks to be a follow-up from the fraud detection series.
Operations that are conceptually simple can be difficult to perform using SQL. Consider the common requirements to pivot or transpose a dataset. Each of these actions is conceptually straightforward but complex to implement using SQL. The examples that follow are somewhat verbose, but the details are not significant. The main point to illustrate is that, by using specialized functions outside of SQL, R makes trivial some of those operations that would otherwise require complex SQL statements. The contrast in the amount of code required is striking. The simpler approach allows you to focus attention on the scientific or business problem at hand, rather than expending energy reading documentation or laboriously testing complex statements.
I consider this where the second-order value of R comes in. The initial “wow” factor is in how easily you can plot things, and this ease of data cleansing is the next big time-saver.
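As a rough illustration of the pivot and transpose operations the excerpt mentions, here is a minimal base R sketch. The `sales` data frame and its values are made up for the example and do not come from the linked article, which may well use different functions.

```r
# Hypothetical long-format data: one row per region and quarter.
# (Illustrative values only, not taken from the linked article.)
sales <- data.frame(
  region  = c("East", "East", "West", "West"),
  quarter = c("Q1", "Q2", "Q1", "Q2"),
  amount  = c(100, 120, 90, 110)
)

# Pivot: regions become rows, quarters become columns.
wide <- with(sales, tapply(amount, list(region, quarter), sum))

# Transpose: swap rows and columns with a single call.
flipped <- t(wide)

print(wide)
print(flipped)
```

Each operation is a single function call here, which is the contrast with SQL that the excerpt is drawing.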
If you were using CTP 3.0 and later ran an in-place upgrade to CTP 3.2, the upgrade will silently break R Services. Uninstalling and reinstalling the R component will not fix the problem, but it can be fixed. There are a few interrelated issues here, so bear with me.
Hopefully you don’t run into this issue, but if you do, at least there’s a fix.
Another mistake I see a lot among beginning R students is forgetting that R cares about case. In other words, the variable “a” is a separate thing from the variable “A”.
NOTE: Package names can be case-sensitive as well.
A lot of this comes down to “learn the syntax.”
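A quick sketch of the case-sensitivity point, using only base R. The variable names are arbitrary, and `Mean` is deliberately misspelled to show that function names follow the same rule.

```r
# "a" and "A" are two unrelated variables.
a <- 1
A <- 2
same_value <- identical(a, A)
print(same_value)

# Function names are case-sensitive too: mean() exists in base R,
# while Mean() is not defined, so the call raises an error that we
# capture here instead of stopping the script.
ok  <- mean(c(1, 2, 3))
bad <- tryCatch(Mean(c(1, 2, 3)), error = function(e) conditionMessage(e))
print(ok)
print(bad)
```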
However, it seems that there might be two kinks in the line:
The first kink occurs somewhere between the 800m distance and the mile. It seems that the sprinting distances (and the 800m is sometimes called a long sprint) have different dynamics from the events up to the marathon.
The analysis is done in R, and the code is available in the post. Check it out.
But R is also part of an entire ecosystem of open tools that can be linked together. For example, Markdown, Pandoc, and knitr combine to make R an incredible tool for dynamic reporting and reproducible research. If your chosen output format is HTML, you’ve linked into yet another open ecosystem with countless further extensions.
Generating a page from R is one of those good ideas that I probably don’t want to see in a production environment.
Not only can we create and download custom visuals from PowerBI.com to extend the capabilities of Power BI, we can use R to create a ridiculous number of powerful visualizations. If you can get the data into Power BI, you can use R to perform interesting statistical analysis and create some pretty cool, interactive visuals.
Dustin and Jan Mulkens are working on similar posts at the same time, so watch both of them.
Jan Mulkens has started a series on combining Power BI and R.
Fact is, R is here to stay. Even Microsoft has integrated R with SQL Server 2016 and has made R scripting possible in its great Azure Machine Learning service.
So it was only a matter of time before we were going to see R integrated in Power BI.
From the previous point, it seems that R is just running in the background and that most of the functionality can be used.
Testing some basic functionality like importing and transforming data in the R visual worked fine.
I haven’t tried any predictive modelling yet but I assume that will just work as well.
So instead of printing “Hello world” to the screen, we’ll use a simple graph to say hello to the world.
First we need some data. Power BI enables us to enter some data in a familiar Excel style.
Just select “Enter Data” and start bashing out some data.
I’m looking forward to the rest of the series.
So I went through and converted everything in my Rtraining to this and realised it messed up my slide decks. It’s been so long since I had built a pure knitr solution that I forgot that knitr::knit is not the same as rmarkdown::render. For my slide decks, if I wanted the ioslides_presentation format, I needed to use rmarkdown::render. The problem with that has been the relative references to the CSS and the logo.
To solve this I read about the custom render formats capability and created a function that produces an ioslides_presentation but with my CSS preloaded by default. This now means that I can produce slides with better file referencing.
Steph has put up all of her R-related presentations and documentation as well, so check that out.
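For a sense of what such a custom format function might look like, here is a minimal sketch. It is not Steph's actual code: the function name `my_slides` and the file paths are hypothetical placeholders, and it assumes the rmarkdown package is installed when the function is called.

```r
# Hypothetical wrapper around rmarkdown::ioslides_presentation().
# "theme.css" and "logo.png" are placeholder paths, not real files
# from the original post.
my_slides <- function(css = "theme.css", logo = "logo.png", ...) {
  # Pass the default CSS and logo through, along with any other
  # ioslides options the caller supplies.
  rmarkdown::ioslides_presentation(css = css, logo = logo, ...)
}

# Possible usage (commented out because it needs rmarkdown, an .Rmd
# file, and the placeholder assets to exist):
# rmarkdown::render("slides.Rmd", output_format = my_slides())
```

Keeping the paths as arguments with defaults means a single deck can still override the CSS without touching the wrapper.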
Detecting fraudulent transactions is a key application of statistical modeling, especially in an age of online transactions. R of course has many functions and packages suited to this purpose, including binary classification techniques such as logistic regression.
If you’d like to implement a fraud-detection application, the Cortana Analytics gallery features an Online Fraud Detection Template. This is a step-by-step guide to building a web service which will score transactions by likelihood of fraud, created in five steps.
Read through for the five follow-up articles. This is a fantastic series and I plan to walk through it step by step myself.
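To make the logistic regression idea concrete, here is a small sketch using base R's `glm()` on simulated data. This is not the template's method: the features (`amount`, `foreign`), the simulated relationship, and all the numbers are assumptions invented for the example.

```r
set.seed(42)  # make the simulation reproducible

# Simulated transactions (purely illustrative, not real fraud data).
n       <- 5000
amount  <- rexp(n, rate = 1 / 50)   # transaction amount
foreign <- rbinom(n, 1, 0.1)        # 1 if made abroad (hypothetical feature)

# Assumed "true" fraud probability, chosen only to drive the simulation.
p_fraud <- plogis(-4 + 0.01 * amount + 2 * foreign)
fraud   <- rbinom(n, 1, p_fraud)
tx      <- data.frame(amount, foreign, fraud)

# Fit a logistic regression: binary outcome with a logit link.
fit <- glm(fraud ~ amount + foreign, data = tx, family = binomial)

# Score two new transactions with predicted fraud probabilities.
new_tx <- data.frame(amount = c(20, 400), foreign = c(0, 1))
scores <- predict(fit, newdata = new_tx, type = "response")
print(round(scores, 3))
```

The scores are probabilities between 0 and 1, so flagging a transaction comes down to picking a threshold that suits how costly a missed fraud is compared with a false alarm.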