Category: R

Benford’s Law

Published 2016-10-31 by Kevin Feasel

Tomaz Kastrun is starting a series on fraud analysis, beginning with Benford’s Law:

One of the samples Microsoft provided with the release of SQL Server 2016 used the simple logic of Benford’s law. This law works well with naturally occurring numbers and can be applied to many kinds of problems. A naturally occurring number is one that is not generated mechanically, such as a page number in a book, an incremented number in your SQL table, or a sequence number of any kind, but one that occurs independently of the others, as in nature (length or width of trees, mountains, rivers), lengths of roads in cities, addresses in your home town, city/country populations, etc. The law describes the logarithmic distribution of leading digits from 1 to 9 and stipulates that the digit one will occur about 30% of the time, two about 17.6% of the time, three about 12.5% of the time, and so on. Randomly generated numbers will most likely show each digit from 1 to 9 with a probability of 1/9. The law also might not work when values are restricted to a narrow range; for example, height expressed in inches will surely not produce a Benford distribution. My height is 188 cm, which is 74 inches or 6ft2. None of these three numbers will generate the correct distribution, even though height is a natural phenomenon.

Tomaz includes SQL Server R Services code, so check it out.
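
The percentages in the quote come from the formula P(d) = log10(1 + 1/d) for a leading digit d. Here is a minimal sketch of that calculation in Python (not Tomaz's R Services code). The function names are made up for illustration, and the powers-of-two sample is my own choice because that sequence is known to track Benford's Law closely.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First significant digit of a nonzero number.

    Simplification: formats the value in scientific notation and reads
    the first character, so values within rounding distance of a power
    of ten could be misclassified.
    """
    return int(f"{abs(x):.15e}"[0])


def leading_digit_shares(values):
    """Observed share of each leading digit 1-9 among nonzero values."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Illustrative sample only: the first 1,000 powers of two.
sample = [2 ** k for k in range(1, 1001)]
expected = benford_expected()
observed = leading_digit_shares(sample)

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

For fraud analysis you would swap the sample for real amounts (invoice totals, say) and look for digits whose observed share strays far from the expected one.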

R-Hub

Published 2016-10-27 by Kevin Feasel

David Smith discusses a new service to test packages on multiple platforms:

If you’re developing a package for R to share with others — on CRAN, say — you’ll want to make sure it works for others. That means testing it on various platforms (Windows, Mac, Linux, and all the versions thereof), and on various versions of R (current, past, and future). But it’s likely you only have access to one platform, and installing and managing multiple R versions can be a pain.

R-hub, the online package-building service now in public beta, aims to solve this problem by making it easy to build and test your package on a variety of platforms and R versions. Using the rhub R package, you can with a single command upload your package to the cloud-based R-hub service, and build and test your package on the current, prior, and in-development versions of R, using any or all of these platforms

This looks like an interesting service for package developers and companies with a broad distribution of R installations.

R Graph Gallery

Published 2016-10-21 by Kevin Feasel

David Smith points out the new R Graph Gallery:

Once upon a time, there was the original R Graph Gallery, by Romain François. Sadly, it’s been unavailable for several years. Now there’s a new R Graph Gallery to fill the void, created by Yan Holtz. It contains more than 200 data visualizations categorized by type, along with the R code that created them.

You can browse the gallery by types of chart (boxplots, maps, histograms, interactive charts, 3-D charts, etc), or search the chart descriptions. Once you’ve found a chart you like, you can admire it in the gallery (and interact with it, if possible), and also find the R code which you can adapt for your own use. Some entries even include mini-tutorials describing how the chart was made. You can even submit your own graph, if you’d like to have it displayed in the gallery as well.

Looks like a good place to go to get some inspiration.

Deploy SQL Server R Services Without Internet Access

Published 2016-10-21 by Kevin Feasel

Arvind Shyamsundar shows how to install SQL Server R Services on a machine without internet access:

When deploying SQL Server R Services, it is important to note that the setup components for SQL Server do not include the Microsoft R Open and Microsoft R Server components. Those ‘R Components’ (as we will refer to them later in this post) are provided as separate downloadable components. SQL Server setup will automatically download these when it is run on a computer that is connected to the Internet. But in cases where setup is done on a computer without Internet access (quite typical of many SQL Server deployments), we need to handle things differently. There is a documented process for doing this. But even with the documentation, we still had some customers with questions on the process.

Inspired by those customer engagements, this blog post walks through the process of setting up SQL Server R Services in environments without Internet access. We walk through a number of scenarios, right from the very basic scenario to the more complex ones involving unattended and ‘smart setup’.

This is a nice walkthrough. I wanted to highlight a link at the end showing how to create a local repository so you can install packages as well.

RTVS 0.5

Published 2016-10-20 by Kevin Feasel

David Smith notes that R Tools for Visual Studio has hit version 0.5:

RTVS also makes it easy to run R code as a SQL Server 2016 stored procedure. (This is a great way to share the results of R code with other database users while making use of the power of the database for R computations.) The new SQL R Stored Procedure file works with SQL Server R Services to create a stored procedure that embeds R code you create, edit and test within RTVS. This greatly simplifies the process of running R code within SQL Server 2016.

The RTVS team is making good progress. If you passed on RTVS early on, it might be time to take another look.

Sparklyr On EMR

Published 2016-10-19 by Kevin Feasel

Tom Zeng shows how to use sparklyr on Amazon Elastic MapReduce (EMR):

The recently released sparklyr package by RStudio has made processing big data in R a lot easier. sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API.

You can also install sparklyr locally and point to a Spark cluster.

R Services Resource Utilization

Published 2016-10-19 by Kevin Feasel

Ginger Grant shows off some R Services reports to see how hard the developers are battering your poor servers with their R scripts:

R Services – Extended Events is also not a report but a list of all the extended events that are available for R Services. This is a handy bit of information, which can be a great reference tool for extended events monitoring. R Services – Packages lists the packages which are currently installed on SQL Server. When people write R, many different packages are used within the script. Prior to running a script, check the information on this report to ensure the libraries used are installed on SQL Server. If a library is missing, the code will not work. R Services – Resource Usage is a great way to see how R has been configured to run on the server. Notice I have created an External Pool for R. This is a configuration recommended by Microsoft to better monitor your R Services.

Click through for more information, and grab the reports from Microsoft’s Github repo.

Linear Models

Published 2016-10-19 by Kevin Feasel

Andrea Spano, et al., are starting a new book:

This chapter is an introduction to the first section of the book, Linear Models, and contains some theoretical explanation and lots of examples. At the end of the chapter you will find two summary tables with Linear model formulae and functions in R and Common R functions for inference.

The book is just getting started, but you can get it from the Quantide website. In the meantime, there are two other books on learning R and developing in R. These books are licensed Creative Commons, so they’re free to read and share.

Machine Learning Algorithms In R

Published 2016-10-18 by Kevin Feasel

Ginger Grant has a list of machine learning algorithms and their implementations in R:

Oftentimes, determining which algorithm to use can take a while. Here is a pretty good flowchart for determining which algorithm should be used, given some examples of what the desired outcomes and data contain. The diagram lists the algorithms which are implemented in Azure ML. The same algorithms can be implemented in R. In R there are libraries to help with nearly every task. Here’s a list of libraries and their accompanying links which can be used in Machine Learning. This list is by no means comprehensive, as there are libraries and functions other than the ones listed here, but if you are trying to write a Machine Learning Experiment in R and are looking at the flowchart, these R functions and libraries will provide the tools to do the types of Machine Learning Analysis listed.

I think algorithm determination is one of the most difficult parts of machine learning. Even if you don’t mean to go there, the garden of forking paths is dangerous.

Association Rules

Published 2016-10-17 by Kevin Feasel

Tomaz Kastrun discusses product variants:

To sum up, association rules is a great and powerful algorithm for finding the correlations between items, and the fact that you can use it straight from SSMS just gives me goosebumps. Currently, the performance is a bit of a drawback. Also, comparing this algorithm to Analysis Services (SSAS) association rules, there are many advantages on the R side because of maneuverability and extracting the data to T-SQL, but keep in mind that SSAS is still a very awesome and powerful tool for statistical analysis and data predictions.

Figuring out variations after the fact is an all-too-common task, and this is a good way of getting some ideas on how to do that.
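
If you haven't worked with association rules before, the three numbers you usually see attached to a rule are support, confidence, and lift. Below is a rough Python sketch of how they are computed for a single rule. It is not Tomaz's R code, the basket data is a toy set I made up, and the function names are hypothetical.

```python
# Toy, made-up baskets purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(lhs, rhs, transactions):
    """Support, confidence, and lift for the rule lhs -> rhs.

    Assumes lhs and rhs each appear in at least one transaction;
    otherwise the divisions below would fail.
    """
    s_both = support(set(lhs) | set(rhs), transactions)
    s_lhs = support(lhs, transactions)
    s_rhs = support(rhs, transactions)
    confidence = s_both / s_lhs   # how often rhs appears when lhs does
    lift = confidence / s_rhs     # above 1 means lhs makes rhs more likely
    return {"support": s_both, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"diapers"}, {"beer"}, transactions)
print(metrics)
```

A real implementation such as Apriori also has to search for candidate itemsets efficiently; this sketch only scores one rule you already have in mind.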
