Press "Enter" to skip to content

Category: R

Basics Of Dplyr

Dave Mason is dipping his toes into the R waters:

I think my first exposure to R was at PASS Summit 2016. Since then, I’ve made an effort to attend R sessions at SQL Saturdays. The one commonality I seem to find in all of them is a demo with (or mention of) the dplyr package. It’s a package of functions that manipulate data in data frame objects (think of them as SQL Server/relational tables…or if you’re a .NET developer, a System.Data.DataTable object). R feels inexorably tied to dplyr at this early stage for me. R is probably way more vast than I realize, but what would it be without dplyr? Would it still be as popular? Would it still be as powerful?

What’s It Good For

I’m not sure if I’m perceiving this the right way yet, but dplyr sure feels a lot like LINQ, a .NET Framework technology that provides query-like capability for C#. For instance, you can select a subset of objects from an array, sort them, find a minimum or maximum, etc. It’s kind of like querying SQL Server, just without SQL Server.

I like the comparison of dplyr against LINQ, as they’re both data querying and transformation tools whose motif is a series of functions chained together.

Comments closed

Building An Image Recognizer With R

David Smith has a post showing how to build an image recognizer with R and Microsoft’s Cognitive Services Library:

The process of training an image recognition system requires LOTS of images — millions and millions of them. The process involves feeding those images into a deep neural network, and during that process the network generates “features” from the image. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these “features”, you could use them in a traditional machine learning model to classify the images, or perform other recognition tasks.

But if you don’t have millions of images, it’s still possible to generate these features from a model that has already been trained on millions of images. ResNet is a very deep neural network model trained for the task of image recognition which has been used to win major computer-vision competitions. With the rxFeaturize function in Microsoft R Client and Microsoft R Server, you can generate 4096 features from this model on any image you provide. The features themselves are meaningful only to a computer, but that vector of 4096 numbers between zero and one is (ideally) a distillation of the unique characteristics of that image as a human would recognize it. You can then use that features vector to create your own image-recognition system without the burden of training your own neural network on a large corpus of images.

Read the whole thing and follow David’s link to the Microsoft Cognitive blog for more details.

Comments closed

Checkpointing Code For Reproduction

David Smith tells an interesting story about a reproducibility problem with data analysis:

Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to accompany their data journalism story about vested interests of Swiss members of parliament. Upon re-running the analysis in R last week, Timo was surprised when the results differed from those published in August 2015. There was no change to the R scripts or data in the intervening two-year period, so what caused the results to be different?

The version of R Timo was using had been updated, but that wasn’t the root cause of the problem. What had also changed was the version of the dplyr package used by the script: version 0.5.0 now, versus version 0.4.2 then. For some unknown reason, a change in the dplyr package in the intervening package caused some data rows (shown in red above) to be deleted during the data preparation process, and so the results changed.

Click through for the solution, which is pretty easy in R.

Comments closed

R Services 182 Error

Joey D’Antoni provides a solution to a tricky SQL Server R Services error:

Recently, and unfortunately I don’t have an exact date on when this started failing (though it was around service pack 1 install time) with the following error:

Error
Msg 39012, Level 16, State 1, Line 10
Unable to communicate with the runtime for ‘R’ script. Please check the requirements of ‘R’ runtime.
STDERR message(s) from external script:

DLL ‘C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER1601\MSSQL\Binn\sqlsatellite.dll’ cannot be loaded.
Error in eval(expr, envir, enclos) :
DLL ‘C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER1601\MSSQL\Binn\sqlsatellite.dll’ cannot be loaded.
Calls: source -> withVisible -> eval -> eval -> .Call
Execution halted
STDOUT message(s) from external script:

Failed to load dll ‘C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER1601\MSSQL\Binn\sqlsatellite.dll’ with 182 error.

Click through to see how to resolve this issue.

Comments closed

Understanding K-Means Clustering

Chaitanya Sagar has a good explanation of the assumptions k-means clustering makes:

Why do we assume in the first place? The answer is that making assumptions helps simplify problems and simplified problems can then be solved accurately. To divide your dataset into clusters, one must define the criteria of a cluster and those make the assumptions for the technique. K-Means clustering method considers two assumptions regarding the clusters – first that the clusters are spherical and second that the clusters are of similar size. Spherical assumption helps in separating the clusters when the algorithm works on the data and forms clusters. If this assumption is violated, the clusters formed may not be what one expects. On the other hand, assumption over the size of clusters helps in deciding the boundaries of the cluster. This assumption helps in calculating the number of data points each cluster should have. This assumption also gives an advantage. Clusters in K-means are defined by taking the mean of all the data points in the cluster. With this assumption, one can start with the centers of clusters anywhere. Keeping the starting points of the clusters anywhere will still make the algorithm converge with the same final clusters as keeping the centers as far apart as possible.

Read on as Chaitanya shows several examples; the polar coordinate transformation was quite interesting.  H/T R-Bloggers

Comments closed

Parallel Processing In R

Chaitanya Sagar shows a few methods for parallelizing code in R:

Parallel programming may seem a complex process at first but the amount of time saved after executing tasks in parallel makes it worth the try. Functions such as lapply() and sapply() are great alternatives to time consuming looping functions while parallel, foreach and doParallel packages are great starting points to running tasks in parallel. These parallel processes are based on functions and are also modular. However, with great power comes a risk of code crashes. Hence it is necessary to be careful and be aware of ways to control memory usage and error handling. It is not necessary to parallelize every piece of code that you write. You can always write sequential code and decide to parallelize the parts which take significant amounts of time. This will help in further reducing out of memory instances and writing robust and fast codes. The use of parallel programing method is growing and many packages now have parallel implementations available. With this article. one can dive deep into the world of parallel programming and make full use of the vast memory and processing power to generate output quickly. The full code for this article is as follows.

If you’re using Microsoft R server, there are additional parallelism options. H/T R-Bloggers

Comments closed

Sympathy For The Part-Timer

John Mount wants us to think about part-time users:

The second point I think is particularly interesting. It means:

An R user who does not consider themselves an expert programmer could be maintaining code that they understand, but could not be expected to create from scratch.

Or:

Let’s have some sympathy for the part-time R user.

This is the point we will emphasize in our new example.

Read on for a particular example.  I think this is good advice to generalize:  write your code to make it as easy as possible for “part-time” users.  This applies to custom code you write as well, as unless you are constantly in a particular part of the code base, you’ll forget the details later and have the same problems that a part-timer would have working with a different language.

Comments closed

Explaining Singular Value Decomposition

Tim Bock explains how Singular Value Decomposition works:

The table above is a matrix of numbers. I am going to call it Z. The singular value decomposition is computed using the svd function. The following code computes the singular value decomposition of the matrix Z, and assigns it to a new object called SVD, which contains one vector, d, and two matrices, u and v. The vector, d, contains the singular values. The first matrix, u, contains the left singular vectors, and vcontains the right singular vectors. The left singular vectors represent the rows of the input table, and the right singular vectors represent their columns.

Tim includes R scripts to follow along, and for this topic I definitely recommend following along.

Comments closed

A New ODBC Package For R

David Smith looks at the odbc package in R:

The odbc package is a from-the-ground-up implementation of an ODBC interface for R that provides native support for additional data types (including dates, timestamps, raw binary, and 64-bit integers) and parameterized queries. The odbc package provides connections with any ODBC-compliant database, and has been comprehensively tested on SQL Server, PostgreSQL and MySQL. Benchmarks show that it’s also somewhat faster than RODBC: 3.2 times faster for reads, and 1.9 times faster for writes.

Sounds like odbc lets you run ad hoc queries and also lets you use dplyr as an ORM, similar to using Linq in C#.

Comments closed

sparklyr 0.6 Released

Javier Luraschi announces sparklyr 0.6:

We’re excited to announce a new release of the sparklyr package, available in CRAN today! sparklyr 0.6 introduces new features to:

  • Distribute R computations using spark_apply() to execute arbitrary R code across your Spark cluster. You can now use all of your favorite R packages and functions in a distributed context.

  • Connect to External Data Sources using spark_read_source()spark_write_source()spark_read_jdbc() and spark_write_jdbc().

  • Use the Latest Frameworks including dplyr 0.7DBI 0.7RStudio 1.1and Spark 2.2.

I’ve been impressed with sparklyr so far.

Comments closed