Press "Enter" to skip to content

Category: R

Parallelization With Rcpp

Blazej Moska demonstrates how to use Rcpp to parallelize R code:

One of the frustrating moments while working with data is when you need results urgently, but your dataset is large enough to make it impossible. This happens often when we need to use algorithm with high computational complexity. I will demonstrate it on the example I’ve been working with.

Suppose we have large dataset consisting of association rules. For some reasons we want to slim it down. Whenever two rules consequents are the same and one rule’s antecedent is a subset of second rule’s antecedent, we want to choose the smaller one (probability of obtaining smaller set is bigger than probability of obtaining bigger set).

Read the whole thing.

Comments closed

Plotting Graph Data In R

Sifiso Ndlovu shows how to take graph data from SQL Server and plot it in R using Machine Learning Services:

However, with recent focus on big data for many of my clients, we have experienced an increase in different business requests that requires for many-to-many data modelling. Consequently, as a Microsoft shop we’ve had to turn to other non-Microsoft products to ensure that we optimally respond to such business requests. Not surprisingly, ever since word got around that graph database will be part of SQL Server 2017, we’ve been looking forward to this latest release of SQL Server. Having played around with the graph database feature in SQL Server 2017, we have noticed that unlike other graph database vendors, plotting and visualising the data out of the graph database is not readily available in SQL Server 2017. Luckily, thanks to SQL Server R, you can easily plot and visualise SQL Server 2017 graph database data without turning to 3rd party plugins. In this article, I demonstrate how SQL Server Machine Learning Services (previously known as SQL Server 2016 R Services) can be used to plot a diagram according to the data defined in a SQL Server 2017 graph database.

The igraph library is a good one; there’s a lot of power in it that this post just introduces.

Comments closed

ML Services Can Fill The Plan Cache

I have a post talking about a bug in SQL Server:

For now, the workaround I have is to restart the SQL Server service occasionally. You can see that I have done it twice in the above screenshot. Our application is resilient to short database downtimes, so this isn’t a bad workaround for us; it’s just a little bit of an annoyance.

One thing to keep in mind if you are in this scenario is that if you are running ML Services hundreds of thousands of times a day, your ExtensibilityData folders might have a lot of cruft which may prevent the Launchpad service from starting as expected. I’ve had to delete all folders in \MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData\MSSQLSERVER01 after stopping the SQL Server service and before restarting it. The Launchpad service automatically does this, but if you have a huge number of folders in there, the service can time out trying to delete all of them.  In my experience at least, the other folders didn’t have enough sub-folders inside to make it worth deleting, but that may just be an artifact of how we use ML Services.

It’s very unlikely to affect most shops, as we only notice it after running sp_execute_external_script millions of times, and that’s pretty abnormal behavior.

Comments closed

Project-Oriented R Development

Jenny Bryan explains how building projects in R can reduce the likelihood that someone will come in and set your computer on fire:

I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work. I’m not assuming this is an RStudio Project, though this is a nice implementation discussed below.

Any resident R script is written assuming that it will be run from a fresh R process with working directory set to the project directory. It creates everything it needs, in its own workspace or folder, and it touches nothing it did not create. For example, it does not install additional packages (another pet peeve of mine).

This convention guarantees that the project can be moved around on your computer or onto other computers and will still “just work”. I argue that this is the only practical convention that creates reliable, polite behavior across different computers or users and over time. This convention is neither new, nor unique to R.

I admit that I’m just now getting into using projects regularly for my one-off stuff.  This is very good advice.  H/T David Smith

Comments closed

Scraping The PASS Budget

Steph Locke shows us how to scrape a PDF, specifically, the PASS operating budget:

With tabulizer, if the data is relatively well formatted in a PDF you can use tabulizer::extract_tables(). This gives you a bunch of data.frames which you can process. Unfortunately, in the case of the PASS budget with 22 pages of tables, including tables that span multiple pages, we’re not so lucky!

We need to fall back to tabulizer::extract_text() and do a lot of wrangling to reconstruct the tables.

Steph shows her work, so click through to see the scripts.

Comments closed

Outlier Detection With dplyr And ruler

Evgeni Chasnovski shows how to use a couple R packages in concert to find outliers:

During the process of data analysis one of the most crucial steps is to identify and account for outliers, observations that have essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define a notion of “outlier”. After that, it is straightforward to identify them based on given data.

There are many techniques developed for outlier detection. Majority of them deal with numerical data. This post will describe the most basic ones with their application using dplyrand ruler packages.

After reading this post you will know:

  • Most basic outlier detection techniques.

  • A way to implement them using dplyr and ruler.

  • A way to combine their results in order to obtain a new outlier detection method.

  • A way to discover notion of “diamond quality” without prior knowledge of this topic (as a happy consequence of previous point).

Read the whole thing.  H/T R-Bloggers

Comments closed

rquery: Relational Algebra In R

John Mount announces rquery:

rquery is Win-Vector LLC‘s currently in development big data query tool for R.

rquery supplies set of operators inspired by Edgar F. Codd‘s relational algebra (updated to reflect lessons learned from working with RSQL, and dplyr at big data scale in production).

As an example: rquery operators allow us to write our earlier “treatment and control” example as follows.

dQ <- d %.>%
  extend_se(.,
            if_else_block(
              testexpr =
                "rand()>=0.5",
              thenexprs = qae(
                a_1 := 'treatment',
                a_2 := 'control'),
              elseexprs = qae(
                a_1 := 'control',
                a_2 := 'treatment'))) %.>%
  select_columns(., c("rowNum", "a_1", "a_2"))

It’s an interesting idea.

Comments closed

Getting Started With dplyr

Abdul Majed Raja has a dplyr tutorial:

dplyr is one of the most popular r-packages and also part of tidyverse that’s been developed by Hadley Wickham. The mere fact that dplyr package is very famous means, it’s one of the most frequently used. Being a data scientist is not always about creating sophisticated models but Data Analysis (Manipulation) and Data Visualization play a very important role in BAU of many us – in fact, a very important part before any modeling exercise since Feature Engineering and EDA are the most important differentiating factors of your model and someone else’s.
Hence, this post aims to bring out some well-known and not-so-well-known applications of dplyr so that any data analyst could leverage its potential using a much familiar – Titanic Dataset.

This covers the main pieces  of dplyr, including its pipeline.  dplyr is a key part of the tidyverse, and knowing it well makes R so much easier.  H/T R-Bloggers

Comments closed

The ggplot2 Books

Hadley Wickham has a couple of books which teach a lot about ggplot2.  The first book I’d recommend is his and Garrett Grolemund’s R For Data Science book, which is available for free online:

To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.

The colors reveal that many of the unusual points are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

Wickham also has the source to build his ggplot2 book online.  If you don’t want to build the source, you also have the option of buying the book.

Comments closed