Category: R

Building ML Services Models

I have part two of my three-part series on SQL Server Machine Learning Services modeling:

We used sp_execute_external_script to build a model and generate some number of days' worth of predictions.  Now I’d like to break that up into two operations:  training a model and generating predictions.  The biggest reason I might want to do this is if model generation is very expensive relative to prediction generation.  It’s obviously not very expensive to build a random number generator following a Poisson distribution, but imagine we had a complicated neural net that took hours or days to train—we’d want to use that model over and over, as long as the model came close enough to representing the underlying reality.

So now we’re going to create a model.

Click through to see a more complete example, something closer to production-ready.
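
The post itself walks through the T-SQL plumbing around sp_execute_external_script; as a rough illustration of the train-once, predict-many split, here is a minimal R sketch (the sales data and storage details are my assumptions, not the post's code):

```r
# A minimal sketch of the two-step idea (not the post's exact code): train
# once, serialize the model to a raw vector (storable in a SQL Server
# varbinary(max) column), then rehydrate it later for predictions.
daily_sales <- c(3, 5, 2, 4, 3, 6, 2)            # hypothetical sales counts

# Step 1: "train" -- for a Poisson, just estimate lambda -- and serialize
model   <- list(lambda = mean(daily_sales))
payload <- serialize(model, connection = NULL)   # raw vector for storage

# Step 2 (possibly much later): deserialize and generate predictions
trained <- unserialize(payload)
set.seed(2018)
rpois(7, lambda = trained$lambda)                # next seven days' demand
```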

Including A Progress Bar In R

Peter Solymos has an update to his pbapply library:

The pbapply R package that adds a progress bar to vectorized functions has been known to accumulate overhead when calling parallel::mclapply with forking (see this post for more background on the issue). Strangely enough, a GitHub issue held the key to the solution that I am going to outline below. Long story short: forking is no longer expensive with pbapply, and as it turns out, it never was.
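
For context, this is the kind of call in question; passing an integer cl to pblapply() requests forked workers via parallel::mclapply (a hypothetical example, not the post's benchmark):

```r
# A small illustration: a progress bar over forked workers. An integer cl
# uses parallel::mclapply, so this forks on Unix-alikes; forking is not
# available on Windows.
library(pbapply)
res <- pblapply(1:200, function(i) {
  Sys.sleep(0.01)   # stand-in for real work
  i^2
}, cl = 2L)
```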

H/T R-Bloggers

Building A Model Using SQL Server ML Services

I have a post which shows how to build a simple R model to predict demand for an item:

I am a huge fan of the Poisson distribution.  It is special in that its one parameter (lambda) represents both the mean and the variance of the distribution.  At the limit, a Poisson distribution becomes normal.  But it’s most useful in helping us model infrequently-occurring events.  For example, selling 3-4 watches per day.

Estimating a Poisson is also easy in R:  lambda is simply the mean of your sample counts, and there is a function called rpois() which takes two parameters:  the number of events you want to generate and the value of lambda.

So what I want to do is take my data from SQL Server, feed it into R, and return back a prediction for the next seven days.
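
The estimation step really is just a couple of lines of R; a quick sketch with made-up sample counts:

```r
# A minimal sketch of the approach (the sample counts are invented):
daily_sales <- c(3, 4, 2, 5, 3, 4, 3, 2, 4, 5)  # e.g., watches sold per day
lambda <- mean(daily_sales)                      # the Poisson estimate
set.seed(42)
rpois(7, lambda = lambda)                        # simulated demand, next 7 days
```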

This was a simple post, but the next two in the series will expand upon it and build out a full implementation.

Beware Multi-Assignment dplyr::mutate() Statements

John Mount hits on an issue when using dplyr backed by a database in R:

Notice the above gives an incorrect result: all of the x_i columns are identical, and all of the y_i columns are identical. I am not saying the above code is in any way desirable (though something like it does arise naturally in certain test designs). If this is truly “incorrect dplyr code” we should have seen an error or exception. Unless you can be certain you have no code like that in a database-backed dplyr project, you cannot be certain you have not run into the problem producing silent data and result corruption.

The issue is: dplyr on databases does not seem to have strong enough order of assignment statement execution guarantees. The running counter “delta” is taking only one value for the entire lifetime of the dplyr::mutate() statement (which is clearly not what the user would want).
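
One common workaround (not necessarily among the ones the post suggests) is to split dependent assignments into separate mutate() calls, so each step becomes its own layer of generated SQL. A hypothetical sketch against an in-memory SQLite table:

```r
# Hypothetical sketch: dependent assignments inside one mutate() versus
# one assignment per mutate(). Requires dplyr, dbplyr, and RSQLite.
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d   <- copy_to(con, data.frame(a = 1:3), "d")

# Risky on a database backend: x2 refers to x1 within the same mutate()
risky <- d %>% mutate(x1 = a + 1, x2 = x1 + 1)

# Safer: one dependent assignment per mutate() call
safer <- d %>% mutate(x1 = a + 1) %>% mutate(x2 = x1 + 1)

collect(safer)
DBI::dbDisconnect(con)
```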

Read on for a couple of suggested solutions.

Microsoft R Open 3.4.3

David Smith announces Microsoft R Open 3.4.3:

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.4.3 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to the latest R (version 3.4.3) and updates the bundled packages (specifically: checkpoint, curl, doParallel, foreach, and iterators) to new versions.

MRO is 100% compatible with all R packages. MRO 3.4.3 points to a fixed CRAN snapshot taken on January 1, 2018, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the built-in checkpoint package to access packages from an earlier date (for reproducibility) or a later date (to access new and updated packages).

That brings Microsoft up to speed with base R.
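
If you haven't used checkpoint before, the basic call is a one-liner (the date here is illustrative):

```r
# A minimal illustration of pinning package versions to a CRAN snapshot:
library(checkpoint)
checkpoint("2018-01-01")   # install/load packages as they existed that day
```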

Setting Up SparklyR In Azure

David Smith shows how you can spin up a Spark cluster in Azure and install SparklyR on top of it:

The SparklyR package from RStudio provides a high-level interface to Spark from R. This means you can create R objects that point to data frames stored in the Spark cluster and apply some familiar R paradigms (like dplyr) to the data, all the while leveraging Spark’s distributed architecture without having to worry about memory limitations in R. You can also access the distributed machine-learning algorithms included in Spark directly from R functions.

If you don’t happen to have a cluster of Spark-enabled machines set up in a nearby well-ventilated closet, you can easily set one up in your favorite cloud service. For Azure, one option is to launch a Spark cluster in HDInsight, which also includes the extensions of Microsoft ML Server. While this service recently had a significant price reduction, it’s still more expensive than running a “vanilla” Spark-and-R cluster. If you’d like to take the vanilla route, a new guide details how to set up a Spark cluster on Azure for use with SparklyR.
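
Once a cluster (or a local Spark install, for testing) is available, connecting from R looks roughly like this; the sketch below assumes local-mode Spark purely for illustration:

```r
# A minimal sketch: connect to Spark, push a data frame into the cluster,
# and run dplyr verbs against it (local mode here; on Azure you would
# point master at the cluster instead).
library(sparklyr)
library(dplyr)

sc         <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")   # data lives in Spark, not R

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()                                   # pull the small result into R

spark_disconnect(sc)
```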

Read on for more details.

Zippy Base R

John Mount defends the honor of base R:

The graph summarizes the performance of four solutions to the “scoring logistic regression by hand” problem:

  • Optimized Base R: a specialized “pre-allocate and work with vectorized indices” method. This is fast as it is able to express our particular task in a small number of purely base R vectorized operations. We are hoping to build some teaching materials about this methodology. [A toy sketch of this style appears after the list.]

  • Idiomatic Base R (shown dashed): an idiomatic R method using stats::aggregate() to solve the problem. This method is re-plotted in both graphs as a dashed line and works as a good division between what is fast versus what is slow.

  • data.table: a straightforward data.table solution (another possible demarcation between fast and slow).

  • dplyr (no grouped filter): a dplyr solution (tuned to work around some known issues).
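
To make the “work with vectorized indices” idea concrete, here is a toy sketch of the style (my own illustration, not the post's benchmark code):

```r
# Toy illustration: per-group maxima with no loops, via one global sort
# plus a vectorized "last row of each group" mask.
set.seed(2018)
d <- data.frame(group = sample(1e4, 1e5, replace = TRUE),
                score = runif(1e5))

ds      <- d[order(d$group, d$score), ]       # sort once
is_last <- ds$group != c(ds$group[-1], -1L)   # TRUE at each group's end
maxima  <- ds[is_last, ]                      # per-group max, vectorized
```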

Read the whole thing, including the comments section, where there’s a good bit of helpful back-and-forth.

Non-English Natural Language Processing

The folks at BNOSAC have announced a new natural language processing toolkit for R:

BNOSAC is happy to announce the release of the udpipe R package (https://bnosac.github.io/udpipe/en) which is a Natural Language Processing toolkit that provides language-agnostic ‘tokenization’, ‘parts of speech tagging’, ‘lemmatization’, ‘morphological feature tagging’ and ‘dependency parsing’ of raw text. Next to text parsing, the package also allows you to train annotation models based on data of ‘treebanks’ in ‘CoNLL-U’ format as provided at http://universaldependencies.org/format.html.

The package provides direct access to language models trained on more than 50 languages.
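
Basic usage looks roughly like this (the English model downloads at run time; the sentence is just an example):

```r
# A minimal sketch of annotating text with udpipe (the model file is
# downloaded to the working directory on first run):
library(udpipe)
dl    <- udpipe_download_model(language = "english")
model <- udpipe_load_model(dl$file_model)
anno  <- udpipe_annotate(model, x = "The quick brown fox jumps over the lazy dog.")
head(as.data.frame(anno))   # tokens, lemmas, POS tags, dependency relations
```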

Click through to check it out.

Fun With Tibbles

Theo Roe provides an introduction to tibbles in R:

Tibbles are a modern take on data frames, but crucially they are still data frames. Well, what’s the difference then? There’s a quote I found somewhere on the internet that describes the difference quite well:

“keeping what time has proven to be effective, and throwing out what is not”.

Basically, some clever people took the classic data.frame(), shook it until the ineffective parts fell out, then added some new, more appropriate features.
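
A quick hypothetical example of the difference in action:

```r
# A small illustration of tibble versus data.frame behavior:
library(tibble)

df <- data.frame(x = 1:3, txt = c("a", "b", "c"))
tb <- tibble(x = 1:3, txt = c("a", "b", "c"))

df[, "x"]   # a data.frame drops a single column to a bare vector
tb[, "x"]   # a tibble stays a tibble
tb          # printing shows column types and truncates to fit the screen
```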

I probably don’t do enough with tibbles, but the upside is that in most cases, there’s a smooth transition.
