Press "Enter" to skip to content

Category: R

Fun With ML Services And VARBINARY

I wrap up my ML Services mini-series by building out a process to predict sales for multiple products using different models:

I have my model as an input and want to spit it out at the end as well. But when I try that, I get an error:

Msg 39017, Level 16, State 3, Line 239
Input data query returns column #1 of type ‘varbinary(max)’ which is not supported by the runtime for ‘R’ script. Unsupported types are binary, varbinary, timestamp, datetime2, datetimeoffset, time, text, ntext, image, hierarchyid, xml, sql_variant and user-defined type.

So there goes that plan—I can output a VARBINARY(MAX) model, but I cannot input one.

Click through to see my workaround.
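
For flavor, here is a rough R-side sketch (mine, not necessarily the workaround in the post): the model leaves R as a raw vector, which SQL Server stores as VARBINARY(MAX), and it has to come back into the R session by some route other than the input data query.

model <- lm(mpg ~ wt, data = mtcars)              # stand-in for whatever model the process trains
raw_model <- serialize(model, connection = NULL)  # raw vector; this is what lands in VARBINARY(MAX)

# Later, when those bytes return to R by some other route (e.g. a binary parameter),
# restore the original object and use it:
restored <- unserialize(raw_model)
predict(restored, newdata = data.frame(wt = 3))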

R Or Python

Tomaz Kastrun shares his thoughts on the topic of R versus Python:

Imagine I ask you, would you prefer Apple iPhone over Samsung Galaxy, respectively? Or if I would ask you, would you prefer BMW over Audi, respectively? In all the cases, both phones or both cars will get the job done. So will Python or R, R or Python. So instead of asking which one I prefer, ask yourself: which one suits my environment better? If your background is more statistics and less programming, take R; if you are more into programming and less into statistics, take Python; in both cases you will have a faster time to accomplish results with your preferred language. If you ask me, can I do gradient boosting or ANOVA or MDS in Python or in R, the answer will be yes, you can do both in any of the languages.

This graf hits the crux of my opinion on the topic, but as I’ve gone deeper into it over the past year, I think the correct answer is probably “both” for a mature organization and “pick the one which suits you better” for beginners.

Building ML Services Models

I have part two of my three-part series on SQL Server Machine Learning Services modeling:

We used sp_execute_external_script to build a model and generate some number of days worth of predictions.  Now I’d like to break that up into two operations:  training a model and generating predictions.  The biggest reason I might want to do this is if model generation is very expensive relative to prediction generation.  It’s obviously not very expensive to build a random number generator following a Poisson distribution, but imagine we had a complicated neural net that took hours or days to train—we’d want to use that model over and over, as long as the model came close enough to representing the underlying reality.

So now we’re going to create a model.

Click through to see a more complete example, something closer to production-ready.
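
As a rough R-only sketch of the split (the post wraps this in sp_execute_external_script; the helper functions and sample counts below are made up):

train_poisson <- function(counts) {
  model <- list(lambda = mean(counts))   # fitting a Poisson is just taking the sample mean
  serialize(model, connection = NULL)    # raw vector, ready to store as VARBINARY(MAX)
}

predict_poisson <- function(raw_model, days = 7) {
  model <- unserialize(raw_model)
  rpois(days, lambda = model$lambda)     # simulated demand for the requested number of days
}

raw_model <- train_poisson(c(3, 4, 2, 5, 3, 4))
predict_poisson(raw_model, days = 7)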

Including A Progress Bar In R

Peter Solymos has an update to his pbapply library:

The pbapply R package that adds a progress bar to vectorized functions has been known to accumulate overhead when calling parallel::mclapply with forking (see this post for more background on the issue). Strangely enough, a GitHub issue held the key to the solution that I am going to outline below. Long story short: forking is no longer expensive with pbapply, and as it turns out, it never was.

H/T R-Bloggers
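
For context, a minimal sketch of the pattern in question, assuming a Unix-alike where passing an integer as cl makes pblapply() fork workers via parallel::mclapply():

library(pbapply)

slow_square <- function(x) { Sys.sleep(0.1); x^2 }

res_serial <- pblapply(1:20, slow_square)           # sequential, with a progress bar
res_forked <- pblapply(1:20, slow_square, cl = 4L)  # forked across 4 cores, still with a progress bar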

Building A Model Using SQL Server ML Services

I have a post which shows how to build a simple R model to predict demand for an item:

I am a huge fan of the Poisson distribution.  It is special in that its one parameter (lambda) represents both the mean and the variance of the distribution.  At the limit, a Poisson distribution becomes normal.  But it’s most useful in helping us pattern infrequently-occurring events.  For example, selling 3-4 watches per day.

Estimating a Poisson is also easy in R:  lambda is simply the mean of your sample counts, and there is a function called rpois() which takes two parameters:  the number of events you want to generate and the value of lambda.

So what I want to do is take my data from SQL Server, feed it into R, and return back a prediction for the next seven days.

This was a simple post, but the next two in the series will expand upon it and build out a full implementation.
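
As a tiny illustration of the estimation step described above (the sales history is made up):

daily_sales <- c(3, 5, 2, 4, 3, 6, 4, 3)   # hypothetical units sold per day

lambda <- mean(daily_sales)                # the Poisson fit: lambda is the sample mean
next_week <- rpois(7, lambda = lambda)     # simulated demand for the next seven days
next_week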

Beware Multi-Assignment dplyr::mutate() Statements

John Mount hits on an issue when using dplyr backed by a database in R:

Notice the above gives an incorrect result: all of the x_i columns are identical, and all of the y_i columns are identical. I am not saying the above code is in any way desirable (though something like it does arise naturally in certain test designs). If this is truly “incorrect dplyr code” we should have seen an error or exception. Unless you can be certain you have no code like that in a database-backed dplyr project, you cannot be certain you have not run into the problem, producing silent data and result corruption.

The issue is: dplyr on databases does not seem to have strong enough order of assignment statement execution guarantees. The running counter “delta” is taking only one value for the entire lifetime of the dplyr::mutate() statement (which is clearly not what the user would want).

Read on for a couple of suggested solutions.
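
To make the shape of the problem concrete, here is a hedged sketch of the pattern (SQLite is used only because it is easy to spin up; whether the risky form misbehaves depends on the backend):

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d <- copy_to(con, data.frame(x = 1:5), "d")

# Risky: several dependent assignments inside a single mutate(); on some
# database backends the statements are not guaranteed to run in order.
risky <- d %>%
  mutate(delta = 1, x_1 = x + delta,
         delta = delta + 1, x_2 = x + delta)

# One commonly suggested fix: one assignment per mutate(), forcing sequencing.
safer <- d %>%
  mutate(delta = 1) %>%
  mutate(x_1 = x + delta) %>%
  mutate(delta = delta + 1) %>%
  mutate(x_2 = x + delta)

collect(safer)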

Microsoft R Open 3.4.3

David Smith announces Microsoft R Open 3.4.3:

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.4.3 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to the latest R (version 3.4.3) and updates the bundled packages (specifically: checkpoint, curl, doParallel, foreach, and iterators) to new versions.

MRO is 100% compatible with all R packages. MRO 3.4.3 points to a fixed CRAN snapshot taken on January 1, 2018, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the built-in checkpoint package to access packages from an earlier date (for reproducibility) or a later date (to access new and updated packages).

That brings Microsoft up to speed with base R.
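
If you have not used it, the checkpoint workflow mentioned above is short. A minimal sketch (the date here is just MRO 3.4.3's default snapshot; checkpoint() scans your project for the packages it needs):

library(checkpoint)
checkpoint("2018-01-01")   # pin package installs/loads to the CRAN snapshot of this date

# Any library() calls after this point resolve against the dated snapshot.
library(ggplot2)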

Setting Up SparklyR In Azure

David Smith shows how you can spin up a Spark cluster in Azure and install SparklyR on top of it:

The SparklyR package from RStudio provides a high-level interface to Spark from R. This means you can create R objects that point to data frames stored in the Spark cluster and apply some familiar R paradigms (like dplyr) to the data, all the while leveraging Spark’s distributed architecture without having to worry about memory limitations in R. You can also access the distributed machine-learning algorithms included in Spark directly from R functions.

If you don’t happen to have a cluster of Spark-enabled machines set up in a nearby well-ventilated closet, you can easily set one up in your favorite cloud service. For Azure, one option is to launch a Spark cluster in HDInsight, which also includes the extensions of Microsoft ML Server. While this service recently had a significant price reduction, it’s still more expensive than running a “vanilla” Spark-and-R cluster. If you’d like to take the vanilla route, a new guide details how to set up a Spark cluster on Azure for use with SparklyR.

Read on for more details.
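
As a taste of the workflow described above, here is a hedged sketch against a local Spark install; on an Azure cluster you would point spark_connect() at the cluster's master instead.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")        # swap in the cluster master on Azure
cars_tbl <- copy_to(sc, mtcars, "cars", overwrite = TRUE)

# Familiar dplyr verbs, executed inside Spark:
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# One of Spark's distributed ML algorithms, called from R:
ml_linear_regression(cars_tbl, mpg ~ wt + cyl)

spark_disconnect(sc)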

Zippy Base R

John Mount defends the honor of base R:

The graph summarizes the performance of four solutions to the “scoring logistic regression by hand” problem:

  • Optimized Base R: a specialized “pre allocate and work with vectorized indices” method. This is fast as it is able to express our particular task in a small number of purely base R vectorized operations. We are hoping to build some teaching materials about this methodology.

  • Idiomatic Base R (shown dashed): an idiomatic R method using stats::aggregate() to solve the problem. This method is re-plotted in both graphs as a dashed line and works as a good division between what is fast versus what is slow.

  • data.table: a straightforward data.table solution (another possible demarcation between fast and slow).

  • dplyr (no grouped filter): a dplyr solution (tuned to work around some known issues).

Read the whole thing, including the comments section, where there’s a good bit of helpful back-and-forth.
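
The post's benchmark task is more involved than this, but as a toy illustration of the idiomatic base R and data.table approaches named above:

library(data.table)

d <- data.frame(group = rep(letters[1:3], each = 4), value = 1:12)

# Idiomatic base R: stats::aggregate()
aggregate(value ~ group, data = d, FUN = sum)

# data.table equivalent
dt <- as.data.table(d)
dt[, .(value = sum(value)), by = group]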
