Using mrsdeploy To Run R On Azure

Kevin Feasel


Cloud, R

John-Mark Agosta shows how to use mrsdeploy to send R batch jobs up to an Azure VM:

Alternately there are other Azure platforms for operationalization using R Server in the Marketplace, with other operating systems and platforms including HDInsight, Microsoft’s Hadoop offering. Or, equivalently one could use the Data Science VM available in the Marketplace, since it has a copy of R Server installed. Configuration of these platforms is similar to the example covered in this posting.

Provisioning an R Server VM, as reference in the documentation, takes a few steps that are detailed here, which consist of configuring the VM and setting up the server account to authorize remote access. To set up the server you’ll use the system account you set up as a user of the Linux machine. The server account is used for client interaction with the R Server, and should not be confused with the Linux system account. This is a major difference with the Windows version of the R Server VM that uses Active Directory services for authentication.

You can also use mrsdeploy to run batch jobs against Microsoft R Server on a local Hadoop cluster.

The Hive Metastore In HDInsight

Ashish Thapliyal shows how to create a custom Hive metastore in HDInsight:

Custom Metastore – HDInsight lets you pick custom Metastore. It’s a recommended approach for production clusters due to number reasons such as

  • You bring your own Azure SQL database as Metastore

  • As lifecycle of Metastore is not tied to a cluster lifecycle, you can create and delete clusters without worrying about the metadata loss.

  • Custom Metastore lets you attach multiple clusters and cluster types to same Metastore. Example – Single Metastore can be shared across Interactive Hive, Hive and Spark clusters in HDInsight

  • You pay for the cost of Metastore (Azure SQL DB)

Read on to see how to do this.

The Tidyverse Curse

Kevin Feasel



Bob Muenchen notes a structural conflict between R and its most common set of packages:

There’s a common theme in many of the sections above: a task that is hard to perform using base a R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that shows that it is the single most popular R package (as of 3/23/2017.) As much of a blessing as these commands are, they’re also a curse to beginners as they’re more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (i.e. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.

Interesting read.  As Bob notes in the comments, he’s still a fan of the tidyverse, but it’s important to recognize that there are pain points there.

Writing A Better Abstract

Kevin Feasel



Adnan Fiaz reviews conference abstracts for patterns:

Certainly an interesting graph! It may have been better to show the proportions instead of counts as the number of abstracts in each category are not equal. Nevertheless, the conclusion remains the same. The words “r” and “data” are clearly the most common. However, what is more interesting is that abstracts in the “yes” category use certain words significantly more often than abstracts in the “no” category and vice versa (more often because a missing bar doesn’t necessarily mean a zero observation). For example, the words “science”, “production” and “performance” occur more often in the “yes” category. Vice versa, the words “tools”, “product”, “package” and “company(ies)” occur more often in the “no” category. Also, the word “application” occurs in its singular form in the “no” category and in its plural form in the “yes” category. Certainly, at EARL we like our applications to be plural, it is in the name after all.

Granted, this is only abstracts for one conference, but it’s an interesting idea.

Memory-Optimized Clustered Columnstore Indexes

Ned Otter contrasts memory-optimized clustered columnstore indexes with traditional, disk-based clustered columnstore indexes:

5. Parallelism

On-disk: When you query an on-disk table that has a columnstore index, the database engine can use parallelism to process the results more quickly.

Memory-optimized: When you query a memory-optimized table that has a columnstore index, the database engine can use parallelism to process the results more quickly, BUT, that statement is only true if you use interop. Natively compiled modules are always executed serially.

Click through for the rest of the comparison points as well as a repro script.

Don’t Hard-Code Values

Kevin Feasel



Jana Sattainathan argues against hard-coding values in queries:

I have heard arguments for doing this type source code

  • This is a one-time thing. We do not have the need to do it anywhere else

  • We are on a deadline

  • We do not have the ability to test if this was not done this way

  • My program is going away in a week

  • We do not have the time to correct this

  • I am just following the existing pattern

  • Unofficially (not) said – “This is my job security”

I’m with Jana in principle, but there are performance costs at the margin, making this less of a hard-and-fast rule than I’d like.

Identity Integers And Columnstore

Niko Neugebauer looks at how clustered columnstore indexes handle identity integers with respect to data load times:

This blog post will try to respond this question from the perspective of the data loading performance.

For this research I decided to pick 3 distinct scenarios to investigate, which refer to different ways to approach the solution:
– a CCI table with an Identity column
– a CCI table with a Sequence as a default value
– a CCI table without Identity

There’s a pretty substantial performance difference, so this is well worth the read for large columnstore data loads.

Random Number Lists

Kevin Feasel



Kenneth Fisher one method of creating lists of random numbers:

I was working on a blog post this weekend that required a list of random numbers. Now, this isn’t exactly an uncommon task, but it’s not as easy as it would seem. The random number function RAND() has a few inherent flaws. And from what I can remember the random functions from most languages have the same issue.

First a few quotes from BOL about RAND()

Returns a pseudo-random float value from 0 through 1, exclusive.

Note: pseudo-random

If a seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same.

If you don’t specify the seed it gets selected at random. But that’s only once per query run, not once per row of the output.

Read on for Kenneth’s solution.


March 2017
« Feb Apr »