
Category: R

R 3.4.0 Now Available

A new version of R is now available:

  • Accumulating vectors in a loop is faster – Assigning to an element of a vector beyond the current length now over-allocates by a small fraction. The new vector is marked internally as growable, and the true length of the new vector is stored in the truelength field. This makes building up a vector result by assigning to the next element beyond the current length more efficient, though pre-allocating is still preferred. The implementation is subject to change and not intended to be used in packages at this time.

There’s a big list of changes, so check it out and think about upgrading.
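To make the vector-growth item concrete, here is a minimal sketch comparing the two patterns the release note mentions: assigning one element past the current length versus pre-allocating. The function names `grow` and `prealloc` are made up for this illustration, and timings depend on your machine and R version, so none are quoted.

```r
# Build the squares of 1..n two ways and confirm they agree.
n <- 1e5

# 1. Grow the vector by assigning one element past its current length.
grow <- function(n) {
  x <- numeric(0)
  for (i in seq_len(n)) {
    x[i] <- i^2
  }
  x
}

# 2. Pre-allocate the full length, then fill in place.
prealloc <- function(n) {
  x <- numeric(n)
  for (i in seq_len(n)) {
    x[i] <- i^2
  }
  x
}

grown <- grow(n)
filled <- prealloc(n)
same_result <- identical(grown, filled)

# Elapsed seconds for each approach on this machine.
timings <- c(
  grow = system.time(grow(n))[["elapsed"]],
  prealloc = system.time(prealloc(n))[["elapsed"]]
)
print(timings)
```

Both functions return the same vector; only the allocation pattern differs, which is the part R 3.4.0 changes.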


Building An Online R Training Environment

Steph Locke has shared how she put together a training lab for her R workshop:

This starts with the tidyverse & RStudio, then:

  • adds the requisite programs for dependencies in my package and whois for mkpasswd to be able to work

  • installs packages from github, notably the one designed to facilitate the day of text analysis

  • gets the shell script and the csv from the gist

  • makes the shell script executable and then runs it

I loved the business card touch.  It’s easy enough to print out little strips of paper with the username and password, but this has a bit more staying power.


Logistic Regression In R

Steph Locke has a presentation on performing logistic regression using R:

Logistic regressions are a great tool for predicting outcomes that are categorical. They use a transformation function based on probability to perform a linear regression. This makes them easy to interpret and implement in other systems.

Logistic regressions can be used to perform a classification for things like determining whether someone needs to go for a biopsy. They can also be used for a more nuanced view by using the probabilities of an outcome for things like prioritising interventions based on likelihood to default on a loan.

It’s a good introduction to an important statistical method.
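As a rough illustration of the two uses described above (a hard classification and a ranking by probability), here is a minimal sketch with base R's `glm`. It is not taken from the presentation: the built-in `mtcars` data, the `am ~ wt` model, and the 0.5 cut-off are all assumptions chosen to keep the example self-contained.

```r
# mtcars ships with R; am is 0 (automatic) or 1 (manual transmission).
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Linear predictor on the log-odds scale.
log_odds <- predict(fit, type = "link")

# Predicted probabilities of am == 1.
prob <- predict(fit, type = "response")

# The inverse logit (plogis) is the transformation linking the two scales.
links_agree <- isTRUE(all.equal(unname(prob), unname(plogis(log_odds))))

# Use 1: classify with a hypothetical 0.5 threshold.
predicted_class <- as.integer(prob >= 0.5)

# Use 2: rank observations by predicted probability, highest first.
priority <- order(prob, decreasing = TRUE)
print(head(mtcars[priority, c("wt", "am")]))
```

The threshold is a business decision rather than part of the model, which is why the probabilities are often more useful than the class labels.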


Feature Improvements In Microsoft R Server 9.1

David Smith gives us a nice roundup of feature improvements in Microsoft R Server 9.1:

Interoperability between Microsoft R Server and sparklyr. You can now use RStudio’s sparklyr package in tandem with Microsoft R Server in a single Spark session.

New machine learning models in Hadoop and Spark. The new machine learning functions introduced with Version 9.0 (such as FastRank gradient-boosted trees and GPU-accelerated deep neural networks) are now available in the Hadoop and Spark contexts in addition to standalone servers and within SQL Server.

I have been looking forward to these.


SQL Server ML Services

SQL Server R Services is now SQL Server Machine Learning Services and supports Python.  First, Nagesh Pabbisetty and Sumit Kumar talk about Python support:

The addition of Python builds on the foundation laid for R Services in SQL Server 2016 and extends that mechanism to include Python support for in-database analytics and machine learning. We are renaming R Services to Machine Learning Services, and R and Python are two options under this feature.

The Python integration in SQL Server provides several advantages:

  • Elimination of data movement: You no longer need to move data from the database to your Python application or model. Instead, you can build Python applications in the database. This eliminates barriers of security, compliance, governance, integrity, and a host of similar issues related to moving vast amounts of data around. This new capability brings Python to the data and runs code inside secure SQL Server using the proven extensibility mechanism built in SQL Server 2016.

  • Easy deployment: Once you have the Python model ready, deploying it in production is now as easy as embedding it in a T-SQL script, and then any SQL client application can take advantage of Python-based models and intelligence by a simple stored procedure call.

  • Enterprise-grade performance and scale: You can use SQL Server’s advanced capabilities like in-memory table and column store indexes with the high-performance scalable APIs in RevoScalePy package. RevoScalePy is modeled after RevoScaleR package in SQL Server R Services. Using these with the latest innovations in the open source Python world allows you to bring unparalleled selection, performance, and scale to your SQL Python applications.

  • Rich extensibility: You can install and run any of the latest open source Python packages in SQL Server to build deep learning and AI applications on huge amounts of data in SQL Server. Installing a Python package in SQL Server is as simple as installing a Python package on your local machine.

  • Wide availability at no additional costs: Python integration is available in all editions of SQL Server 2017, including the Express edition.

Nagesh Pabbisetty also announces Microsoft R Server 9.1:

We took the first step with Microsoft R Server 9.0, and this follow on release includes significant innovations such as:

  • New machine learning enhancements and inclusion of pre-trained cognitive models such as sentiment analysis & image featurizers

  • SQL Server Machine Learning Services with integrated Python in Preview

  • Enterprise grade operationalization with real-time scoring and dynamic scaling of VMs

  • Deep customer & ISV partnerships to deliver the right solutions to customers

  • A panoply of sources to help you get started with ease

And Joseph Sirosh indicates that AI is where the money is:

So today it’s my pleasure to announce the first RDBMS with built-in AI: a production-quality Community Technology Preview (CTP 2.0) of SQL Server 2017. In this preview release, we are introducing in-database support for a rich library of machine learning functions, and now for the first time Python support (in addition to R). SQL Server can also leverage NVIDIA GPU-accelerated computing through the Python/R interface to power even the most intensive deep-learning jobs on images, text, and other unstructured data. Developers can implement NVIDIA GPU-accelerated analytics and very sophisticated AI directly in the database server as stored procedures and gain orders of magnitude higher throughput. In addition, developers can use all the rich features of the database management system for concurrency, high-availability, encryption, security, and compliance to build and deploy robust enterprise-grade AI applications.

There’s a lot to digest here.


Temporal Tables For R Source Control

Tomaz Kastrun shares an unorthodox way of collecting historical R code changes:

I will not comment on the solution Bob provided, since I don’t know how their infrastructure, roles, and security are set up. At this point, I am grateful for his comment. What I will say is that there is no straightforward way or out-of-the-box solution. Furthermore, if your R code requires any additional packages, storing the packages with your R code is not a bad idea, regardless of traffic or disk overhead. And versioning the R code is something that is for sure needed.

To continue from the previous post, getting or capturing R code once it gets to Launchpad is tricky. So storing the R code in a database table or on the file system seems a better idea.

It’s an interesting concept.  My preference is to use R Tools for Visual Studio and a more traditional source control mechanism.  It involves keeping source control up to date, but that’s a good practice to follow in any case.


Parameters In rmarkdown Reports

Steph Locke shows how to use table parameters in rmarkdown reports:

The recent(ish) advent of parameters in rmarkdown reports is pretty nifty but there’s a little bit of behaviour that can come in handy but doesn’t come across in the documentation. You can use table parameters for rmarkdown reports.

Previously, if you wanted to produce multiple reports based off a dataset, you would make the dataset available and then perform filtering in the report. Now we can pass the filtered data directly to the report, which keeps all the filtering logic in one place.

It’s actually super simple to add table parameters for rmarkdown reports.

Click through to see the script.  As promised, it is in fact easy to do.
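For a sense of what passing a table as a parameter can look like, here is a minimal sketch rather than the script from the post. The parameter name `report_data`, the `mtcars` split by cylinder count, and the output file names are all hypothetical; the rendering step only runs if the rmarkdown package and pandoc are available.

```r
# Write a tiny parameterised report to a temporary file.
# `report_data` is a hypothetical parameter name chosen for this sketch.
fence <- strrep("`", 3)
template <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Cylinder report\"",
  "output: md_document",
  "params:",
  "  report_data: !r mtcars",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(params$report_data$mpg)",
  fence
), template)

# One data frame per cylinder count; the filtering lives here, not in the report.
subsets <- split(mtcars, mtcars$cyl)

# Only render when rmarkdown and pandoc are actually installed.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  for (cyl in names(subsets)) {
    rmarkdown::render(
      template,
      output_file = paste0("cyl-", cyl, ".md"),
      output_dir = tempdir(),
      params = list(report_data = subsets[[cyl]]),
      quiet = TRUE
    )
  }
}
```

The report itself never filters anything: it just summarises whatever table it is handed, which is the point the excerpt makes about keeping the filtering logic in one place.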


Generating Homoglyphs In R

Bob Rudis shows how to create homoglyphs (character sequences which look similar to other character sequences) using a few R packages:

We can try it out with a very familiar domain:

(converted <- to_homoglyph("google.com"))
## [1] "ƍ၀໐|.com"

Now, that’s using all possible homoglyphs and it might not look like google.com to you, but imagine whittling down the list to ones that are really close to Latin character set matches. Or, imagine you’re in a hurry and see that version of Google’s URL with a shiny, green lock icon from Let’s Encrypt. You might not really give it a second thought if the page looked fine (or were on a mobile browser without a location bar showing).

Click through for more details, as well as information on punycode.


When Binomials Converge

Mala Mahadevan shows an example of the central limit theorem in action, as a binomial distribution with a large enough number of trials is well approximated by the normal:

An easier way to do it is to use the normal distribution, or central limit theorem. My post on the theorem illustrates that a sample will follow a normal distribution if the sample size is large enough. We will use that, as well as the rules around determining probabilities in a normal distribution, to arrive at the probability in this case.

Problem: I have a group of 100 friends who are smokers.  The probability of a random smoker having lung disease is 0.3. What are the chances that a maximum of 35 people wind up with lung disease?

Click through for the example.
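The problem as stated can be checked in a few lines of base R. This is a sketch of the general technique, not the walkthrough from the post: it compares the exact binomial probability with the normal approximation, and the continuity correction (evaluating at 35.5 rather than 35) is an addition of mine that the excerpt does not mention.

```r
n <- 100  # friends who smoke
p <- 0.3  # probability of lung disease for each
k <- 35   # "a maximum of 35", so we want P(X <= 35)

# Exact answer from the binomial distribution.
exact <- pbinom(k, size = n, prob = p)

# Normal approximation with matching mean and standard deviation.
mu <- n * p
sigma <- sqrt(n * p * (1 - p))
approx_plain <- pnorm(k, mean = mu, sd = sigma)

# Same approximation with a continuity correction (an assumption of this sketch).
approx_cc <- pnorm(k + 0.5, mean = mu, sd = sigma)

print(round(c(exact = exact, normal = approx_plain, normal_cc = approx_cc), 4))
```

With 100 trials the corrected approximation lands very close to the exact value, which is the convergence the post title refers to.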


Logistic Regression With R

Raghavan Madabusi runs through a sample logistic regression:

Input Variables: These variables are called predictors or independent variables.

  • Customer Demographics (Gender and Senior citizenship)
  • Billing Information (Monthly and Annual charges, Payment method)
  • Product Services (Multiple line, Online security, Streaming TV, Streaming Movies, and so on)
  • Customer relationship variables (Tenure and Contract period)

Output Variables: These variables are called response or dependent variables. Since the output variable (Churn value) takes the binary form “0” or “1”, it is categorized as a classification problem in supervised machine learning.

One of the interesting things in this post was the use of missmap, which is part of Amelia.
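If you want a quick look at missingness before reaching for a plot, a minimal sketch in base R follows. It uses the built-in `airquality` data as a stand-in, since the churn data from the post is not reproduced here, and it only calls `Amelia::missmap` when the Amelia package happens to be installed.

```r
# airquality ships with R and has missing values in two columns.
missing_per_column <- colSums(is.na(airquality))
print(missing_per_column)

# Share of rows with no missing values at all.
complete_share <- mean(complete.cases(airquality))
print(complete_share)

# Draw the missingness map only if Amelia is available.
if (requireNamespace("Amelia", quietly = TRUE)) {
  Amelia::missmap(airquality)
}
```

The counts tell you which columns have gaps; the map adds where in the data set those gaps fall.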
