Survival Analysis

Joseph Rickert explains what survival analysis is and shows an example with R:

Looking at the Task View on a small screen is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. (I would be remiss not to mention that we all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.) Newcomers, people either new to R or new to survival analysis or both, must find it overwhelming. So, it is with newcomers in mind that I offer the following slim trajectory through the task view that relies on just a few packages: survival, KMsurv, Oisurv and ranger

The survival package, which began life as an S package in the late ’90s, is the cornerstone of the entire R Survival Analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R.

Survival analysis is an interesting field of study.  In engineering fields, the most common use is calculating mean time to failure, but that’s certainly not the only place you’re liable to see it.

Outliers In Histograms

Edwin Thoen has an interesting solution to a classic problem with histograms:

Two strategies that make the above into something more interpretable are taking the logarithm of the variable, or omitting the outliers. Both do not show the original distribution, however. Another way to go, is to create one bin for all the outlier values. This way we would see the original distribution where the density is the highest, while at the same time getting a feel for the number of outliers. A quick and dirty implementation of this would be

hist_data %>% mutate(x_new = ifelse(x > 10, 10, x)) %>% ggplot(aes(x_new)) + geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")

Edwin then shows a nicer solution, so read the whole thing.

The Birthday Problem

Mala Mahadevan explains the Birthday problem and demonstrates it with SQL and R:

Given a room of 23 random people, what are chances that two or more of them have the same birthday? 

This problem is a little different from the earlier ones, where we actually knew what the probability in each situation was.

What are chances that two people do NOT share the same birthday? Let us exclude leap years for now..chances that two people do not share the same birthday is 364/365, since one person’s birthday is already a given. In a group of 23 people, there are 253 possible pairs (23*22)/2. So the chances of no two people sharing a birthday is 364/365 multiplied 253 times. The chances of two people sharing a birthday, then, per basics of probability, is 1 – this.

The funny thing for me is that I’ve had the Birthday problem explained three separate times using as a demo the 20-30 people in the classroom.  In none of those three cases was there a match, so although I understand that it is correct and how it is correct, the 100% failure to replicate led a little nagging voice in the back of my mind to discount it.

R 3.4.0 Now Available

A new version of R is now available:

  • Accumulating vectors in a loop is faster – Assigning to an element of a vector beyond the current length now over-allocates by a small fraction. The new vector is marked internally as growable, and the true length of the new vector is stored in the truelength field. This makes building up a vector result by assigning to the next element beyond the current length more efficient, though pre-allocating is still preferred. The implementation is subject to change and not intended to be used in packages at this time.

There’s a big list of changes, so check it out and think about upgrading.

Building An Online R Training Environment

Steph Locke has shared how she put together a training lab for her R workshop:

This starts with the tidyverse & Rstudio then:

  • adds the requisite programs for dependencies in my package and whois for mkpasswd to be able to work

  • installs packages from github, notably the one designed to facilitate the day of text analysis

  • get the shell script and the csv from the gist

  • make the shell script executable and then run it

I loved the business card touch.  It’s easy enough to print out little strips of paper with the username and password, but this has a bit more staying power.

Logistic Regression In R

Steph Locke has a presentation on performing logistic regression using R:

Logistic regressions are a great tool for predicting outcomes that are categorical. They use a transformation function based on probability to perform a linear regression. This makes them easy to interpret and implement in other systems.

Logistic regressions can be used to perform a classification for things like determining whether someone needs to go for a biopsy. They can also be used for a more nuanced view by using the probabilities of an outcome for thinks like prioritising interventions based on likelihood to default on a loan.

It’s a good introduction to an important statistical method.

Feature Improvements In Microsoft R Server 9.1

David Smith gives us a nice roundup of feature improvements in Microsoft R Server 9.1:

Interoperability between Microsoft R Server and sparklyr. You can now use RStudio’s sparklyr package in tandem with Microsoft R Server in a single Spark session

New machine learning models in Hadoop and Spark. The new machine learning functions introduced with Version 9.0 (such as FastRank gradient-boosted trees and GPU-accelerated deep neural networks) are now available in the Hadoop and Spark contexts in addition to standalone servers and within SQL Server.

I have been looking forward to these.

SQL Server ML Services

SQL Server R Services is now SQL Server Machine Learning Services and supports Python.  First, Nagesh Pabbisetty and Sumit Kumar talk about Python support:

The addition of Python builds on the foundation laid for R Services in SQL Server 2016 and extends that mechanism to include Python support for in-database analytics and machine learning. We are renaming R Services to Machine Learning Services, and R and Python are two options under this feature.

The Python integration in SQL Server provides several advantages:

  • Elimination of data movement: You no longer need to move data from the database to your Python application or model. Instead, you can build Python applications in the database. This eliminates barriers of security, compliance, governance, integrity, and a host of similar issues related to moving vast amounts of data around. This new capability brings Python to the data and runs code inside secure SQL Server using the proven extensibility mechanism built in SQL Server 2016.

  • Easy deployment: Once you have the Python model ready, deploying it in production is now as easy as embedding it in a T-SQL script, and then any SQL client application can take advantage of Python-based models and intelligence by a simple stored procedure call.

  • Enterprise-grade performance and scale: You can use SQL Server’s advanced capabilities like in-memory table and column store indexes with the high-performance scalable APIs in RevoScalePy package. RevoScalePy is modeled after RevoScaleR package in SQL Server R Services. Using these with the latest innovations in the open source Python world allows you to bring unparalleled selection, performance, and scale to your SQL Python applications.

  • Rich extensibility: You can install and run any of the latest open source Python packages in SQL Server to build deep learning and AI applications on huge amounts of data in SQL Server. Installing a Python package in SQL Server is as simple as installing a Python package on your local machine.

  • Wide availability at no additional costs: Python integration is available in all editions of SQL Server 2017, including the Express edition.

Nagesh Pabbisetty also announces Microsoft R Server 9.1:

We took the first step with Microsoft R Server 9.0, and this follow on release includes significant innovations such as:

  • New machine learning enhancements and inclusion of pre-trained cognitive models such as sentiment analysis & image featurizers

  • SQL Server Machine Learning Services with integrated Python in Preview

  • Enterprise grade operationalization with real-time scoring and dynamic scaling of VMs

  • Deep customer & ISV partnerships to deliver the right solutions to customers

  • A panoply of sources to help you get started with ease

And Joseph Sirosh indicates that AI is where the money is:

So today it’s my pleasure to announce the first RDBMS with built-in AIa production-quality Community Technology Preview (CTP 2.0) of SQL Server 2017. In this preview release, we are introducing in-database support for a rich library of machine learning functions, and now for the first time Python support (in addition to R). SQL Server can also leverage NVIDIA GPU-accelerated computing through the Python/R interface to power even the most intensive deep-learning jobs on images, text, and other unstructured data. Developers can implement NVIDIA GPU-accelerated analytics and very sophisticated AI directly in the database server as stored procedures and gain orders of magnitude higher throughput. In addition, developers can use all the rich features of the database management system for concurrency, high-availability, encryption, security, and compliance to build and deploy robust enterprise-grade AI applications.

There’s a lot to digest here.

Temporal Tables For R Source Control

Tomaz Kastrun shares an unorthodox way of collecting historical R code changes:

I will not comment on the solution Bob provided, since I don’t know how their infrastructure, roles, security is set up. At this point, I am grateful for his comment. But what I will comment, is that there is no straightforward way or any out-of-the-box solution. Furthermore, if your R code requires any additional packages, storing the packages with your R code is not that bad idea, regardless of traffic or disk overhead. And versioning the R code is something that is for sure needed.

To continue from previous post, getting or capturing R code, once it gets to Launchpad, is tricky. So storing R code it in a database table or on file system seems a better idea.

It’s an interesting concept.  My preference is to use R Tools for Visual Studio and a more traditional source control mechanism.  It involves keeping source control up to date, but that’s a good practice to follow in any case.

Parameters In rmarkdown Reports

Steph Locke shows how to use table parameters in rmarkdown reports:

The recent(ish) advent of parameters in rmarkdown reports is pretty nifty but there’s a little bit of behaviour that can come in handy but doesn’t come across in the documentation. You can use table parameters for rmarkdown reports.

Previously, if you wanted to produce multiple reports based off a dataset, you would make the dataset available and then perform filtering in the report. Now we can pass the filtered data directly to the report, which keeps all the filtering logic in one place.

It’s actually super simple to add table parameters for rmarkdown reports.

Click through to see the script.  As promised, it is in fact easy to do.

Categories

April 2017
MTWTFSS
« Mar  
 12
3456789
10111213141516
17181920212223
24252627282930