Category: R

I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.

Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. Unfortunately, many times, this is more challenging because the production system might demand that the entire end to end workflow is built exclusively in Python. That’s sometimes tough because there are aspects of statistical model building in R which are more intuitive than Python.

Python has many strengths, such as its robust data structures such as Dictionaries, compatibility with Deep Learning and Spark, and its ability to be a multipurpose language. However, many scenarios in enterprise analytics require people to go back to basic statistics and Machine Learning, which the classic Data Science packages in Python are not as intuitive as R for. The key difference is that many statistical methods are built into R natively. As a result, there is a gap for when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft, revoscalepy.

Having worked with both, my loyalties tend to lie with R for a couple of reasons. But this might help some people bridge the gap.

Comments closed

Using Keras To Predict Customer Churn

Published 2017-11-30 by Kevin Feasel

Matt Dancho has an example of building a neural net using Keras to predict customer churn:

Pro Tip: A quick test is to see if the log transformation increases the magnitude of the correlation between “TotalCharges” and “Churn”. We’ll use a few dplyr operations along with the corrr package to perform a quick correlation.

correlate(): Performs tidy correlations on numeric data
focus(): Similar to select(). Takes columns and focuses on only the rows/columns of importance.
fashion(): Makes the formatting aesthetically easier to read.

This is a very useful tutorial.

Comments closed

Installing SQL Server 2017 Machine Learning Services

Published 2017-11-29 by Kevin Feasel

Ginger Grant shows how to install SQL Server 2017 Machine Learning Services:

There are two installation options: In-Database or Standalone. If you are evaluating Machine Learning Services and you have no knowledge of what the load may be, start by selecting the Machine Learning Service In-Database. There are several reasons why by default you want to select the In-Database option. One of the problems that Microsoft was looking to solve by incorporating advanced data analytics was to improve performance of the native code by greatly reducing data latency. If you are analyzing a lot of data which is stored within SQL Server, the performance will be improved if the data does not need to be moved around on a network. Also, the licensing costs of installing R Server standalone also need to be evaluated with a Microsoft representative as well. An evaluation of the resource load on the network, as well as analysis of the code running on SQL Server should be performed prior to the decision to install the Machine Learning Server Standalone.

Read the whole thing.

Comments closed

A/B Testing With R

Published 2017-11-27 by Kevin Feasel

Mira Celine Klein shows how to compare two versions of a feature (or advertising campaign or whatever) to determine if one is better than the other:

In comparison to other methods, conducting an A/B test does not require extensive statistical knowledge. Nevertheless, some caveats have to be taken into account.

When making a statistical decision, there are two possible errors (see also table 1): A Type I error means that we observe a significant result although there is no real difference between our groups. A Type II error means that we do not observe a significant result although there is in fact a difference. The Type I error can be controlled and set to a fixed number in advance, e.g., at 5%, often denoted as α or the significance level. The Type II error in contrast cannot be controlled directly. It decreases with the sample size and the magnitude of the actual effect. When, for example, one of the designs performs way better than the other one, it’s more likely that the difference is actually detected by the test in comparison to a situation where there is only a small difference with respect to the target metric. Therefore, the required sample size can be computed in advance, given α and the minimum effect size you want to be able to detect (statistical power analysis). Knowing the average traffic on the website you can get a rough idea of the time you have to wait for the test to complete. Setting the rule for the end of the test in advance is often called “fixed-horizon testing”.

Click through for more, including a sample with code. H/T R-Bloggers

Comments closed

R Internals: Data Sizes With Nullable Columns

Published 2017-11-27 by Kevin Feasel

Niels Berglund digs into the Binary Exchange Langage (BXL) and notices something weird about data sizes:

When looking at the data sent, the size of the packages and “drilling” into the TCP packets we could deduct that: :

Each column has an over-head of 32 bytes (at least for non nullable data)
The size of the column in one row is the size of the data type for numeric types.
For decimal and numeric an extra byte is added to each column, where this byte indicates the precision.
Columns of alpha numeric type all had 2 bytes pre-pended to the bytes, except max types.
For char and nchar the storage size was 2 bytes plus the size the column was defined as.
For varchar and nvarchar the storage size was 2 bytes plus the size of the data stored.
For the varmax data types the number of bytes that were pre-pended varied dependent on the data size.

Read the whole thing.

Comments closed

Housing Prices In Ames, Iowa: A Kaggle Competition

Published 2017-11-21 by Kevin Feasel

Kathryn Bryant and M. Aaron Owen share their Kaggle experiences. First, Kathryn, et al:

The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and the results of our individual models alerted us to areas in data cleaning and feature engineering that needed improvement. We used root mean squared error (RMSE) of log Sale Price to evaluate model fit as this was the metric used by Kaggle to evaluate submitted models.

Data cleaning, EDA, feature engineering, and private train/test splitting (and one spline model!) were all done in R but we used Python for individual model training and ensembling/stacking. Using R and Python in these ways worked well, but the decision to split work in this manner was driven more by timing than anything else.

Then, Aaron, et al, share their process and findings:

Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable, Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors regarding the price of the house. Therefore, dropping these variables seems ill-advised.

Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage for the neighborhood in which the house with missing value was located.

This is the major upside to Kaggle: it gives you the ability to work in a controlled environment with real data sets, which include real data problems. Yeah, the data’s much cleaner than you’d experience in production pretty much anywhere, but that lets you practice technique with a relatively low barrier to entry. H/T R-Bloggers (Kathryn | Aaron)

Comments closed

Data Wrangling At Scale

Published 2017-11-21 by Kevin Feasel

John Mount has a short article showing off the cdata package:

Suppose we needed to un-pivot this data into a row oriented representation. Often big data transform steps can achieve a much higher degree of parallelization with “tall data”. With the cdata package this transform is easy and performant, as we show below.

Read the whole thing.

Comments closed

Docker And R

Published 2017-11-21 by Kevin Feasel

Mara Averick has some resources to help you get started with running R in a Docker container:

liftr 📦 by Nan Xiao

🐳📜 Docker your docs: “liftr: Containerize R Markdown Documents” by @road2stat https://t.co/nSN1ylZXsy #rstats #docker #rmarkdown pic.twitter.com/9xJQ4rANjy

— Mara Averick (@dataandme) October 15, 2017

liftr aims to solve the problem of persistent reproducible reporting. To achieve this goal, it extends the R Markdown metadata format, and uses Docker to containerize and render R Markdown documents.

Click through for those resources as well as an addictive 8-bit animated GIF.

Comments closed

Building Dynamic Row Headers With ML Services

Published 2017-11-17 by Kevin Feasel

Dave Mason tries to get around his RESULT SETS limitation when using SQL Server Machine Learning Services:

The columns in the data frame clearly have names, but SQL Server isn’t using them. The data frame columns have types in R too (more on this in a moment). Now that makes me wonder about the data types for the data returned by SQL. How is that determined? If SQL isn’t using the column names, can I assume it isn’t making use of the R column types either?

For a point of reference, let’s run some more R code to show the column names and types. As before, the rvest package is used to scrape a web page, with each HTML <table> found becoming a data frame in the “tables” list (line 3). A data frame of table metadata is created by calling data.frame(). The first parameter is a vector of column names (line 4), the second parameter is a vector of column classes (line 5), and the third parameter causes the row “names” to be incrementing digits (line 6).

This is a work in progress as Dave continues his series.

Comments closed

Defining Result Sets With ML Services

Published 2017-11-16 by Kevin Feasel

Dave Mason covers a pain point in SQL Server Machine Learning Services:

The example above is so simple, defining the RESULT SETS poses no problems. But what if the format of the output isn’t known at design time? R (or Python) might take the input data set and add, remove, or change columns conditionally. Further, the input data set might not even be known at design time. How would you define the RESULT SETS at run time?

WITH RESULT SETS needs a MAKE_A_GUESS or FIGURE_IT_OUT option. If there’s some other type of “easy button” for this, I haven’t found it.

It would be nice if the service could the ability to read the data frame columns and use those by default.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31