R – Page 114 – Curated SQL

Housing Prices In Ames, Iowa: A Kaggle Competition

Published 2017-11-21 by Kevin Feasel

Kathryn Bryant and M. Aaron Owen share their Kaggle experiences. First, Kathryn, et al:

The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and the results of our individual models alerted us to areas in data cleaning and feature engineering that needed improvement. We used root mean squared error (RMSE) of log Sale Price to evaluate model fit as this was the metric used by Kaggle to evaluate submitted models.

Data cleaning, EDA, feature engineering, and private train/test splitting (and one spline model!) were all done in R but we used Python for individual model training and ensembling/stacking. Using R and Python in these ways worked well, but the decision to split work in this manner was driven more by timing than anything else.

Then, Aaron, et al, share their process and findings:

Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable, Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors regarding the price of the house. Therefore, dropping these variables seems ill-advised.

Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage for the neighborhood in which the house with missing value was located.

This is the major upside to Kaggle: it gives you the ability to work in a controlled environment with real data sets, which include real data problems. Yeah, the data’s much cleaner than you’d experience in production pretty much anywhere, but that lets you practice technique with a relatively low barrier to entry. H/T R-Bloggers (Kathryn | Aaron)

Comments closed

Data Wrangling At Scale

Published 2017-11-21 by Kevin Feasel

John Mount has a short article showing off the cdata package:

Suppose we needed to un-pivot this data into a row oriented representation. Often big data transform steps can achieve a much higher degree of parallelization with “tall data”. With the cdata package this transform is easy and performant, as we show below.

Read the whole thing.

Comments closed

Docker And R

Published 2017-11-21 by Kevin Feasel

Mara Averick has some resources to help you get started with running R in a Docker container:

liftr 📦 by Nan Xiao

🐳📜 Docker your docs: “liftr: Containerize R Markdown Documents” by @road2stat https://t.co/nSN1ylZXsy #rstats #docker #rmarkdown pic.twitter.com/9xJQ4rANjy

— Mara Averick (@dataandme) October 15, 2017

liftr aims to solve the problem of persistent reproducible reporting. To achieve this goal, it extends the R Markdown metadata format, and uses Docker to containerize and render R Markdown documents.

Click through for those resources as well as an addictive 8-bit animated GIF.

Comments closed

Building Dynamic Row Headers With ML Services

Published 2017-11-17 by Kevin Feasel

Dave Mason tries to get around his RESULT SETS limitation when using SQL Server Machine Learning Services:

The columns in the data frame clearly have names, but SQL Server isn’t using them. The data frame columns have types in R too (more on this in a moment). Now that makes me wonder about the data types for the data returned by SQL. How is that determined? If SQL isn’t using the column names, can I assume it isn’t making use of the R column types either?

For a point of reference, let’s run some more R code to show the column names and types. As before, the rvest package is used to scrape a web page, with each HTML <table> found becoming a data frame in the “tables” list (line 3). A data frame of table metadata is created by calling data.frame(). The first parameter is a vector of column names (line 4), the second parameter is a vector of column classes (line 5), and the third parameter causes the row “names” to be incrementing digits (line 6).

This is a work in progress as Dave continues his series.

Comments closed

Defining Result Sets With ML Services

Published 2017-11-16 by Kevin Feasel

Dave Mason covers a pain point in SQL Server Machine Learning Services:

The example above is so simple, defining the RESULT SETS poses no problems. But what if the format of the output isn’t known at design time? R (or Python) might take the input data set and add, remove, or change columns conditionally. Further, the input data set might not even be known at design time. How would you define the RESULT SETS at run time?

WITH RESULT SETS needs a MAKE_A_GUESS or FIGURE_IT_OUT option. If there’s some other type of “easy button” for this, I haven’t found it.

It would be nice if the service could the ability to read the data frame columns and use those by default.

Comments closed

Interpreting P-Value Histograms

Published 2017-11-15 by Kevin Feasel

David Robinson visualizes and interprets different p-value histograms:

So you’re a scientist or data analyst, and you have a little experience interpreting p-values from statistical tests. But then you come across a case where you have hundreds, thousands, or even millions of p-values. Perhaps you ran a statistical test on each gene in an organism, or on demographics within each of hundreds of counties. You might have heard about the dangers of multiple hypothesis testing before. What’s the first thing you do?

Make a histogram of your p-values. Do this before you perform multiple hypothesis test correction, false discovery rate control, or any other means of interpreting your many p-values. Unfortunately, for some reason, this basic and simple task rarely gets recommended (for instance, the Wikipedia page on the multiple comparisons problem never once mentions this approach). This graph lets you get an immediate sense of how your test behaved across all your hypotheses, and immediately diagnose some potential problems. Here, I’ll walk you through a basic example of interpreting a p-value histogram.

It’s a fun read and informative as well.

Comments closed

A Compendium Of R Errors

Published 2017-11-15 by Kevin Feasel

Sumendar Karupakala has a bunch of errors you might find in R, as well as their explanations and fixes:

#pull out the animals which are dogs

animaldata[animaldata$Animal.Type == “Dog” ] # throuws an error

Error in `[.data.frame`(animaldata, animaldata$Animal.Type == “Dog”): undefined columns selected

Traceback:

1. animaldata[animaldata$Animal.Type == “Dog”]

2. `[.data.frame`(animaldata, animaldata$Animal.Type == “Dog”)

3. stop(“undefined columns selected”)

In [8]:

#fixed error

animaldata[animaldata$Animal.Type == “Dog”, ] # missedout comma with in the bracket

Some of it is basic syntax; others are a bit nastier.

Comments closed

Curl For Windows Now Uses WinSSL

Published 2017-11-14 by Kevin Feasel

David Smith notes that curl version 3.0 supports the winSSL library rather than installing OpenSSL:

To implement secure communications, the curl package needs to connect with a library that handles the SSL (secure socket layer) encryption. On Linux and Macs, curl has always used the OpenSSL library, which is included on those systems. Windows doesn’t have this library (at least, outside of the Subsystem for Linux), so on Windows the curl package included the OpenSSL library and associated certificate. This raises its own set of issues (see the post linked below for details), so version 3.0 of the package instead uses the built-in winSSL library. This means curl uses the same security architecture as other connected applications on Windows.

This shouldn’t have any impact on your web-connectivity from R now or in the future, except the knowledge that the underlying architecture is more secure. Nonetheless, it’s possible to switch back to OpenSSL-based encryption (and this remains the default on Windows 7, which does not include the winSSL).

Click through for more information.

Comments closed

Scraping SQL Saturday Statistics

Published 2017-11-14 by Kevin Feasel

Tomaz Kastrun shows how to use rvest to read the SQL Saturday website and parse schedule details:

I wanted to check a simple query: How many times has a particular topic been presented and from how many different presenters.

Sounds interesting, tackling the problem should not be a problem, just that the end numbers may vary, since there will be some text analysis included.

Read on for the code and some analysis.

Comments closed

Offline Installation Of SQL Server 2017 ML Services

Published 2017-11-13 by Kevin Feasel

Jan Mulkens shows how to install SQL Server 2017 Machine Learning Services when your the server hosting SQL Server doesn’t have outbound internet access:

That’s when you remember it… Your server isn’t connected to the internet!
Pretty normal, but in your enthusiasm you completely forgot that SQL Server needs to download some binaries for the R and Python components you so desperately want on your precious machine!

Luckily, the installer comes to your rescue and shows you where to download those binaries it needs.
Turns out however… This link only is for one R component and the installer won’t let you pass to the next screen!

Read on for the answer.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: R

liftr 📦 by Nan Xiao