R – Page 78 – Curated SQL

Important Assumptions with Linear Models

Additivity and linearity as the second most important assumptions in linear models
We assume that \(y\) is a linear function of the predictors. If \(y\) is not a linear function of the predictors, we cannot expect the model to deliver correct insights (predictions, causal coefficients). Let’s check an example.

Read on to understand what this means, as well as the most important assumption.
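
As a rough illustration of why linearity matters (my own sketch, not the example from the linked post), here is a straight line fitted by ordinary least squares to data that is actually quadratic. The data, the `fit_line` helper, and the variable names are all made up for this sketch. The fit runs without complaint, but the residuals form a U shape rather than scattering around zero, which is the usual sign that the linear form is wrong.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: y is a quadratic function of x, not a linear one.
xs = list(range(11))
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)

# Residuals = observed minus fitted. For a well-specified model these should
# show no pattern; here they are positive at both ends and negative in the middle.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x={x:2d}  residual={r:6.1f}")
```

The residuals still sum to zero, as they always do for a least-squares fit with an intercept, so an overall error total would not reveal the problem. Looking at the residuals against the predictor does.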

Updates to AzureR Packages

Published 2019-11-13 by Kevin Feasel

Hong Ooi announces changes to several AzureR packages:

AzureVM 2.1.0
You can now create VM scalesets with attached data disks. In addition, you can specify the disk type (Standard_LRS, StandardSSD_LRS, or Premium_LRS) for the OS disk and, for a Linux Data Science Virtual Machine, the supplied data disk. This enables using VM sizes that don’t support Premium storage.

Click through for the full set of updates.

Merging Datasets in R with the Tidyverse

Published 2019-11-04 by Kevin Feasel

Anisa Dhana shows off several tidyverse methods for combining data sets together:

semi_join
The semi_join function is different from the previous examples of joins. A semi join creates a new dataset containing all rows from data1 that have a corresponding matching value in data2. However, instead of merging the first (data1) and second (data2) datasets, the final dataset contains only the variables from the first one (data1).

Most of this looks like standard SQL joins, but read through to the end for a bonus which doesn’t typically appear in relational database products.
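
To make the semi join semantics concrete, here is a small language-neutral sketch in plain Python (it is not dplyr code and not from the linked post). The `semi_join` function, the sample rows, and the `id` key are all hypothetical. The idea is the same as in the quote: keep rows of `data1` that have a match in `data2`, and keep only `data1`'s columns.

```python
def semi_join(left, right, key):
    """Return rows of `left` whose `key` value appears in `right`.

    Only columns from `left` are kept, and a left row appears at most once
    even when several right rows match it.
    """
    right_keys = {row[key] for row in right}
    return [dict(row) for row in left if row[key] in right_keys]


# Hypothetical sample data standing in for data1 and data2.
data1 = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
data2 = [
    {"id": 1, "score": 10},
    {"id": 3, "score": 30},
    {"id": 3, "score": 31},  # second match for id 3
]

result = semi_join(data1, data2, "id")
print(result)
```

Note that id 3 matches two rows in `data2` but still appears only once in the result, and `score` never shows up. That is the difference from an inner join, which would return both matches along with the columns from both sides.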

Mocking Objects with R

Published 2019-10-31 by Kevin Feasel

The R-hub blog has an interesting post on creating mocks in R for unit testing:

In some of these cases, the programming concept you’re after is mocking, i.e. making a function act as if something were a certain way! In this blog post we shall offer a round-up of resources around mocking, or not mocking, when unit testing an R package.

It’s interesting watching data scientists work through the same sorts of problems which traditional developers have hit, whether that be testing, deployment, or source control management. H/T R-bloggers
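
For readers who have not run into mocking before, here is a minimal sketch of the concept using Python's standard `unittest.mock` module. This is not the R tooling the R-hub post surveys, and `seconds_since` is a made-up function for illustration. It depends on the system clock, so the test temporarily replaces `time.time` with a stand-in that always returns a fixed value.

```python
import time
from unittest import mock


def seconds_since(start):
    """Hypothetical function under test: depends on the real clock."""
    return time.time() - start


# Inside the `with` block, time.time() is replaced by a mock that always
# returns 100.0, so the result is deterministic and easy to assert on.
with mock.patch("time.time", return_value=100.0):
    elapsed = seconds_since(90.0)

print(elapsed)

# Outside the block, the real time.time is restored automatically.
restored = time.time()
```

The function under test is unchanged. Only its view of the outside world is swapped out for the duration of the test, which is the "act as if something were a certain way" idea from the quote.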

Plotting Three-Dimensional Linear Models

Published 2019-10-29 by Kevin Feasel

Sebastian Sauer shows a few techniques for visualizing linear models with two predictors:

Linear models are a standard way of predicting or explaining some data. Visualizing data is not only of didactic value but provides heuristic value too, as demonstrated by Anscombe’s Quartet.
Visualizing linear models in 2D is straightforward, but visualizing linear models with more than one predictor is much less so. The aim of this post is to demonstrate some ways to visualize linear models with more than one predictor, using popular R packages. We will focus on 3D examples, that is, two predictors.

I have a strong bias against 3D visuals because they tend to be so difficult to see clearly. There are times when they’re necessary, though.

Re-Introducing rquery

Published 2019-10-28 by Kevin Feasel

John Mount has a new introduction to rquery:

rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of R’s base::transform(), or dplyr’s dplyr::mutate() and uses a pipe in the style popularized in R with magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.
The R/rquery version of this introduction is here, and the Python/data_algebra version of this introduction is here.

Check it out.

rBokeh Tips for Missing Arguments

Published 2019-10-23 by Kevin Feasel

Matthias Nistler walks through troubleshooting rBokeh missing argument errors:

This approach is my go-to solution for changing an rBokeh plot when an argument that is available in Python is missing in rBokeh.
– Create the plot.
– Inspect the structure (str(plot)) of the rBokeh object.
– Search for the Python argument’s name.
– Overwrite the value with the desired option as derived from Python’s bokeh.

Given how nice the bokeh package looks, I really want rBokeh to work well. Hopefully this experience improves over time.

Using Schemas with DBI and SQL Server

Published 2019-10-22 by Kevin Feasel

Thomas Roh takes us through an oddity in R’s DBI library:

I ran into an issue the other day where I was trying to write a new table to a SQL Server database with a non-default schema. I did end up spending a bit of time debugging and researching, so I wanted to share for anyone else who runs into the issue. Using the DBI::Id function allows you to specify the schema when you are trying to write a table to a SQL Server database.

Click through for the end result. I will say that the more I work with DBI, the more I’m tempted to keep using RODBC, at least when working with SQL Server. H/T R-bloggers.

Building Custom R Packages

Published 2019-10-21 by Kevin Feasel

Brad Lindblad takes us through building a custom package in R:

Don’t repeat yourself (DRY) is a well-known maxim in software development, and most R programmers follow this rule and build functions to avoid duplicating code. But how often do you:
– Reference the same dataset in different analyses
– Create the same ODBC connection to a database
– Tinker with the same colors and themes in ggplot
– Produce markdown docs from the same template
and so on? Notice a pattern? The word “same” is sprinkled in each bullet point. I smell an opportunity to apply DRY!

This is a good point: packages don’t have to go out to the broader world. They’re useful even if they just help you (or your team) out. H/T R-bloggers

Evaluating a Classification Model with a Spam Filter

Published 2019-10-21 by Kevin Feasel

John Mount shares an extract from Mount and Nina Zumel’s Practical Data Science with R, 2nd Edition:

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.
It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.

Click through for that extract. I liked the first edition of the book, so I’m looking forward to the 2nd.
