Press "Enter" to skip to content

Using Schemas with DBI and SQL Server

Thomas Roh takes us through an oddity in R’s DBI library:

I ran into an issue the other day where I was trying to write a new table to a SQL Server database with a non-default schema. I did end up spending a bit of time debugging and researching, so I wanted to share for anyone else who runs into the issue. Using the DBI::Id function allows you to specify the schema when you are trying to write a table to a SQL Server database.

Click through for the end result. I will say that the more I work with DBI, the more I’m tempted to keep using rodbc, at least when working with SQL Server. H/T R-Bloggers.
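
To make this concrete, here is a minimal sketch of the technique; the connection details are placeholders, so adjust the driver, server, and database for your environment.

    library(DBI)
    library(odbc)

    # Placeholder connection details; swap in your own server and database
    con <- dbConnect(odbc::odbc(),
                     Driver   = "ODBC Driver 17 for SQL Server",
                     Server   = "localhost",
                     Database = "TestDB",
                     Trusted_Connection = "yes")

    # DBI::Id qualifies the table name with a schema, so the table lands
    # in sales rather than the default dbo schema
    table_id <- DBI::Id(schema = "sales", table = "monthly_totals")
    dbWriteTable(con, table_id, mtcars)

    dbDisconnect(con)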

Building Custom R Packages

Brad Lindblad takes us through building a custom package in R:

Don’t repeat yourself (DRY) is a well-known maxim in software development, and most R programmers follow this rule and build functions to avoid duplicating code. But how often do you:
– Reference the same dataset in different analyses
– Create the same ODBC connection to a database
– Tinker with the same colors and themes in ggplot
– Produce markdown docs from the same template

and so on? Notice a pattern? The word “same” is sprinkled in each bullet point. I smell an opportunity to apply DRY!

This is a good point: packages don’t have to go out to the broader world. They’re useful even if they just help you (or your team) out. H/T R-Bloggers.
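
A rough sketch of what that looks like in practice: the usethis package scaffolds a personal package to hold those shared helpers. The package path, function name, and dataset below are all hypothetical.

    library(usethis)

    # Scaffold a new package to hold your shared code
    create_package("~/dev/mytools")

    # Add a file for a reusable helper, e.g. your standard ODBC connection
    use_r("db_connect")

    # Ship a frequently used dataset with the package
    my_lookup_table <- data.frame(id = 1:3, label = c("a", "b", "c"))
    use_data(my_lookup_table)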

Evaluating a Classification Model with a Spam Filter

John Mount shares an extract from Mount and Nina Zumel’s Practical Data Science with R, 2nd Edition:

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.

Click through for that extract. I liked the first edition of the book, so I’m looking forward to the 2nd.
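
In that spirit, here is a minimal sketch of evaluation decoupled from construction; the label vectors are made up, and the evaluation code does not care which model produced the predictions.

    # Hypothetical labels for a handful of messages
    actual    <- factor(c("spam", "spam", "ham", "ham", "spam", "ham"))
    predicted <- factor(c("spam", "ham",  "ham", "ham", "spam", "spam"))

    # Confusion matrix, then precision and recall for the spam class
    conf <- table(predicted, actual)
    precision <- conf["spam", "spam"] / sum(conf["spam", ])
    recall    <- conf["spam", "spam"] / sum(conf[, "spam"])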

Fun with [ as a Function in R

John Mount definitely quotes Dr. Ian Malcolm correctly in this one:

How about defining a new [-based function call notation? The idea is: we could write sin[5] in place of sin(5), thus unifying the notations for function call and array access. Some languages do in fact have unified function call and array access (though often using “(” for both). Example languages include Fortran and Matlab.

Let’s add R to the list of such languages.

I love the flexibility in the language, almost as much as I would enjoy taking away production rights from the person who ships this in my code base…
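
For the curious, here is a sketch of the general trick, though not necessarily Mount’s exact code: wrap a function in an S3 class whose [ method forwards to a function call.

    # Wrap a function so that `[` behaves like a call; illustrative only
    as_bracket_callable <- function(f) structure(f, class = "bracket_callable")

    `[.bracket_callable` <- function(f, ...) {
      # Drop the class attribute, then call the underlying function
      unclass(f)(...)
    }

    bsin <- as_bracket_callable(sin)
    bsin[pi / 2]   # 1, same as sin(pi / 2)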

Multiple Hypothesis Testing with R

Roland Stevenson shows how we can perform multiple hypothesis tests on data, as well as potential issues:

Both results show that evaluating two tests on the same family of data will lead to a ~10% chance that a researcher will claim a “significant” result if they look for either test to reject the null. Any claim there is a maximum 5% false positive rate would be mistaken. As an exercise, verify that doing the same on \(m=4\) tests will lead to an ~18% chance!

A bad testing platform would be one that claims a maximum 5% false positive rate when any one of multiple tests on the same family of data show significance at the 5% level. Clearly, if a researcher is going to claim that the FWER is no more than \(\alpha\), then they must control for the FWER and carefully consider how individual tests reject the null.

This is worth taking some time to read carefully. H/T R-Bloggers
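
The arithmetic behind those figures is easy to check, assuming independent tests at \(\alpha = 0.05\).

    alpha <- 0.05

    # Family-wise error rate for m independent tests: 1 - (1 - alpha)^m
    1 - (1 - alpha)^2   # ~0.0975, the ~10% figure for two tests
    1 - (1 - alpha)^4   # ~0.1855, the ~18% figure for four tests

    # Base R can adjust p-values to control the error rate, e.g. Bonferroni
    p.adjust(c(0.03, 0.04), method = "bonferroni")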

Principal Component Analysis in Python

Abhinav Choudhary shows us how to implement Principal Component Analysis in Python:

Principal Component Analysis (PCA) is an unsupervised statistical technique used to examine the interrelations among a set of variables in order to identify the underlying structure of those variables. In simple terms, suppose you have 30 feature columns in a data frame; PCA helps reduce the number of features by creating new features, each of which captures the combined effect of the original features. It is closely related to factor analysis.

PCA is quite useful in practice, though it has the unfortunate side effect of making it harder to interpret which factors are driving your solution.
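
The article works in Python, but a minimal equivalent in base R, using mtcars as a stand-in dataset, looks like this:

    # Standardize the features first, since PCA is sensitive to scale
    scaled <- scale(mtcars)

    pca <- prcomp(scaled)
    summary(pca)         # proportion of variance explained per component
    head(pca$x[, 1:2])   # observations projected onto the first two PCs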

Using purrr to Eliminate Looped Function Calls

Sebastian Sauer demonstrates using the pmap() function in purrr to call a function multiple times with different parameters:

Assume you have to call a function multiple times, but each time with (possibly) different arguments. Given enough repetitions, you will not want to repeat yourself.

In other words, we would like to loop over function arguments, each round in the loop giving the respective argument value(s) to the function.

This is one of the benefits of functional-style programming: loops become higher-order functions, which take less time to write and keep your code from looking like a pyramid of doom.
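
Here is a small sketch of pmap() in action; the function and parameter values are arbitrary.

    library(purrr)

    # Three argument sets for rnorm(), iterated over in parallel
    params <- list(
      n    = c(2, 3, 4),
      mean = c(0, 10, 100),
      sd   = c(1, 2, 3)
    )

    # One rnorm(n, mean, sd) call per set of parameters
    pmap(params, rnorm)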

Fun with Residual Plots

Nina Zumel explains why, when plotting residuals, you always put predictions on the X axis and residuals on the Y axis:

One reason that the proper residual graph (for a well fit model) should smooth out to the line y=0 is known as reversion to mediocrity, or regression to the mean.

Imagine that you have an ideal process that always produces a single value y. You don’t actually observe this “true value”; instead, what you observe is y plus (IID, zero mean) noise. You can build a “model” for this process that predicts the mean of the observations, in this case the value 0.1033149. Then you can calculate the residuals of your “model” in the usual way.

This post went in a direction I wasn’t expecting, and it was all the better for it.
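
For reference, the recommended plot takes just a few lines; the model and data here are illustrative.

    # Fit any model; the residual plot does not depend on the model type
    model <- lm(mpg ~ wt + hp, data = mtcars)

    plot(predict(model), resid(model),
         xlab = "prediction", ylab = "residual")
    abline(h = 0, lty = 2)   # a well-fit model hovers around the y = 0 line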

Topic Modeling

Federico Pascual has an article on topic modeling and topic classification:

Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents. It’s known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.

Since topic modeling doesn’t require training, it’s a quick and easy way to start analyzing your data. However, you can’t guarantee you’ll receive accurate results, which is why many businesses opt to invest time training a topic classification model.

The article is long but worth the read, with examples in Python and additional notes for R.
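
To go along with those notes for R, here is a sketch of LDA-based topic modeling using the topicmodels and tm packages; the toy corpus is made up.

    library(tm)
    library(topicmodels)

    # A tiny corpus with two obvious themes
    docs <- c("dogs and cats and pets",
              "stocks bonds and markets",
              "pets love dogs and cats",
              "trading stocks in markets")

    corpus <- VCorpus(VectorSource(docs))
    dtm    <- DocumentTermMatrix(corpus)

    # Fit a two-topic model; no labeled training data required
    lda <- LDA(dtm, k = 2, control = list(seed = 123))
    terms(lda, 3)   # top three terms per discovered topic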
