R – Page 69 – Curated SQL

A client recently came to us with a question: what’s a good way to monitor data or model output for changes? That is, how can you tell if new data is distributed differently from previous data, or if the distribution of scores returned by a model have changed? This client, like many others who have faced the same problem, simply checked whether the mean and standard deviation of the data had changed more than some amount, where the threshold value they checked against was selected in a more or less ad-hoc manner. But they were curious whether there was some other, perhaps more principled way, to check for a change in distribution.

The answer is, of course, that there is. Click through to see a few of the techniques.

Comments closed

Changes in the R foreach Package

Published 2020-02-12 by Kevin Feasel

Hong Ooi announces some changes to the foreach package in R:

This post is to announce some new and upcoming changes in the foreach package.
First, foreach can now be found on GitHub! The repository is at https://github.com/RevolutionAnalytics/foreach, replacing its old home on R-Forge. Right now the repo hosts both the foreach and iterators packages, but that may change later.

There are also some changes to the package itself, so read on for those.

Comments closed

Pulling R Packages from Fedora

Published 2020-02-11 by Kevin Feasel

Inaki Ucar has an interesting project:

Bringing R packages to Fedora (in fact, to any distro) is an Herculean task, especially considering the rate at which CRAN grows nowadays. So I am happy to announce the cran2copr project, which is an attempt to maintain binary RPM repos for most of CRAN (~15k packages as of Feb. 2020) in an automated way using Fedora Copr.

Click through for installation instructions if you’re using an RPM-based Linux distribution like Fedora or CentOS. H/T R-Bloggers.

Comments closed

Calculating Distances in R

Published 2020-02-10 by Kevin Feasel

Chris Brown gives us three ways to calculate distance in R:

Calculating a distance on a map sounds straightforward, but it can be confusing how many different ways there are to do this in R.
This complexity arises because there are different ways of defining ‘distance’ on the Earth’s surface.
The Earth is spherical. So do you want to calculate distances around the sphere (‘great circle distances’) or distances on a map (‘Euclidean distances’).
Then there are barriers. For example, for distances in the ocean, we often want to know the nearest distance around islands.
Then there is the added complexity of the different spatial data types. Here we will just look at points, but these same concepts apply to other data types, like shapes.

Read on to learn these three separate techniques. H/T R-Bloggers.

Comments closed

Check Those R Repos

Published 2020-02-06 by Kevin Feasel

John Mount has a public service announcement:

In a lot of our R writing we casually say “install from CRAN using install.packages('PKGNAME')” or “update your packages by using update.packages(ask = FALSE, checkBuilt = TRUE) (and answering ‘no’ to all questions about compiling).”
We recently became aware that for some users this isn’t complete advice.
The above depends on your R install pointing to a repository that is in fact up to date. To check what repositories you are using please use the command options('repos').

The specific example here is around the Microsoft R Archive Network (MRAN), which stays at fixed dates. This is for a good reason: because it helps companies standardize on a known set of versions of R packages by default. That way you don’t have version 1.8 of a package in dev and then get 1.9 in production and find out that something broke between the two versions.

Comments closed

Audio Analysis in R

Published 2020-02-04 by Kevin Feasel

Jeroen Ooms walks us through some audio analysis with R and the av package:

The latest version of the rOpenSci av package includes some useful new tools for working with audio data. We have added functions for reading, cutting, converting, transforming, and plotting audio data in any popular audio / video format (mp3, mkv, aac, etc).
The functionality can either be used by itself, or to prepare audio data for further analysis in R using other packages. We hope this clears an important hurdle to use R for research on speech, music, and whale mating calls.

One of the most interesting things I saw Edward Tufte demonstrate was visualizing music using the Music Animation Machine. There’s a lot of space here to experiment. H/T R-Bloggers.

Comments closed

Generating Fake Data with R

Published 2020-02-03 by Kevin Feasel

Dave Mason takes a look at generating fake PII in R:

I’ve been thinking about R and how it can be used by developers, DBAs, and other SQL Server professionals that aren’t data scientists per se. A recent article about generating a data set of fake transactional data got me thinking about this again and I wondered, can R be used to obfuscate PII data?
In a word, yes. Well, mostly. (More on this in a bit.) As with anything R-related, there are probably multiple packages that are useful for any given task. For this one, I’ll focus on the “generator” package.

Click through to see what it does and Dave’s thoughts on the topic. It would also be possible to generate fake data in R by hitting a web API like Daniel Hutmacher’s service.

Comments closed

Fun with Palindromic Dates

Published 2020-02-03 by Kevin Feasel

Tomaz Kastrun has a bit of fun with the date February 2, 2020:

As of writing this blog-post, today is February 2nd, 2020. Or as I would say it, 2nd of February, 2020. There is nothing magical about it, it is just a sequence of numbers. On a boring Sunday evening, what could be more thrilling to look into this little bit further 🙂
Let’s kick R Studio and start writing a lot of useless stuff.

Tomaz also compares US versus EU palindromic dates and visualizes the different distributions.

Comments closed

Cloudera and R

Published 2020-01-30 by Kevin Feasel

Ian Cook shows us how Cloudera works for R users:

Our customers love it when they can use familiar syntax to work with data regardless of its size or its source. The popularity of sparklyr is a case in point: it enables R users to use either SQL or dplyr—both familiar to most R users—to work with large-scale data using Apache Spark. Two R packages developed at Cloudera—implyr and tidyquery—aim to provide this same choice of either SQL or dplyr when querying tables with Apache Impala and when manipulating R data frames.

The implyr package is new to me, but looks interesting.

Comments closed

R for Systems Administration

Published 2020-01-29 by Kevin Feasel

Ian Flores Siaca says hey, why not use R for systems administration:

jrDroplet is a package designed specifically to manage Virtual Machines in Digital Ocean for our training courses. The idea is that with a single line we are able to create a Digital Ocean droplet with the packages installed for our courses, hiding all of the background complexities related to infrastructure.

It’s an interesting use for this DSL.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: R

Monitoring for Distribution Changes

Changes in the R foreach Package

Pulling R Packages from Fedora

Calculating Distances in R

Check Those R Repos

Audio Analysis in R

Generating Fake Data with R

Fun with Palindromic Dates

Cloudera and R

R for Systems Administration