R – Page 73 – Curated SQL

Calculating Distances in R

Published 2020-02-10 by Kevin Feasel

Chris Brown gives us three ways to calculate distance in R:

Calculating a distance on a map sounds straightforward, but it can be confusing how many different ways there are to do this in R.
This complexity arises because there are different ways of defining ‘distance’ on the Earth’s surface.
The Earth is spherical. So do you want to calculate distances around the sphere (‘great circle distances’) or distances on a map (‘Euclidean distances’).
Then there are barriers. For example, for distances in the ocean, we often want to know the nearest distance around islands.
Then there is the added complexity of the different spatial data types. Here we will just look at points, but these same concepts apply to other data types, like shapes.

Read on to learn these three separate techniques. H/T R-Bloggers.

Comments closed

Check Those R Repos

Published 2020-02-06 by Kevin Feasel

John Mount has a public service announcement:

In a lot of our R writing we casually say “install from CRAN using install.packages('PKGNAME')” or “update your packages by using update.packages(ask = FALSE, checkBuilt = TRUE) (and answering ‘no’ to all questions about compiling).”
We recently became aware that for some users this isn’t complete advice.
The above depends on your R install pointing to a repository that is in fact up to date. To check what repositories you are using please use the command options('repos').

The specific example here is around the Microsoft R Archive Network (MRAN), which stays at fixed dates. This is for a good reason: because it helps companies standardize on a known set of versions of R packages by default. That way you don’t have version 1.8 of a package in dev and then get 1.9 in production and find out that something broke between the two versions.

Comments closed

Audio Analysis in R

Published 2020-02-04 by Kevin Feasel

Jeroen Ooms walks us through some audio analysis with R and the av package:

The latest version of the rOpenSci av package includes some useful new tools for working with audio data. We have added functions for reading, cutting, converting, transforming, and plotting audio data in any popular audio / video format (mp3, mkv, aac, etc).
The functionality can either be used by itself, or to prepare audio data for further analysis in R using other packages. We hope this clears an important hurdle to use R for research on speech, music, and whale mating calls.

One of the most interesting things I saw Edward Tufte demonstrate was visualizing music using the Music Animation Machine. There’s a lot of space here to experiment. H/T R-Bloggers.

Comments closed

Generating Fake Data with R

Published 2020-02-03 by Kevin Feasel

Dave Mason takes a look at generating fake PII in R:

I’ve been thinking about R and how it can be used by developers, DBAs, and other SQL Server professionals that aren’t data scientists per se. A recent article about generating a data set of fake transactional data got me thinking about this again and I wondered, can R be used to obfuscate PII data?
In a word, yes. Well, mostly. (More on this in a bit.) As with anything R-related, there are probably multiple packages that are useful for any given task. For this one, I’ll focus on the “generator” package.

Click through to see what it does and Dave’s thoughts on the topic. It would also be possible to generate fake data in R by hitting a web API like Daniel Hutmacher’s service.

Comments closed

Fun with Palindromic Dates

Published 2020-02-03 by Kevin Feasel

Tomaz Kastrun has a bit of fun with the date February 2, 2020:

As of writing this blog-post, today is February 2nd, 2020. Or as I would say it, 2nd of February, 2020. There is nothing magical about it, it is just a sequence of numbers. On a boring Sunday evening, what could be more thrilling to look into this little bit further 🙂
Let’s kick R Studio and start writing a lot of useless stuff.

Tomaz also compares US versus EU palindromic dates and visualizes the different distributions.

Comments closed

Cloudera and R

Published 2020-01-30 by Kevin Feasel

Ian Cook shows us how Cloudera works for R users:

Our customers love it when they can use familiar syntax to work with data regardless of its size or its source. The popularity of sparklyr is a case in point: it enables R users to use either SQL or dplyr—both familiar to most R users—to work with large-scale data using Apache Spark. Two R packages developed at Cloudera—implyr and tidyquery—aim to provide this same choice of either SQL or dplyr when querying tables with Apache Impala and when manipulating R data frames.

The implyr package is new to me, but looks interesting.

Comments closed

R for Systems Administration

Published 2020-01-29 by Kevin Feasel

Ian Flores Siaca says hey, why not use R for systems administration:

jrDroplet is a package designed specifically to manage Virtual Machines in Digital Ocean for our training courses. The idea is that with a single line we are able to create a Digital Ocean droplet with the packages installed for our courses, hiding all of the background complexities related to infrastructure.

It’s an interesting use for this DSL.

Comments closed

Explaining Black Box Models with LIME

Published 2020-01-27 by Kevin Feasel

Holger von Jouanne-Diedrich takes us through the intuition of LIME:

There is a new hot area of research to make black-box models interpretable, called Explainable Artificial Intelligence (XAI), if you want to gain some intuition on one such approach (called LIME), read on!
Before we dive right into it it is important to point out when and why you would need interpretability of an AI. While it might be a desirable goal in itself it is not necessary in many fields, at least not for users of an AI, e.g. with text translation, character and speech recognition it is not that important why they do what they do but simply that they work.
In other areas, like medical applications (determining whether tissue is malignant), financial applications (granting a loan to a customer) or applications in the criminal-justice system (gauging the risk of recidivism) it is of the utmost importance (and sometimes even required by law) to know why the machine arrived at its conclusions.
One approach to make AI models explainable is called LIME for Local Interpretable Model-Agnostic Explanations. There is already a lot in this name!

LIME is not trivial to use and it can be very slow, but it is a great way to visualize models.

Comments closed

Generating Synthetic Data with R

Published 2020-01-24 by Kevin Feasel

Sidharth Macherla uses the conjurer package in R to generate synthetic data:

If you are building data science applications and need some data to demonstrate the prototype to a potential client, you will most likely need synthetic data. In this article, we discuss the steps to generating synthetic data using the R package ‘conjurer’.

One of the toughest problems of generating data is making it look realistic enough. It’s one level of difficulty to build “steady-state” data, but if you want data to follow a combination of trend and random walk…that’s when things get dicey. H/T R-Bloggers

Comments closed

Concepts in Support Vector Machines

Published 2020-01-23 by Kevin Feasel

Abhijit Telang takes us through the calculations involved in Support Vector Machines and then gives us an example in R:

So, let’s take that out and we are back to old, classical vector algebra. It’s like a person with a bunch of sticks to figure out which one to lay where in a 2-D plane to separate one class of objects from another, provided class definitions are already known.
The problem is which particular shape and length must be chosen to show maximum contrast between classes.
We need to arrive at a function definition, in such a way that the value a given function takes changes drastically (e.g. from a large positive value to a large negative value).

SVM is often great for two-class classification problems, and different variants also work well for multi-class problems.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

Category: R