Press "Enter" to skip to content

Category: R

Vectorized R I/O in Apache Spark 3.0

Hyukjin Kwon gives us a preview of SparkR improvements in Apache Spark 3.0:

When SparkR does not require interaction with the R process, the performance is virtually identical to other language APIs such as Scala, Java and Python. However, significant performance degradation happens when SparkR jobs interact with native R functions or data types.

Databricks Runtime introduced vectorization in SparkR to improve the performance of data I/O between Spark and R. We are excited to announce that using the R APIs from Apache Arrow 0.15.1, the vectorization is now available in the upcoming Apache Spark 3.0 with the substantial performance improvements.

This blog post outlines Spark and R interaction inside SparkR, the current native implementation and the vectorized implementation in SparkR with benchmark results.

Certain operations get ridiculously faster with this change.

Comments closed

Installing TensorFlow and Keras for R on SQL Server 2019 ML Services

I have a post on using TensorFlow and Keras in R on SQL Server 2019 Machine Learning Services:

What I’m doing is building a new virtual environment named r-reticulate, which is what the reticulate package in R desires. Inside that virtual environment, I’m installing the latest versions of tensorflow-probabilitytensorflow , and keras. I had DLL loading problems with TensorFlow 2.1 on Windows, so if you run into those, the proper solution is to ensure that you have the appropriate Visual C++ redistributables installed on your server.

Then, I switched back to the base virtual environment and installed the same packages. My thinking here is that I’ll probably need them for other stuff as well (and don’t tell anybody, but I’m not very good with Python environments).

Please continue not to tell anybody that I’m not very good with Python environments. I tend to dump things in the base environment, forget which one I’m in, and all kinds of other bad practices. I think I’m secretly undermining myself in Python, but I don’t have enough proof yet.

Comments closed

Evolutionary Algorithms for Color Palette Discovery

Daniel Oehm combines two interests:

Colour theory is pretty complex stuff so choosing a good palette isn’t easy, let alone evolving one. So, you’re going to have some hits and some misses. This is definitely more for fun seeing what you discover rather than finding the perfect palette. Having said that you could discover some gold!

There are best practices when choosing a palette for data visualisation depending on the context and what is to be shown. For example people tend to respond to certain colours representing high / low, hot / cold or good / bad, there is also colourblindness considerations. evoPalette won’t necessarily adhere to these ideals.

I’d like to see a genetic algorithms approach, though you’d have to define some sort of function to score each outcome, so I can see how that’d be tricky. H/T R-Bloggers

Comments closed

Setting Up Your Own R Package Repository

Steve Belcher explains how to configure a custom package repository in your environment:

One of the strengths of the R language is the thousands of third-party packages that have been made publicly available via CRAN, the Comprehensive R Archive Network. R includes several functions that make it easy to download and install these packages. However, in many enterprise environments, access to the Internet is limited or non-existent. In such environments, it is useful to create a local package repository that users can access from within the corporate firewall.

Your local repository may contain source packages, binary packages, or both. If at least some of your users will be working on Windows systems, you should include Windows binaries in your repository. Windows binaries are R-version-specific; if you are running R 3.3.3, you need Windows binaries built under R 3.3. These versioned binaries are available from CRAN and other public repositories. If at least some of your users will be working on Linux systems, you must include source packages in your repository.

There are some tools which help out with this, so read the whole thing.

Comments closed

R 4.0 Improvements: stopifnot()

Bob Rudis looks at one of the R 4.0 changes hidden in the changelog:

R 4.0.0 has been out for a while, now, and — apart from a case where merge() was slower than dirt — it’s been really stable for at least me (I use it daily on macOS, Linux, and Windows). Sure, it came with some headline-grabbing features/upgrades, but I’ve started looking at what other useful nuggets might be in the changelog and decided to blog them as I find them.

Today’s nugget is the venerable stopifnot() function which was significantly enhanced by this PR by Neil Fultz.

Read on for a quality of life improvement with error handling in R.

Comments closed

Text Mining and Sentiment Analysis in R

Sanil Mhatre walks us through a sentiment analysis scenario in R:

Sentiments can be classified as positive, neutral or negative. They can also be represented on a numeric scale, to better express the degree of positive or negative strength of the sentiment contained in a body of text.

This example uses the Syuzhet package for generating sentiment scores, which has four sentiment dictionaries and offers a method for accessing the sentiment extraction tool developed in the NLP group at Stanford. The get_sentiment function accepts two arguments: a character vector (of sentences or words) and a method. The selected method determines which of the four available sentiment extraction methods will be used. The four methods are syuzhet (this is the default), bingafinn and nrc. Each method uses a different scale and hence returns slightly different results. Please note the outcome of nrc method is more than just a numeric score, requires additional interpretations and is out of scope for this article. The descriptions of the get_sentiment function has been sourced from : https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?

Comments closed

Accessing Azure Queue Storage from R

Hong Ooi announces a new package for R:

This post is to announce that the AzureQstor package is now on GitHub. AzureQstor provides an R interface to Azure queue storage, building on the facilities provided by AzureStor.

Queue Storage is a service for storing large numbers of messages, for example from automated sensors, that can be accessed remotely via authenticated calls using HTTP or HTTPS. A single queue message can be up to 64 KB in size, and a queue can contain millions of messages, up to the total capacity limit of a storage account. Queue storage is often used to create a backlog of work to process asynchronously.

Hong includes a couple of demos as well, so check it out.

Comments closed

Solving the Prisoner Coin Flipping Puzzle with R

David Robinson takes us through another problem-solving challenge:

You are locked in the dungeon of a faraway castle with three fellow prisoners (i.e., there are four prisoners in total), each in a separate cell with no means of communication. But it just so happens that all of you are logicians (of course)….

Each prisoner will be given a fair coin, which can either be fairly flipped one time or returned to the guards without being flipped. If all flipped coins come up heads, you will all be set free! But if any of the flipped coins comes up tails, or if no one chooses to flip a coin, you will all be doomed to spend the rest of your lives in the castle’s dungeon.

The only tools you and your fellow prisoners have to aid you are random number generators, which will give each prisoner a random number, uniformly and independently chosen between zero and one.

What are your chances of being released?

I’ll solve this with tidy simulation in R, in particular using one of my favorite functions, tidyr’s crossing(). In an appendix, I’ll show how to get a closed form solution for N = 4.

I’ve also posted a 30-minute screencast of how I first approached the simulation and visualization.

Click through for the solution and explanation.

Comments closed

R 4.0 Released

David Smith walks us through what’s new in R 4.0:

R 4.0.0 was released in source form on Friday, and binaries for Windows, Mac and Linux are available for download now.

As the version number bump suggests, this is a major update to R that makes some significant changes. Some of these changes — particularly the first one listed below — are likely to affect the results of R’s calculations, so I would not recommend running scripts written for prior versions of R without validating them first. In any case, you’ll need to reinstall any packages you were using for R 4.0.0. (You might find this R script useful for checking what packages you have installed for R 3.x.)

And I just got 3.6 into production yesterday. Them’s the breaks…

Comments closed

The Siren Song of High Accuracy

Holger von Jouanne-Diedrich notes that accuracy is not in itself necessarily a good thing for a machine learning model:

In one of my most popular posts So, what is AI really? I showed that Artificial Intelligence (AI) basically boils down to autonomously learned rules, i.e. conditional statements or simply, conditionals.

In this post, I create the simplest possible classifier, called ZeroR, to show that even this classifier can achieve surprisingly high values for accuracy (i.e. the ratio of correctly predicted instances)… and why this is not necessarily a good thing, so read on!

The nuanced answer here is that with classifiers, accuracy is not in itself a great measure in the case of class imbalance. The more balanced your classes are, the more likely it is that a model with high accuracy is a good model. That’s where other measures such as specificity and sensitivity, positive & negative predictive value, etc. come into play.

Comments closed