
Category: R

Air Travel Route Maps With ggplot2

Peter Prevos wants to create a pretty map of flights he’s taken:

The first step was to create a list of all the places I have flown between at least once. Paging through my travel photos and diaries, I managed to create a pretty complete list. The structure of this document is simply a list of all routes (From, To) and every flight only gets counted once. The next step finds the spatial coordinates for each airport by searching Google Maps using the geocode function from the ggmap package. In some instances, I had to add the country name to avoid confusion between places.
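As a rough illustration of that workflow, here is a minimal sketch that builds a small route list, counts each route once, and attaches coordinates to both ends. The flights and the approximate airport coordinates are made-up sample data, not Peter's list, and the coordinates are hard-coded rather than looked up with ggmap's geocode function (which needs a Google API key). The plot is only drawn if ggplot2 is installed, and it skips the world basemap.

```r
# Hypothetical flight log: one row per flight leg (From, To).
flights <- data.frame(
  from = c("Melbourne", "Singapore", "Amsterdam", "Melbourne"),
  to   = c("Singapore", "Amsterdam", "Singapore", "Singapore"),
  stringsAsFactors = FALSE
)

# Count every route once, regardless of direction or repetition.
route_key <- ifelse(flights$from < flights$to,
                    paste(flights$from, flights$to, sep = " - "),
                    paste(flights$to, flights$from, sep = " - "))
routes <- flights[!duplicated(route_key), ]

# Approximate coordinates (illustrative; the post geocodes these instead).
airports <- data.frame(
  name = c("Melbourne", "Singapore", "Amsterdam"),
  lon  = c(144.84, 103.99, 4.76),
  lat  = c(-37.67, 1.36, 52.31),
  stringsAsFactors = FALSE
)

# Attach coordinates for both ends of each route.
idx_from <- match(routes$from, airports$name)
idx_to   <- match(routes$to, airports$name)
routes$lon_from <- airports$lon[idx_from]
routes$lat_from <- airports$lat[idx_from]
routes$lon_to   <- airports$lon[idx_to]
routes$lat_to   <- airports$lat[idx_to]

# Draw curved route lines if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  route_map <- ggplot(routes) +
    geom_curve(aes(x = lon_from, y = lat_from, xend = lon_to, yend = lat_to),
               curvature = 0.2, colour = "steelblue") +
    geom_point(data = airports, aes(x = lon, y = lat)) +
    geom_text(data = airports, aes(x = lon, y = lat, label = name), vjust = -1) +
    coord_fixed() +
    labs(x = "Longitude", y = "Latitude")
  print(route_map)
}
```

Treating a route and its reverse as the same route is my reading of "every flight only gets counted once"; drop the ifelse if direction matters to you.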

The end result is imperfect (as Peter mentions, ggmap isn’t wrapping around), but does fit the bill for being eye-catching.


replyr

John Mount shows off replyr, which is dplyr for remote, distributed data sets (think SparkR or sparklyr):

Suppose we had a large data set hosted on a Spark cluster that we wished to work with using dplyr and sparklyr (for this article we will simulate such using data loaded into Spark from the nycflights13 package).

We will work a trivial example: taking a quick peek at your data. The analyst should always be able to and willing to look at the data.

It is easy to look at the top of the data, or any specific set of rows of the data.

Read on for more details.


R 3.3.3 Released

David Smith alerts us to R 3.3.3:

The R core group announced today the release of R 3.3.3 (code-name: “Another Canoe”). As the wrap-up release of the R 3.3 series, this update mainly contains minor bug-fixes. (Bigger changes are planned for R 3.4.0, expected in mid-April.) Binaries for the Windows version are already up on the CRAN master site, and binaries for all platforms will appear on your local CRAN mirror within the next couple of days.

For now, I’m holding out until R 3.4.0.


Using Prophet For Stock Price Predictions

Marcelo Perlin looks at Facebook’s Prophet to see if it works well for predicting stock price movements:

The previous histogram shows the total return from randomly generated signals in 10^{4} simulations. The vertical line is the result from using prophet. As you can see, it is a bit higher than the average of the distribution. The total return from prophet is lower than the return of the naive strategy in 27.5 percent of the simulations. This is not a bad result. But, notice that we didn’t add trading or liquidity costs to the analysis, which will make the total returns worse.

The main results of this simple study are clear: prophet is bad at point forecasts for returns but does much better in directional predictions. It might be interesting to test it further, with more data, adding trading costs, other forecasting setups, and see if the results hold.
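To make the benchmark idea concrete, here is a minimal sketch of comparing one strategy's total return against the distribution of total returns from randomly generated signals. Everything in it is an assumption for illustration: the daily returns are simulated, and model_signal is a random stand-in for a model's directional calls (no prophet model is fit here), so the numbers say nothing about prophet itself.

```r
set.seed(42)

n_days <- 250
n_sim  <- 1e4

# Simulated daily returns (synthetic data, not real prices).
daily_ret <- rnorm(n_days, mean = 0.0005, sd = 0.01)

# Hypothetical stand-in for a model's directional calls:
# 1 = hold the asset that day, 0 = stay out of the market.
model_signal <- rbinom(n_days, size = 1, prob = 0.5)

# Compounded total return when invested only on days where signal == 1.
total_return <- function(signal, ret) prod(1 + signal * ret) - 1

model_total <- total_return(model_signal, daily_ret)

# Benchmark: total returns from randomly generated signals.
random_totals <- replicate(
  n_sim,
  total_return(rbinom(n_days, size = 1, prob = 0.5), daily_ret)
)

# Share of simulations in which random signals beat the model.
share_random_wins <- mean(random_totals > model_total)

# Histogram of the random strategies, with the model marked by a vertical line.
hist(random_totals, breaks = 50,
     main = "Total return from random signals", xlab = "Total return")
abline(v = model_total, lwd = 2)
```

As in the excerpt, this ignores trading and liquidity costs, which would lower every total return here.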

This is a very interesting article, worth reading. H/T R Bloggers


htmlwidgets

David Smith writes about the htmlwidgets gallery in R:

While R’s base graphics library is almost limitlessly flexible when it comes to creating static graphics and data visualizations, new Web-based technologies like d3 and webgl open up new horizons in high-resolution, rescalable and interactive charts. Graphics built with these libraries can easily be embedded in a webpage, can be dynamically resized while maintaining readable fonts and clear lines, and can include interactive features like hover-over data tips or movable components. And thanks to htmlwidgets for R, you can easily create a variety of such charts using R data and functions, explore them in an interactive R session, and include them in Web-based applications for others to experience.

There are some nice widgets in this set.


Frequency Tables

Mala Mahadevan shows how to generate a frequency table in T-SQL and in R:

My results are as below. I have 1000 records in the table. This tells me that I have 82 occurrences of age cohort 0-5, 8.2% of my dataset is from this bracket, 82 again is the cumulative frequency since this is the first record and 8.2 cumulative percent. For the next bracket 06-12 I have 175 occurrences, 17.5%, 257 occurrences of age 12 or below, and 25.7% of my data falls in this bracket or below. And so on.
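For a sense of how those four columns fit together, here is a minimal base R sketch (not Mala's script) that computes frequency, percent, cumulative frequency, and cumulative percent. The ten ages and the cohort break points are made up for illustration.

```r
# Small made-up sample of ages (the post works with a 1,000-row table).
ages <- c(3, 4, 8, 10, 11, 15, 17, 20, 25, 30)

# Bucket ages into cohorts; these break points are assumptions.
cohort <- cut(ages,
              breaks = c(-Inf, 5, 12, 18, Inf),
              labels = c("0-5", "06-12", "13-18", "19+"))

# Count the rows in each cohort.
counts <- table(cohort)

# Frequency and percent per cohort, then running totals of each.
freq_table <- data.frame(
  cohort    = names(counts),
  frequency = as.vector(counts),
  percent   = as.vector(counts) / length(ages) * 100,
  stringsAsFactors = FALSE
)
freq_table$cum_frequency <- cumsum(freq_table$frequency)
freq_table$cum_percent   <- cumsum(freq_table$percent)

print(freq_table)
```

The cumulative columns are just running sums, so the last row should always land on the total row count and 100 percent.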

Click through for the T-SQL and R scripts.


Figuring Out Cost Threshold For Parallelism

Grant Fritchey uses R to help him decide on a good cost threshold for parallelism value:

With the Standard Deviation in hand, and a quick rule of thumb that says 95% of all values are going to be within two standard deviations of the data set, I can determine that a value of 16 on my Cost Threshold for Parallelism is going to cover most cases, and will ensure that only a small percentage of queries go parallel on my system, but that those which do go parallel are actually costly queries, not some that just fall outside the default value of 5.

I’ve made a couple of assumptions that are not completely held up by the data. Using the two, or even three, standard deviations to cover just enough of the data isn’t actually supported in this case because I don’t have a normal distribution of data. In fact, the distribution here is quite heavily skewed to one end of the chart. There’s also no data on the frequency of these calls. You may want to add that into your plans for setting your Cost Threshold.
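Here is a minimal sketch of that reasoning on skewed data. The query costs are synthetic (drawn from a log-normal distribution as a stand-in for real plan cache data), so the resulting thresholds are illustrative only. For reference, under a normal distribution roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two; the sketch computes both, then checks what the skewed sample actually does.

```r
set.seed(1)

# Synthetic, right-skewed estimated query costs (not real plan cache data).
query_cost <- rlnorm(1000, meanlog = 0.5, sdlog = 1)

cost_mean <- mean(query_cost)
cost_sd   <- sd(query_cost)

# Candidate threshold: mean plus two standard deviations.
threshold_2sd <- cost_mean + 2 * cost_sd

# Normal-distribution rule of thumb: share within 1 and 2 SDs of the mean.
within_one_sd <- pnorm(1) - pnorm(-1)   # about 0.68
within_two_sd <- pnorm(2) - pnorm(-2)   # about 0.95

# What the skewed sample actually does: share of queries at or below the threshold.
share_below <- mean(query_cost <= threshold_2sd)

# Distribution-free alternative: take the threshold from a percentile directly.
threshold_p95 <- unname(quantile(query_cost, 0.95))

cat("mean + 2 SD threshold:", round(threshold_2sd, 1), "\n")
cat("share of queries at or below it:", round(share_below, 3), "\n")
cat("95th percentile threshold:", round(threshold_p95, 1), "\n")
```

With skewed data, a percentile-based threshold avoids leaning on the normality assumption, though it still ignores how often each query runs.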

This is a nice start. If you’re looking for a more experimental analysis, you could try A/B testing (particularly if you have a good sample workload), where you track whatever pertinent counters you need (e.g., query runtime, whether it went parallel, CPU and disk usage) under different cost threshold regimes and do a comparative analysis.


Cognitive Services With R

Steph Locke shows how to use the Microsoft Cognitive Services Text Analytics API within R:

We have some different languages and we need to first do language detection before we can analyse the sentiment of our phrases.

# Construct a request (POST and add_headers come from httr, toJSON from jsonlite)
library(httr)
library(jsonlite)
response <- POST(cogapi,
                 add_headers(`Ocp-Apim-Subscription-Key` = cogapikey),
                 body = toJSON(mydata))

Now we need to consume our response such that we can add the language code to our existing data.frame. The structure of the response JSON doesn’t play well with others so I use data.table’s nifty rbindlist. It is a very good candidate for purrr but I’m not up to speed on that yet.
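To show the shape of that step without calling the API, here is a minimal sketch that flattens a nested response into a table and joins the language code back onto the phrases. The parsed list is hand-written, and its field names (documents, detectedLanguages, iso6391Name) are my assumption about what the parsed JSON looks like. Base R's do.call(rbind, ...) stands in for data.table's rbindlist so the example has no package dependencies.

```r
# Made-up phrases with ids (stand-in for the existing data.frame).
phrases <- data.frame(
  id   = c("1", "2"),
  text = c("I love this product", "Je n'aime pas ce film"),
  stringsAsFactors = FALSE
)

# Hand-written stand-in for the parsed JSON response (structure is assumed).
parsed <- list(
  documents = list(
    list(id = "1",
         detectedLanguages = list(
           list(name = "English", iso6391Name = "en", score = 1))),
    list(id = "2",
         detectedLanguages = list(
           list(name = "French", iso6391Name = "fr", score = 1)))
  )
)

# Turn each document into a one-row data frame, keeping its first detected language.
rows <- lapply(parsed$documents, function(doc) {
  top <- doc$detectedLanguages[[1]]
  data.frame(id = doc$id, language = top$iso6391Name,
             stringsAsFactors = FALSE)
})

# Stack the rows into one table (the role rbindlist plays in the post).
langs <- do.call(rbind, rows)

# Add the language code to the original phrases by id.
phrases <- merge(phrases, langs, by = "id")
print(phrases)
```

With data.table loaded, rbindlist(rows) would replace the do.call line and tends to be faster on large responses.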

Check out the whole post; Steph makes it look easy.


Building A Neural Net

Shirin Glander has a great post on using Spark + sparklyr + h2o + rsparkling to build a neural net to study arrhythmia of the heart:

The data I am using to demonstrate the building of neural nets is the arrhythmia dataset from UC Irvine’s machine learning database. It contains 279 features from ECG heart rhythm diagnostics and one output column. I am not going to rename the feature columns because they are too many and the descriptions are too complex. Also, we don’t need to know specifically which features we are looking at for building the models. For a description of each feature, see https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.names. The output column defines 16 classes: class 1 samples are from healthy ECGs, the remaining classes belong to different types of arrhythmia, with class 16 being all remaining arrhythmia cases that didn’t fit into distinct classes.

Very interesting post.


ggraph

David Smith has a post on a new R package to display graphs:

A graph, a collection of nodes connected by edges, is just data. Whether it’s a social network (where nodes are people, and edges are friend relationships), or a decision tree (where nodes are branch criteria or values, and edges decisions), the nature of the graph is easily represented in a data object. It might be represented as a matrix (where rows and columns are nodes, and elements mark whether an edge between them is present) or as a data frame (where each row is an edge, with columns representing the pair of connected nodes).

The trick comes in how you represent a graph visually; there are many different options, each with strengths and weaknesses when it comes to interpretation. A graph with many nodes and edges may become an unintelligible hairball without careful arrangement, and including directionality or other attributes of edges or nodes can reveal insights about the data that wouldn’t be apparent otherwise. There are many R packages for creating and displaying graphs (igraph is a popular one, and this CRAN task view lists many others) but that’s a problem in its own right: an important part of the data exploration process is trying and comparing different visualization options, and the myriad packages and interfaces make that process difficult for graph data.
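The two data representations David mentions are easy to try in base R. Here is a minimal sketch that converts an edge-list data frame into an adjacency matrix and back; the three-person friendship graph is made up, and it is treated as undirected.

```r
# Edge list: each row is one edge between two nodes (made-up friendships).
edges <- data.frame(
  from = c("Ann", "Ann", "Bob"),
  to   = c("Bob", "Cy",  "Cy"),
  stringsAsFactors = FALSE
)

nodes <- sort(unique(c(edges$from, edges$to)))

# Adjacency matrix: rows and columns are nodes, 1 marks an edge.
adj <- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(edges$from, edges$to)] <- 1
adj[cbind(edges$to, edges$from)] <- 1   # undirected: mirror each edge

# Back again: recover one row per edge from the upper triangle.
idx <- which(adj == 1 & upper.tri(adj), arr.ind = TRUE)
edges_back <- data.frame(
  from = nodes[idx[, "row"]],
  to   = nodes[idx[, "col"]],
  stringsAsFactors = FALSE
)

print(adj)
print(edges_back)
```

The matrix form grows with the square of the node count, so the edge list is usually the more practical choice for large, sparse graphs.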

Click through for more information as well as a mesmerizing animated image.
