
Category: R

Visualizing Stock Data With Lares

Bernardo Lares shows off what his lares package can do around visualizing time series data:

The overall idea of these functions is to visualize your stocks and portfolio’s performance with just a few lines of simple code. I’ve created individual functions for each of the calculations and plots, and some other functions that gather all of them into a single list of objects for further use.

On the other hand, the lares package is “my personal library used to automate and speed my everyday work on Analysis and Machine Learning tasks”. I am more than happy to share it with you for your personal use. Feel free to install, use, and comment on any of its code and functionalities and I’ll be happy to help you with it. I have previously shared other uses of the library in other posts which might also interest you: Visualizing ML Results (binary), Visualizing ML Results (continuous), and AutoML to understand datasets.

  • NOTE 1: The following post was written by someone who is neither an economist nor a professional investor. I am open to your comments and technical corrections if needed. Glad to learn as always!

  • NOTE 2: I will be using the less customizable functions in this post so we can focus more on the outputs than on the coding part; but once again, feel free to use the functions and dive into the library to understand or change them!

  • NOTE 3: All currency units are USD ($).

It does seem to be easy to use for this scenario.
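For a sense of the underlying mechanics, here is a minimal sketch using quantmod rather than lares itself (I would rather not guess at the lares function signatures); retrieving prices and charting them is the foundation that lares wraps into single-call reports:

library(quantmod)

# Pull daily prices for one ticker from Yahoo Finance
# (getSymbols assigns the result to a variable named AAPL)
getSymbols("AAPL", src = "yahoo", from = "2018-01-01")

# Quick price-and-volume chart of the series just loaded
chartSeries(AAPL, theme = chartTheme("white"))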


What’s New With Machine Learning Services

Niels Berglund looks at SQL Server 2019’s Machine Learning Services offering for updates:

So, when I read What’s new in SQL Server 2019, I came across a lot of interesting “stuff”, but one thing that stood out was Java language programmability extensions. In essence, it allows us to execute Java code in SQL Server by using a pre-built Java language extension! The way it works is as with R and Python; the code executes outside of the SQL Server engine, and you use sp_execute_external_script as the entry-point.

I haven’t had time to execute any Java code as of yet, but in the coming days, I definitely will drill into this. Something I noticed is that the architecture for SQL Server Machine Learning Services has changed (or had additions to it).

That Java support is for Spark, I’d imagine.  And I hope they allow for Scala.


Rayshader: 3D Surface Plotting In R

David Smith looks at an interesting package in R:

Tyler describes the rayshader package in a gorgeous blog post: his goal was to generate 3-D representations of landscape data that “looked like a paper weight”. (Incidentally, you can use this package to produce actual paper weights with 3-D printing.) To this end, he went beyond simply visualizing a 3-D surface in rgl and added a rectangular “base” to the surface as well as shadows cast by the geographic features. He also added support for detecting (or specifying) a water level: useful for representing lakes or oceans (like the map of the Monterey submarine canyon shown below) and for visualizing the effect of changing water levels like this animation of draining Lake Mead.

It looks great.
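If you want to try it yourself, here is a minimal sketch adapted from rayshader’s documented examples; montereybay is the demo elevation matrix that ships with the package, and the texture and zscale values are the documented ones for it:

library(rayshader)

# montereybay: a matrix of Monterey Bay elevation/bathymetry data
elmat <- montereybay

# Shade the surface, color in detected water, layer on raytraced
# shadows, then render the 3-D "paperweight" in an rgl window
elmat %>%
  sphere_shade(texture = "imhof1") %>%
  add_water(detect_water(elmat), color = "imhof1") %>%
  add_shadow(ray_shade(elmat)) %>%
  plot_3d(elmat, zscale = 50, water = TRUE, waterdepth = 0)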


Hadoop + SQL Server In 2019

Travis Wright shows off a big part of what the SQL Server team has been working on for the last couple of years:

SQL Server 2019 big data clusters provide a complete AI platform. Data can be easily ingested via Spark Streaming or traditional SQL inserts and stored in HDFS, relational tables, graph, or JSON/XML. Data can be prepared by using either Spark jobs or Transact-SQL (T-SQL) queries and fed into machine learning model training routines in either Spark or the SQL Server master instance using a variety of programming languages, including Java, Python, R, and Scala. The resulting models can then be operationalized in batch scoring jobs in Spark, in T-SQL stored procedures for real-time scoring, or encapsulated in REST API containers hosted in the big data cluster.

SQL Server big data clusters provide all the tools and systems to ingest, store, and prepare data for analysis as well as to train the machine learning models, store the models, and operationalize them.
Data can be ingested using Spark Streaming, by inserting data directly to HDFS through the HDFS API, or by inserting data into SQL Server through standard T-SQL insert queries. The data can be stored in files in HDFS, or partitioned and stored in data pools, or stored in the SQL Server master instance in tables, graph, or JSON/XML. Either T-SQL or Spark can be used to prepare data by running batch jobs to transform the data, aggregate it, or perform other data wrangling tasks.

Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use Spark. In either case, the full library of open-source machine learning libraries, such as TensorFlow or Caffe, can be used to train models.

Lastly, once the models are trained, they can be operationalized in the SQL Server master instance using real-time, native scoring via the PREDICT function in a stored procedure in the SQL Server master instance; or you can use batch scoring over the data in HDFS with Spark. Alternatively, using tools provided with the big data cluster, data engineers can easily wrap the model in a REST API and provision the API + model as a container on the big data cluster as a scoring microservice for easy integration into any application.

I’ve wanted Spark integration ever since 2016 and we’re going to get it.


Formatting Summary Tables In R

Laura Ellis shows us how to create formatted tables using the formattable package in R:

We are going to narrow down the data set to focus on 4 key health metrics. Specifically, the prevalence of obesity, tobacco use, cardiovascular disease and obesity. We are then going to select only the indicator name and yearly KPI value columns. Finally, we are going to make extra columns to display the 2011 to 2016 yearly average and the 2011 to 2016 metric improvements.

Tables are an area of data visualization that we tend to forget at our own peril.
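To see the mechanics, here is a minimal sketch with made-up numbers (not Laura’s data); color_tile() and formatter() are the formattable helpers doing the work:

library(formattable)

# Hypothetical summary values, purely for illustration
health <- data.frame(
  Indicator     = c("Obesity", "Tobacco Use", "Cardiovascular Disease"),
  Avg_2011_2016 = c(29.8, 18.1, 6.2),
  Improvement   = c(-1.2, 2.4, 0.3)
)

# Shade the averages by value and color improvements green or red by sign
formattable(health, list(
  Avg_2011_2016 = color_tile("white", "lightblue"),
  Improvement   = formatter("span",
    style = x ~ style(color = ifelse(x > 0, "green", "red")))
))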


Tol Color Schemes In R

Jason C. Fisher walks us through a color scheme generator based on Paul Tol’s research:

Choosing colors for a graphic is a bit like taking a trip down the rabbit hole, that is, it can take much longer than expected and be both fun and frustrating at the same time. Striking a balance between colors that look good to you and your audience is important. Keep in mind that color blindness affects many individuals throughout the world and it is incumbent on you to choose a color scheme that works in color-blind vision. Luckily there are a number of excellent R packages that address this very issue, such as the colorspace, RColorBrewer, and viridis packages. And because this is R, where diversity is king, why not offer one more function for creating color-blind-friendly palettes.

Let me introduce the GetTolColors function in the R-package inlmisc. This function generates a vector of colors from qualitative, diverging, and sequential color schemes by Paul Tol (2018). The original inspiration for developing this function came from Peter Carl’s blog post describing color schemes from an older issue of Paul Tol’s Technical Note (issue 2.2, released Dec. 2012). And the qualitative color schemes described in his blog post found their way into the ptol_pal function in the R-package ggthemes. My intent with this document is to exhibit the latest Tol color schemes (issue 3.0, released May 2018) and show that they are not only visually pleasing but also well thought out.

Read on for step-by-step instructions and to see some of the palettes.  The package authors have taken care in color design, so check it out.
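For a quick taste, here is a sketch along the lines of Jason’s examples; the scheme name comes from his post, so check ?GetTolColors for the current interface:

library(inlmisc)

# Seven colors from Tol's "bright" qualitative scheme
cols <- GetTolColors(n = 7, scheme = "bright")

# Eyeball the palette with a simple base-R plot
barplot(rep(1, 7), col = cols, border = NA, axes = FALSE,
        names.arg = seq_along(cols))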


Labeling Line Ends In ggplot2

Simon Jackson shows how you can use the secondary axis to label line endings in ggplot2:

Now we can use scale_y_*, with the argument sec.axis to create a second axis on the right, with numbers to be displayed at breaks, defined by our vector of line ends:

library(ggplot2)

# d and d_ends (the vector of line-end values) are defined earlier in the post
ggplot(d, aes(age, circumference, color = Tree)) +
  geom_line() +
  scale_y_continuous(sec.axis = sec_axis(~ ., breaks = d_ends))

This is good.  I’d really prefer to show the series labels instead of the values; that way it’d be possible to eliminate the legend altogether.
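Here is a rough sketch of that idea, assuming d is the built-in Orange data set (which matches the aesthetics above): compute each tree’s final observation, hand the tree names to sec_axis() as labels, and drop the legend.

library(dplyr)
library(ggplot2)

d <- Orange   # assumed: the post's d looks just like Orange

# One row per tree: its final age and circumference
d_end <- d %>%
  group_by(Tree) %>%
  filter(age == max(age)) %>%
  ungroup()

ggplot(d, aes(age, circumference, color = Tree)) +
  geom_line() +
  scale_y_continuous(sec.axis = sec_axis(~ .,
                                         breaks = d_end$circumference,
                                         labels = paste("Tree", d_end$Tree))) +
  guides(color = "none")

H/T R-Bloggers.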


R From The Year 2000

Colin Gillespie takes us down memory lane with some old, old code:

Last week I spent some time reminiscing about my PhD and looking through some old R code. This trip down memory lane led to some of my old R scripts that amazingly still run. My R scripts were fairly simple and just created a few graphs. However now that I’ve been programming in R for a while, with hindsight (and also things have changed), my original R code could be improved.

I wrote this code around April 2000. To put this into perspective,

  • R mailing list was started in 1997
  • R version 1.0 was released on Feb 29, 2000
  • The initial release of Git was in 2005
  • Twitter started in 2006
  • StackOverflow was launched in 2008

Basically, sharing code and getting help was much more tricky than today – so cut me some slack!

It’s a good sign when an arbitrary task becomes easier to understand as a language evolves.  And I’m glad they dumped the underscore assignment operator.


Reticulate: Python-R Interop

Adnan Fiaz walks us through an example of using the reticulate library to call Python from R:

So what exactly does reticulate do? Its goal is to facilitate interoperability between Python and R. It does this by embedding a Python session within the R session which enables you to call Python functionality from within R. I’m not going to go into the nitty gritty of how the package works here; RStudio have done a great job in providing some excellent documentation and a webinar. Instead I’ll show a few examples of the main functionality.

Just like R, the House of Python was built upon packages. Except in Python you don’t load functionality from a package through a call to library() but instead you import a module. reticulate mimics this behaviour and opens up all the goodness from the module that is imported.
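A minimal sketch of that behavior, assuming reticulate can find a Python installation with numpy available:

library(reticulate)

# import() exposes a Python module's contents to R via $
os <- import("os")
os$getcwd()

# With the default convert = TRUE, Python return values come
# back as ordinary R objects
np <- import("numpy")
np$mean(c(1, 2, 3, 4))   # 2.5, as an R numeric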

This is a good intro to a package which is already useful but I think will be even better over time as R & Python interoperability becomes the norm.  H/T R-Bloggers.


Working With Data Frames In R

Dave Mason has a couple of blog posts on data frames.  First, the basics:

Conceptually, a dataset is a grid or table of data elements. It consists of rows, which we specifically call “observations”, and of columns, which are called “variables”. (Observations may also be referred to as “instances”. Variables may also be referred to as “properties”.) The data frame in R is designed for data sets. As the R documentation tells us, data frames are “used as the fundamental data structure by most of R’s modeling software”.

The function we’ll be working with primarily in this post is the data.frame() function. I have read that in R programming, creating data frames with this function is rather uncommon. Most of the time, data frames are created by invoking other functions that read data from an external data source (like a file or a database table) with a data frame as the return type. But for simplicity, data.frame() will serve our purposes.
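For example, a small hand-built data frame (my own toy data, not Dave’s):

# Three observations (rows) of three variables (columns)
patients <- data.frame(
  Name   = c("Ada", "Ben", "Cam"),
  Age    = c(34L, 28L, 41L),
  Smoker = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE   # strings defaulted to factors before R 4.0
)

str(patients)   # 'data.frame': 3 obs. of 3 variables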

Then, subsetting data frames:

Adding columns to a data frame is easy, at least compared to adding rows. We’ll get to that. To add a column, first create a vector. The class doesn’t matter. But the number of elements does: it has to match the number of observations in the data frame. Now that we have our vector, here are some options to add it as a new column to a data frame: use the $ shortcut, use double brackets with the new column name, or bind the vector to the data frame with cbind().
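A quick sketch of those three options, continuing with the toy patients data frame from above:

# Each new vector's length must match the number of observations (three)
patients$Height <- c(5.4, 6.1, 5.9)                        # the $ shortcut
patients[["Weight"]] <- c(152, 180, 166)                   # double brackets
patients <- cbind(patients, Eyes = c("brown", "blue", "green"))  # cbind()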

The data frame (or tibble, if using the tidyverse version) is probably the single most important data type in R for getting work done.
