Category: R

Useful R Packages for Data Scientists

Tomaz Kastrun has a nice collection of useful R packages for data scientists:

Thousands of R packages are available on CRAN (with all the mirror sites), on GitHub, and in individual developers’ repositories.

Many useful functions are available across different R packages, and the same functionality often appears in more than one, so the choice of a particular package boils down to user preference and the work at hand. From the perspective of a statistician and data scientist, I will cover the essential and major packages in sections. This is by no means a definitive list, only a personal preference.

Click through for Tomaz’s recommendations.

R Checkpoint Package Update Now in Beta

Hong Ooi announces that a revamp of the checkpoint package is now in beta:

Checkpoint has been around for nearly 6 years now, helping R users solve the reproducible research puzzle. In that time, it’s seen many changes, new features, and, inevitably, bug reports. Some of these bugs have been fixed, while others remain outstanding in the too-hard basket.

Many of these issues spring from the fact that it uses only base R functions, in particular install.packages, to do its work. The problem is that install.packages is meant for interactive use, and as an API, is very limited. For starters, it doesn’t return a result to the caller—instead, checkpoint has to capture and parse the printed output to determine whether the installation succeeded. This causes a host of problems, since the printout will vary based on how R is configured. Similarly, install.packages refuses to install a package if it’s in use, which means checkpoint must unload it first—an imperfect and error-prone process at best.

In addition to these, checkpoint’s age means that it has accumulated a significant amount of technical debt over the years. For example, there is still code to handle ancient versions of R that couldn’t use HTTPS, even though the MRAN site (in line with security best practice) now accepts HTTPS connections only.

Click through to see what’s in the new checkpoint package.

Time Series Forecasting Best Practices

David Smith talks about a new GitHub repo:

The repository includes detailed examples of various time series modeling techniques, as Jupyter Notebooks for Python, and R Markdown documents for R. It also includes Python notebooks to fit time series models in the Azure Machine Learning service, and then operationalize the forecasts as a web service.

The R examples demonstrate several techniques for forecasting time series, specifically data on refrigerated orange juice sales from 83 stores (sourced from the bayesm package). The forecasting techniques vary (mean forecasting with interpolation, ARIMA, exponential smoothing, and additive models), but all make extensive use of the tidyverts suite of packages, which provides “tidy time series forecasting for R”. The forecasting methods themselves are explained in detail in the book (readable online) Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos (Monash University).

This looks really cool.

Changing the Graphics Device in RMarkdown Docs

Colin Gillespie shows us how to change PDF and PNG output settings within knitr:

In many workflows, function calls to graphic devices are not explicit. Instead, the call is made by another package, such as knitr.

When knitting an R Markdown document, the default graphics device when creating PDF documents is grDevices::pdf() and for HTML documents it’s grDevices::png(). As we demonstrated, these are the worst possible choices!
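
For a sense of what changing the device looks like, here is a minimal sketch of a setup chunk that swaps in cairo-based devices through knitr's `dev` and `dev.args` chunk options. The choice of cairo is an assumption made for illustration, not necessarily what the linked post recommends, and it only works if your R build has cairo support.

```r
# Place this in the setup chunk of an R Markdown document.
# Assumption: R was built with cairo support; check with capabilities("cairo").
if (knitr::is_latex_output()) {
  # PDF output: use the cairo-based PDF device instead of grDevices::pdf()
  knitr::opts_chunk$set(dev = "cairo_pdf")
} else {
  # HTML and other output: keep PNG, but ask png() for its cairo backend
  knitr::opts_chunk$set(dev = "png", dev.args = list(type = "cairo"))
}
```

Setting the options once in the setup chunk applies them to every later chunk, and individual chunks can still override `dev` in their headers.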

Click through to see what you can do about it.

Tidy Simulation of Stochastic Processes in R

David Robinson shows off my favorite distribution:

The Riddler puzzle describes a Poisson process, which is one of the most important stochastic processes. A Poisson process models the intuitive concept of “an event is equally likely to happen at any moment.” It’s named because the number of events occurring in a time interval of length t is distributed according to Poisson(λt), for some rate parameter λ (for this puzzle, the rate is described as one per day, λ = 1).

How can we simulate a Poisson process? This is an important connection between distributions. The waiting time for the next event in a Poisson process has an exponential distribution, which can be simulated with rexp().
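
The connection between exponential waiting times and Poisson counts can be sketched in a few lines of base R. This is an illustrative version rather than the tidy approach from the linked post: the function name `simulate_poisson_process`, the ten-day horizon, and the seed are all assumptions chosen for the example.

```r
# Simulate event times of a Poisson process on [0, horizon] by
# accumulating exponential waiting times drawn with rexp().
# (Hypothetical helper, not taken from the original post.)
simulate_poisson_process <- function(rate, horizon) {
  arrivals <- numeric(0)
  now <- 0
  repeat {
    now <- now + rexp(1, rate = rate)  # waiting time until the next event
    if (now > horizon) break           # stop once we leave the window
    arrivals <- c(arrivals, now)
  }
  arrivals
}

set.seed(2020)  # arbitrary seed, only for reproducibility

# One realization: event times over 10 days at one event per day
events <- simulate_poisson_process(rate = 1, horizon = 10)

# Many realizations: the number of events per window should behave like
# a Poisson count, with mean and variance both near rate * horizon
counts <- replicate(5000, length(simulate_poisson_process(rate = 1, horizon = 10)))
mean(counts)
var(counts)
```

Comparing `mean(counts)` and `var(counts)` is a quick sanity check, since a Poisson distribution has equal mean and variance.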

Read on to learn about the Poisson distribution and Yule processes.

Saving Graphics in R Across Multiple OSes

Colin Gillespie takes us through exporting graphics in R and some of the cross-platform foibles you’ll find:

One of R’s outstanding features is that it is cross platform. You write R code and it magically works under Linux, Windows and Mac. Indeed, the above code “runs” under all three operating systems. But does it produce the same graphic under each platform? Spoiler! None of the above functions produce identical output across OS’s. So for “same”, I am going to take a lax view and I just want figures that look the same.
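
One way to narrow the gap between platforms is to request the same rendering backend everywhere. The sketch below uses a hypothetical helper, `save_png`, that asks `png()` for its cairo backend when the R build supports it and falls back to the platform default otherwise. Treat it as an assumption-laden illustration, not the linked post's recommendation.

```r
# Hypothetical helper: write a PNG using the cairo backend where available,
# so the same backend is used on Linux, Windows, and macOS.
save_png <- function(path, draw, width = 6, height = 4, res = 150) {
  if (capabilities("cairo")) {
    png(path, width = width, height = height, units = "in", res = res,
        type = "cairo")
  } else {
    # No cairo in this build: fall back to the platform's default backend
    png(path, width = width, height = height, units = "in", res = res)
  }
  on.exit(dev.off(), add = TRUE)  # always close the device, even on error
  draw()
  invisible(path)
}

# Usage: pass the plotting code as a function
out <- save_png(file.path(tempdir(), "scatter.png"), function() plot(cars))
file.exists(out)
```

Fixing the width, height, and resolution explicitly matters as much as the backend, since device defaults are another source of differences.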

Read on to understand the differences and hopefully limit confusion around them.

Logging in R

Himanshu Gupta walks us through the log4r package:

One of the most important aspects of an application is logging, since logs provide visibility into the behavior of a running app. Hence, logs play a vital role in the maintenance and enhancement of an application.

However, most of us are already aware of the importance of logging. That’s why we add logs to our applications. But one thing that we are not aware of is that the application should never be concerned with routing or storage of logs, i.e., it should not attempt to write to or manage logs or log files. Instead, each running process within the application writes logs to stdout. In a local environment, we can view the logs in the console, whereas in a staging/production environment, logs can be collated together in .log file(s).

Hence, in this blog post we will learn how to collect, customize, and standardize R logs using log4r. But first, let’s see what log4r is.
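
As a rough sketch of the idea, and not the configuration from the linked post, a console-only logger in log4r might look like the following. It assumes the log4r package is installed; the name `app_logger` and the messages are made up for the example.

```r
# Assumes the log4r package is installed: install.packages("log4r")
# Functions are called with the log4r:: prefix to avoid masking base::debug.

# A logger that writes to the console only, leaving routing and storage
# of the output to the environment that runs the process.
app_logger <- log4r::logger(
  threshold = "INFO",
  appenders = log4r::console_appender()
)

log4r::debug(app_logger, "Below the INFO threshold, so nothing is written.")
log4r::info(app_logger, "Model training started.")
log4r::warn(app_logger, "Missing values found in the input.")
log4r::error(app_logger, "Could not reach the database.")
```

Raising or lowering `threshold` controls which levels are emitted without touching the individual logging calls.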

Read on for a demonstration of log4r and some of the settings you can choose.

VARCHAR Columns and Bytecode Version Mismatch in R

Dave Mason runs through a tricky problem with SQL Server Machine Learning Services:

During my testing, I’ve found R handles CHAR and VARCHAR data within the input data set as long as the ASCII codes comprising the data are in the range from 0 to 127. This much is not surprising–those are the character codes for the ASCII table. Starting with character code 128, R begins having some trouble.
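
To make the 0 to 127 boundary concrete, here is a small base R sketch that flags strings containing any byte above 127. It does not reproduce the SQL Server behavior described in the linked post. The helper name `has_non_ascii` and the sample values are assumptions for illustration.

```r
# Hypothetical helper: TRUE for each string that contains at least one
# byte outside the 7-bit ASCII range (0 to 127).
has_non_ascii <- function(x) {
  vapply(
    x,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

values <- c("plain text", "caf\u00e9", "na\u00efve", "R2-D2")
has_non_ascii(values)
# [1] FALSE  TRUE  TRUE FALSE
```

A check like this can be run on character columns before passing them along, to find the rows most likely to cause trouble.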

Read on to see the problem. Dave’s advice at the end is sound (and frankly, my advice for any string data in SQL Server).
