R – Page 67 – Curated SQL

One of R’s outstanding features is that it is cross-platform. You write R code and it magically works under Linux, Windows and Mac. Indeed, the above code “runs” under all three operating systems. But does it produce the same graphic under each platform? Spoiler! None of the above functions produce identical output across OSes. So for “same”, I’m going to take a lax view: I just want figures that look the same.

Read on to understand the differences and hopefully limit confusion around them.
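
One way to reduce platform differences when saving a figure is to fix the size and resolution explicitly and to request the Cairo-based PNG device when the R build supports it. This is a minimal sketch under that assumption, not necessarily the approach the post settles on; the file name and the plot are illustrative.

```r
# Write a PNG with explicit dimensions, preferring the Cairo backend when available.
out <- file.path(tempdir(), "scatter.png")  # illustrative output path

if (capabilities("cairo")) {
  png(out, width = 6, height = 4, units = "in", res = 150, type = "cairo")
} else {
  # Fall back to the platform default device if this build lacks Cairo support
  png(out, width = 6, height = 4, units = "in", res = 150)
}
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
invisible(dev.off())  # close the device so the file is flushed to disk
```

Even with the same device type, fonts and anti-aliasing can still differ between machines, so treat the result as "close" rather than identical.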

Logging in R

Published 2020-04-07 by Kevin Feasel

Himanshu Gupta walks us through the log4r package:

One of the most important aspects of an application is logging, since logs provide visibility into the behavior of a running app. Hence, logs play a vital role in the maintenance and enhancement of an application.
However, most of us are already aware of the importance of logging. That’s why we add logs to our applications. But one thing that we may not be aware of is that the application should never be concerned with the routing or storage of logs, i.e., it should not attempt to write to or manage logs or log files. Instead, each running process within the application writes its logs to stdout. In a local environment, we can view the logs in the console, whereas in a staging/production environment, logs can be collated together in .log file(s).
Hence, in this blog post we will learn how to collect, customize, and standardize R logs using log4r. But first, let’s see what log4r is.

Read on for a demonstration of log4r and some of the settings you can choose.
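
To make the "write to stdout and let the environment route it" idea concrete, here is a minimal base R sketch. It is a hand-rolled illustration, not the log4r API; `make_logger`, `log_levels`, and `log_msg` are hypothetical names chosen for this example.

```r
# Hand-rolled stdout logger (illustrative only; NOT the log4r API).
log_levels <- c(DEBUG = 1L, INFO = 2L, WARN = 3L, ERROR = 4L)

make_logger <- function(threshold = "INFO") {
  stopifnot(threshold %in% names(log_levels))
  function(level, msg) {
    # Skip messages below the configured threshold
    if (log_levels[[level]] < log_levels[[threshold]]) {
      return(invisible(NULL))
    }
    line <- sprintf("%s [%s] %s",
                    format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, msg)
    # Write to stdout only; routing and storage are left to the environment
    cat(line, "\n", sep = "", file = stdout())
    invisible(line)
  }
}

log_msg <- make_logger("INFO")
log_msg("DEBUG", "Not printed: below the INFO threshold")
log_msg("INFO", "Application started")
log_msg("ERROR", "Something went wrong")
```

A dedicated package such as log4r adds configurable appenders and layouts on top of this basic pattern.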

VARCHAR Columns and Bytecode Version Mismatch in R

Published 2020-04-06 by Kevin Feasel

Dave Mason runs through a tricky problem with SQL Server Machine Learning Services:

During my testing, I’ve found R handles CHAR and VARCHAR data within the input data set as long as the ASCII codes comprising the data are in the range from 0 to 127. This much is not surprising–those are the character codes for the ASCII table. Starting with character code 128, R begins having some trouble.

Read on to see the problem. Dave’s advice at the end is sound (and frankly, my advice for any string data in SQL Server).
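
The general issue with bytes above 127 can be seen in plain R, outside SQL Server. The sketch below is not a reproduction of Dave’s scenario. It assumes a single-byte Latin-1 encoding for the input and shows that such bytes are not valid UTF-8 until they are converted.

```r
# Bytes for "cafe" with an acute accent on the e, encoded as Latin-1.
# The final byte 0xE9 (decimal 233) is outside the 0-127 ASCII range.
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xE9))
x <- rawToChar(latin1_bytes)

validUTF8(x)                                   # FALSE: a lone 0xE9 is not valid UTF-8

# Declare the source encoding and convert explicitly
y <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(y)                                   # TRUE
charToRaw(y)                                   # 63 61 66 c3 a9
```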

Generating Random Numbers with R

Published 2020-04-03 by Kevin Feasel

The folks at Data Sharkie walk us through random number generation in R:

Why is random number generation important and where is it used?
Random number generation has applications in various fields like statistical sampling, simulation, test design, and so on. Generally, when a data scientist is in need of a set of random numbers, they will have in mind
The R programming language allows users to generate randomly distributed numbers with a set of built-in functions: runif(), rnorm(), rbinom().

Read on to generate random numbers across two separate distributions.
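
As a quick sketch of the three built-in functions named in the excerpt, the following draws five values from each. The seed and the distribution parameters are arbitrary choices for illustration.

```r
set.seed(123)  # fix the seed so the draws are reproducible

u <- runif(5, min = 0, max = 1)        # uniform on [0, 1]
z <- rnorm(5, mean = 0, sd = 1)        # standard normal
b <- rbinom(5, size = 10, prob = 0.5)  # successes in 10 fair coin flips

print(u)
print(z)
print(b)
```

Calling `set.seed()` with the same value before each run reproduces the same sequence, which is useful for tests and reproducible analyses.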

Visualizing a Single Variable in R

Published 2020-04-02 by Kevin Feasel

Michaelino Mervisiano takes us through the types of visuals we can create to understand a single variable in R:

How do we create a histogram in R? And what information can we get from a histogram?
A histogram shows a frequency distribution. It is a great graph for showing the mode, the spread, and the symmetry (skewness) of your data. Here is a histogram of 1,000 random points drawn from a normal distribution with a mean of 2.5.

Of course I don’t like option number 4 and would replace it with something else (column/bar charts, Cleveland dot plots, or stacked column/bar depending on what you’re trying to observe). But this is a good way of thinking about how you can visualize a variable.
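
The histogram described in the excerpt can be approximated with base R as follows. The excerpt gives only the mean, so the standard deviation of 1 (R’s default) and the seed are assumptions.

```r
set.seed(1)                    # arbitrary seed for reproducibility
x <- rnorm(1000, mean = 2.5)   # sd defaults to 1 (assumed; not stated in the excerpt)

# hist() draws the plot and also returns the bin breaks and counts
h <- hist(x,
          main = "1,000 draws from a normal distribution (mean 2.5)",
          xlab = "Value")

h$breaks[which.max(h$counts)]  # left edge of the tallest bin, a rough view of the mode
```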

Tuning Random Forest HyperParameters with R

Published 2020-03-30 by Kevin Feasel

Julia Silge gives us an idea of how to tune random forest hyperparameters in R:

Our modeling goal here is to predict the legal status of the trees in San Francisco in the #TidyTuesday dataset. This isn’t this week’s dataset, but it’s one I have been wanting to return to. Because it seems almost wrong not to, we’ll be using a random forest model! 🌳
Let’s build a model to predict which trees are maintained by the San Francisco Department of Public Works and which are not. We can use parse_number() to get a rough estimate of the size of the plot from the plot_size column. Instead of trying any imputation, we will just keep observations with no NA values.

Click through to some data exploration, the initial model, and a process for using Grid Search with the caret package.

Using the Tune Package in R for Hyperparameter Optimization

Published 2020-03-30 by Kevin Feasel

Abderrahim Lyoubi-Idrissi takes us through a Bayesian approach to tune hyperparameters:

In contrast to the model parameters, which are discovered by the learning algorithm of the ML model, the so-called hyperparameters (HP) are not learned during the modeling process, but specified prior to training.
Hyperparameter tuning is the task of finding the optimal hyperparameter(s) for a learning algorithm on a specific data set, ultimately to improve the model’s performance.

Abderrahim contrasts two different methods here: Grid Search and Bayesian Optimization. Definitely an interesting read if you develop data science models.
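
For a feel of what grid search does, here is a minimal base R sketch. It does not use the tune package and is not the post’s code: it fits a `loess` smoother on synthetic data and tries every combination of two hyperparameters (`span` and `degree`), scoring each on a held-out set. The data, grid values, and split sizes are all illustrative assumptions. Bayesian optimization is not shown.

```r
set.seed(42)

# Synthetic data: a noisy sine curve
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Simple train / validation split
train_idx <- sample(n, 150)
train <- dat[train_idx, ]
valid <- dat[-train_idx, ]

# Every combination of the candidate hyperparameter values
grid <- expand.grid(span = c(0.2, 0.5, 0.8), degree = c(1, 2))
grid$rmse <- NA_real_

for (i in seq_len(nrow(grid))) {
  fit <- loess(y ~ x, data = train,
               span = grid$span[i], degree = grid$degree[i],
               # "direct" lets predict() handle points outside the training range
               control = loess.control(surface = "direct"))
  pred <- predict(fit, newdata = valid)
  grid$rmse[i] <- sqrt(mean((valid$y - pred)^2))
}

best <- grid[which.min(grid$rmse), ]
print(grid)
print(best)
```

Grid search evaluates all six combinations regardless of how earlier ones scored. A Bayesian approach would instead use earlier results to decide which combination to try next.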

Removing Rows with Missing Values in R

Published 2020-03-25 by Kevin Feasel

The folks at Data Sharkie show off one of my favorite tricks for removing observations with missing data in R:

In this article we will learn how to remove rows with NA from a data frame in R. We will walk through a complete tutorial on how to treat missing values using the complete.cases() function in R.

The SQL equivalent to this is much lengthier.
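
A minimal sketch of the trick, using a small made-up data frame (the column names and values are illustrative):

```r
# Toy data frame with missing values in two columns
df <- data.frame(
  id     = 1:5,
  height = c(170, NA, 165, 180, NA),
  weight = c(65, 72, NA, 80, 58)
)

# complete.cases() returns TRUE for rows with no NA in any column
keep <- complete.cases(df)
clean <- df[keep, ]

print(clean)  # only the rows with id 1 and 4 remain
```

In SQL, the same filter typically needs an explicit `IS NOT NULL` check for every column, which is why it grows with the width of the table.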

Faster Package Installation in R

Published 2020-03-24 by Kevin Feasel

Colin Gillespie has a few tips for making package installation in R a bit faster:

The bigger picture is that package installation time is starting to become more of an issue for a number of reasons. For example, packages are getting larger and more complex (tidyverse and friends), so installation just takes longer. Or we are using more continuous integration strategies such as Travis or GitLab-CI, and want quick feedback. Or we are simply updating a large number of packages via update.packages(). This is a problem we often solve for our clients – optimising their CI/CD pipelines.
The purpose of this blog post is to pull together a few different methods for tackling this problem.

Click through for the guidance.
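
One commonly used speed-up, which may or may not match the post’s exact list, is to let `install.packages()` and `update.packages()` build several source packages in parallel through the `Ncpus` option. The sketch below only sets the option; the choice of "all cores but one" is an assumption, and the install call is left commented out.

```r
# Number of cores available; detectCores() can return NA on some systems
cores <- parallel::detectCores()
if (is.na(cores)) cores <- 1L

# Leave one core free for the rest of the system (an arbitrary choice)
options(Ncpus = max(1L, cores - 1L))

getOption("Ncpus")

# Subsequent installs of multiple source packages can now run in parallel:
# install.packages(c("dplyr", "ggplot2"))
```

Putting the `options()` call in `.Rprofile` makes the setting apply to every session.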

Color Palettes in R

Published 2020-03-23 by Kevin Feasel

Paul van der Laken talks to us about paletteer:

I often cover tools to pick color palettes on my website (e.g. here, here, or here) and also host a comprehensive list of color packages in my R programming resources overview.
However, paletteer is by far my favorite package for customizing your colors in R!
The paletteer package offers direct access to 1759 color palettes, from 50 different packages!

Just make sure to run your graphics through something like Coblis afterward to ensure that they’re CVD-friendly. H/T R-Bloggers.
