Category: R

A Gamma distribution is useful for modeling positive, right-skewed data such as waiting times; it is a continuous distribution.
In this post, we’ll illustrate some properties of the Gamma distribution by simulating a toy example.

Click through for the example.
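
As a rough sketch of the kind of simulation the excerpt describes (not the linked post's actual code), the following draws from a Gamma distribution in base R and compares the sample moments with their theoretical values. The shape, rate, sample size, and seed are illustrative assumptions.

```r
# Simulate draws from a Gamma distribution and compare sample moments
# with the theoretical values (mean = shape / rate, variance = shape / rate^2).
set.seed(42)   # arbitrary seed, only for reproducibility
shape <- 2     # illustrative shape parameter (assumption)
rate  <- 0.5   # illustrative rate parameter (assumption)
n     <- 100000

draws <- rgamma(n, shape = shape, rate = rate)

theoretical_mean <- shape / rate     # 2 / 0.5 = 4
theoretical_var  <- shape / rate^2   # 2 / 0.25 = 8

sample_mean <- mean(draws)
sample_var  <- var(draws)

cat("mean:", sample_mean, "vs", theoretical_mean, "\n")
cat("var: ", sample_var, "vs", theoretical_var, "\n")

# Right skew shows up as a median below the mean.
cat("median:", median(draws), "\n")
```

With a sample this large, the sample mean and variance should land close to the theoretical values, and every draw is strictly positive.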

Comments closed

Data Visualization in R

Published 2020-06-19 by Kevin Feasel

Dan Fitton provides an introductory overview to several visualization tools in R:

The other way to communicate data with R is to produce an interactive dashboard or web application within R using Shiny. Whereas Markdown reports are most useful for explanatory analysis, Shiny, in my opinion, is useful for exploratory data analysis. This is when you want to display information for investigative purposes, allowing the user to gain greater familiarity by having the ability to interact with data, filter it, and dig deeper into the underlying details.
Shiny is incredibly flexible, providing the user the capability of turning their R code and objects, including tables, plots, and analysis, into a comprehensive and interactive web page or app, without requiring a fully-fledged web development skillset. Although there is a steep learning curve, the freedom and precision Shiny brings means that for the most part you are limited only by your skillset rather than the tool itself.

I’ve seen some really useful Shiny dashboards. Dan is right that there can be a lot of work put into getting them right, but if you do, the results can be outstanding.

Text Customization with ggtext

Published 2020-06-17 by Kevin Feasel

Abdul Majed Raja shows an example of using the ggtext library:

ggplot2 is the go-to R package for anyone who wants to make beautiful static visualizations in R. But most ggplot2 plots look almost the same, and few data analysts or data scientists care about customizing them, primarily because it’s not very intuitive to do so. That’s where ggplot2 extensions come in very handy. ggtext is an R package (by Claus O. Wilke) that helps in customizing the text present in ggplot2 plots. It could be the text outside the plot canvas or the text (annotation) within the plot canvas.

Click through for the code sample and video. H/T R-Bloggers.

The Basics of A/B Testing with R

Published 2020-06-15 by Kevin Feasel

Holger von Jouanne-Diedrich walks us through a simple example of A/B testing and analysis using R:

The bad news is that you have to understand a little bit about statistical hypothesis testing; the good news is that if you read the following post, you have everything you need (plus, as an added bonus, R has all the tools you need already at hand!): From Coin Tosses to p-Hacking: Make Statistics Significant Again! (ok, reading it would make it over one minute…).

Check out that article and the example in the blog post as well. R makes it really easy to perform this sort of analysis.
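
To give a flavor of how little code this takes, here is a minimal sketch using base R's prop.test(). The visitor and conversion counts are made up for illustration and are not taken from the linked post, which may use a different test.

```r
# Hypothetical A/B test counts (made up for illustration).
visitors    <- c(A = 1000, B = 1000)
conversions <- c(A = 100,  B = 130)

# Observed conversion rate per variant.
conversion_rates <- conversions / visitors
print(conversion_rates)

# Two-sample test for equality of proportions (stats package, base R).
result <- prop.test(x = conversions, n = visitors)
print(result$p.value)

# Conventional (but arbitrary) 5% significance threshold.
significant <- result$p.value < 0.05
print(significant)
```

The p-value only says how surprising the observed difference would be if both variants truly converted at the same rate; it does not say how large the real difference is.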

Obfuscating Data in SQL Server

Published 2020-06-15 by Kevin Feasel

Dave Mason has a data obfuscator:

In a previous post, I explored an option for generating fake data in SQL Server using Machine Learning Services and the R language. I’ve expanded on that by creating some stored procedures that can be used both for generating data sets of fake data and for obfuscating existing SQL Server data with fake data.
The code is available in a GitHub repository. For now, it consists of ten stored procedures.

Unlike something like Dynamic Data Masking, this is a permanent update to the table. That makes it quite helpful for getting production distributions and use cases into non-production environments.

Vectorized R I/O in Apache Spark 3.0

Published 2020-06-10 by Kevin Feasel

Hyukjin Kwon gives us a preview of SparkR improvements in Apache Spark 3.0:

When SparkR does not require interaction with the R process, the performance is virtually identical to other language APIs such as Scala, Java and Python. However, significant performance degradation happens when SparkR jobs interact with native R functions or data types.
Databricks Runtime introduced vectorization in SparkR to improve the performance of data I/O between Spark and R. We are excited to announce that, using the R APIs from Apache Arrow 0.15.1, the vectorization is now available in the upcoming Apache Spark 3.0 with substantial performance improvements.
This blog post outlines Spark and R interaction inside SparkR, the current native implementation and the vectorized implementation in SparkR with benchmark results.

Certain operations get ridiculously faster with this change.

Installing TensorFlow and Keras for R on SQL Server 2019 ML Services

Published 2020-06-09 by Kevin Feasel

I have a post on using TensorFlow and Keras in R on SQL Server 2019 Machine Learning Services:

What I’m doing is building a new virtual environment named r-reticulate, which is what the reticulate package in R desires. Inside that virtual environment, I’m installing the latest versions of tensorflow-probability, tensorflow, and keras. I had DLL loading problems with TensorFlow 2.1 on Windows, so if you run into those, the proper solution is to ensure that you have the appropriate Visual C++ redistributables installed on your server.
Then, I switched back to the base virtual environment and installed the same packages. My thinking here is that I’ll probably need them for other stuff as well (and don’t tell anybody, but I’m not very good with Python environments).

Please continue not to tell anybody that I’m not very good with Python environments. I tend to dump things in the base environment, forget which one I’m in, and all kinds of other bad practices. I think I’m secretly undermining myself in Python, but I don’t have enough proof yet.

Evolutionary Algorithms for Color Palette Discovery

Published 2020-06-08 by Kevin Feasel

Daniel Oehm combines two interests:

Colour theory is pretty complex stuff so choosing a good palette isn’t easy, let alone evolving one. So, you’re going to have some hits and some misses. This is definitely more for fun seeing what you discover rather than finding the perfect palette. Having said that you could discover some gold!
There are best practices when choosing a palette for data visualisation depending on the context and what is to be shown. For example, people tend to respond to certain colours representing high / low, hot / cold or good / bad; there are also colourblindness considerations. evoPalette won’t necessarily adhere to these ideals.

I’d like to see a genetic algorithms approach, though you’d have to define some sort of function to score each outcome, so I can see how that’d be tricky. H/T R-Bloggers.

Setting Up Your Own R Package Repository

Published 2020-05-26 by Kevin Feasel

Steve Belcher explains how to configure a custom package repository in your environment:

One of the strengths of the R language is the thousands of third-party packages that have been made publicly available via CRAN, the Comprehensive R Archive Network. R includes several functions that make it easy to download and install these packages. However, in many enterprise environments, access to the Internet is limited or non-existent. In such environments, it is useful to create a local package repository that users can access from within the corporate firewall.
Your local repository may contain source packages, binary packages, or both. If at least some of your users will be working on Windows systems, you should include Windows binaries in your repository. Windows binaries are R-version-specific; if you are running R 3.3.3, you need Windows binaries built under R 3.3. These versioned binaries are available from CRAN and other public repositories. If at least some of your users will be working on Linux systems, you must include source packages in your repository.

There are some tools which help out with this, so read the whole thing.
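
As a small illustration of the layout the excerpt describes, base R's contrib.url() shows where R expects source packages and version-specific Windows binaries to live under a repository root. The repository location below is a hypothetical placeholder, and the commented-out calls are only a sketch of the follow-up steps, not the linked article's procedure.

```r
# Hypothetical local repository root; replace it with a path or URL
# that actually exists in your environment.
repo_root <- "file:///srv/r-repo"

# contrib.url() (utils package) returns the directory R searches
# for each package type under a given repository root.
src_url <- contrib.url(repo_root, type = "source")
win_url <- contrib.url(repo_root, type = "win.binary")

print(src_url)  # ends in src/contrib
print(win_url)  # ends in bin/windows/contrib/<R major.minor>

# Once package files are copied into those directories, an index can be
# built with tools::write_PACKAGES() and the repository selected with
# options(repos = ...). Both are commented out here because they need
# real files on disk.
# tools::write_PACKAGES("/srv/r-repo/src/contrib", type = "source")
# options(repos = c(LOCAL = repo_root))
```

The Windows path ends in the major.minor version of the running R, which is why binaries built under one R minor version are not picked up by another.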

R 4.0 Improvements: stopifnot()

Published 2020-05-20 by Kevin Feasel

Bob Rudis looks at one of the R 4.0 changes hidden in the changelog:

R 4.0.0 has been out for a while, now, and — apart from a case where merge() was slower than dirt — it’s been really stable for at least me (I use it daily on macOS, Linux, and Windows). Sure, it came with some headline-grabbing features/upgrades, but I’ve started looking at what other useful nuggets might be in the changelog and decided to blog them as I find them.
Today’s nugget is the venerable stopifnot() function which was significantly enhanced by this PR by Neal Fultz.

Read on for a quality of life improvement with error handling in R.
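
For a quick taste, here is a minimal sketch of the R 4.0.0 behavior as I understand it from the release notes: when an argument to stopifnot() is named, the name is used as the error message if that condition fails. The function check_positive() is a hypothetical helper for illustration.

```r
# Requires R >= 4.0.0: a name given to a stopifnot() argument is used
# as the error message when that condition is not TRUE.
check_positive <- function(x) {
  stopifnot("x must be a positive number" = is.numeric(x) && x > 0)
  sqrt(x)
}

ok <- check_positive(9)   # condition holds, so sqrt(9) is returned
print(ok)

# Capture the error message instead of halting the script.
msg <- tryCatch(
  check_positive(-1),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Before 4.0.0, the failure message was generated from the deparsed expression, which tends to be less readable for end users.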
