Category: R

Tidy Data Is Normalized Data

Published 2018-02-16 by Kevin Feasel

I emphasize the link between a tidy dataframe and a normalized data structure:

The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data. Specifically, Wickham argues that tidy data should achieve third normal form.

Now, in practice, Wickham argues, we tend to need to denormalize data because analytics tools prefer having everything connected together, but the way we denormalize still retains a fairly normal structure: we still treat observations and variables like we would in a normalized data structure, so we don’t try to pack multiple observations in the same row or multiple variables in the same column, reuse a column for multiple purposes, etc.

I had an inkling of this early on and figured I was onto something clever until I picked up Wickham’s vignette and read that yeah, that’s exactly the intent.

Comments closed

Radar Charts With ggplot2

Published 2018-02-15 by Kevin Feasel

I have wrapped up my ggplot2 series, with the last post being on radar charts:

First, we need to install ggradar and load our relevant libraries. Then, I create a quick standardization function which divides our variable by the max value of that variable in the vector. It doesn’t handle niceties like divide by 0, but we won’t have any zero values in our data frames.

The radar_data data frame starts out simple: build up some stats by continent. Then I call the mutate_each_ function to call standardize for each variable in the vars set. mutate_each_is deprecated and I should use something different like mutate_at, but this does work in the current version of ggplot2 at least.

Finally, I call the ggradar() function. This function has a large number of parameters, but the only one you absolutely need is plot.data. I decided to change the sizes because by default, it doesn’t display well at all on Windows.

It was a lot of fun putting this series together. I think the most important part of the series was learning just how easy ggplot2 is once you sit down and think about it in a systemic manner.

Comments closed

Creating Modal Dialogues In Shiny

Published 2018-02-14 by Kevin Feasel

Dean Attali announces a new shiny package:

shinyalert uses the sweetalert JavaScript library to create simple and elegant modals in Shiny. Modals can contain text, images, OK/Cancel buttons, an input to get a response from the user, and many more customizable options. A modal can also have a timer to close automatically.

Simply call shinyalert() with the desired arguments, such as a title and text, and a modal will show up. In order to be able to call shinyalert() in a Shiny app, you must first call useShinyalert() anywhere in the app’s UI.

It does look nice. Check out Dean’s GitHub repo for more information. H/T R-Bloggers

Comments closed

Visualizing Cholesterol Data With ggplot2

Published 2018-02-14 by Kevin Feasel

Anisa Dhana uses the National Health and Nutrition Examination Survey and visualizes results with ggplot2:

From the plots above I find that regardless the different levels of diastolic and systolic blood pressure there is no substantial correlation between cholesterol and blood pressure. However, it is better to build the correlation line with geom_smooth or to calculate the Spearman correlation, although in this post we focus only on the visualization.

Lets build the correlation line.

Click through for several examples of visuals.

Comments closed

Microsoft + R

Published 2018-02-14 by Kevin Feasel

David Smith points out a bunch of the ways that Microsoft integrates R into products:

You can call R from within some data oriented Microsoft products, and apply R functions (from base R, from packages, or R functions you’ve written) to the data they contain.

SQL Server (the database) allows you to call R from SQL, or publish R functions to a SQL Server for database adminstrators to use from SQL.
Power BI (the reporting and visualization tool) allows you to call R functions to process data, create graphics, or apply statistical models to data.
Visual Studio (the integrated development environment) includes R as a fully-supported language with syntax highlighting, debugging, etc.
R is supported in various cloud-based services in Azure, including the Data Science Virtual Machine and Azure Machine Learning Studio. You can also publish R functions to Azure with the AzureML package, and then call those R functions from applications like Excel or apps you write yourself.

They’re pretty well invested in both R and Python, which is a good thing.

Comments closed

Using cowplot With ggplot2

Published 2018-02-14 by Kevin Feasel

I have a post on extending ggplot2’s functionality with cowplot:

Notice that I used geom_path(). This is a geom I did not cover earlier in the series. It’s not a common geom, though it does show up in charts like this where we want to display data for three variables. The geom_line() geom follows the basic rules for a line: that the variable on the y axis is a function of the variable on the x axis, which means that for each element of the domain, there is one and only one corresponding element of the range (and I have a middle school algebra teacher who would be very happy right now that I still remember the definition she drilled into our heads all those years ago).

But when you have two variables which change over time, there’s no guarantee that this will be the case, and that’s where geom_path() comes in. The geom_path() geom does not plot y based on sequential x values, but instead plots values according to a third variable. The trick is, though, that we don’t define this third variable—it’s implicit in the data set order. In our case, our data frame comes in ordered by year, but we could decide to order by, for example, life expectancy by setting data = arrange(global_avg, m_lifeExp). Note that in a scenario like these global numbers, geom_line() and geom_path() produce the same output because we’ve seen consistent improvements in both GDP per capita and life expectancy over the 55-year data set. So let’s look at a place where that’s not true.

The cowplot library gives you an easier way of linking together different plots of different sizes in a couple lines of code, which is much easier than using ggplot2 by itself.

Comments closed

Simple Data Transformation Tricks In R

Published 2018-02-13 by Kevin Feasel

Abdul Majed Raja has a few tidyverse-friendly data transformation tips:

Splitting a column to many columns is a cliched Data Transformation case that’s hardly unseen while performing Data Transformation. While it’s straightforward to do this in Microsoft Excel, it’s slightly tricky using Data analytics languages. That is true until this function separate() from tidyr came.

These are small but helpful tips.

Comments closed

Fun With Random Walks

Published 2018-02-12 by Kevin Feasel

Emrah Mete simulates a random walk in R:

Let’s consider a game where a gambler is likely to win $1 with a probability of p and lose $1 with a probability of 1-p.

Now, let’s consider a game where a gambler is likely to win $1 and lose $1 with a probability of 1. The player starts the game with X dollars in hand. The player plays the game until the money in his hand reaches N (N> X) or he has no money left. What is the probability that the player will reach the target value? (We know that the player will not leave the game until he reaches the N value he wants to win.)

The problem of the story above is known in literature as Gambler’s Ruin or Random Walk. In this article, I will simulate this problem with R with different settings and examine how the game results change with different settings.

Click through for the script and analysis. There’s a reason they call this game the gambler’s ruin.

Comments closed

DataExplorer

Published 2018-02-12 by Kevin Feasel

Boxuan Cui introduces DataExplorer, an R package dedicated to assist with exploratory data analysis:

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. Of all the resources out there, DataExplorer is one of them, with its sole mission to minimize the 80%, and make it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible enough with input data classes, so you should be able to throw in any data.frame-like objects. However, certain functions require a data.table class object as input due to the update-by-reference feature, which I will cover in later part of the post.

For my money, that number is closer to 90%. I will have to check this package out.

Comments closed

Using Power BI To Pass Parameterized Values To R

Published 2018-02-12 by Kevin Feasel

Stacia Varga shows how you can parameterize your R scripts within Power BI:

It’s not difficult, but the cool thing about Power BI is that I can use parameters to dynamically change the report visualization without opening up the script. To do this:

Open the Query Editor in Power BI
Click Manage Parameters, and then click New Parameter.
Set the parameter properties – Name, Type, and Current Value.The Name is how I will reference the parameter my R script, the Type is the data type, and Current Value is the initial value that I want to set (if any).

Click through for an example and more details.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31