Press "Enter" to skip to content

Category: R

Launching A Sparklyr Cluster

David Smith shows how to launch a sparklyr cluster in Azure:

When you’re finished, shut down your cluster using the aztk spark cluster delete command. (While you can delete the nodes from the Pools view in the Azure portal, the command does some additional cleanup for you.) You’ll be charged for each node in the cluster at the usual VM rates for as long as the cluster is provisioned. (One cost-saving option is to use low-priority VMs for the nodes, for savings of up to 90% compared to the usual rates.)

That’s it! Once you get used to it, it’s all quick and easy; the longest part is waiting for the cluster to spin up in Step 5. This is just a summary; for the full details, see the guide SparklyR on Azure with AZTK.
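
If you want to script that cleanup step from R rather than a shell, the CLI call can be wrapped in system2(). A minimal sketch, assuming the aztk CLI is installed and configured, with a hypothetical cluster id:

```r
# Hypothetical cluster id; assumes the aztk CLI is installed and configured
system2("aztk", c("spark", "cluster", "delete", "--id", "my-sparklyr-cluster"))
```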

It’ll take a bit more than five minutes to get started, but it is a good sight easier than building the servers yourself.

Web Analytics With R

Maëlle Salmon performs some analysis on the Locke Data blog:

Often, the URL of a blog post can be guessed based on its title, e.g. this one can be read here. But even if the transition from the Markdown file information to a URL is logical, it was best to get URLs from the in situ blog posts and then join them to the blog post information collected previously, since some special characters got special treatment that I could not fully understand by looking at the blogdown source code.

I first extracted all post URLs from the site map.
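
As a rough sketch of that extraction step (the sitemap URL below is an assumption, not necessarily the one Maëlle used), the xml2 package makes short work of a sitemap:

```r
library(xml2)

# Hypothetical sitemap location; blogdown sites typically expose one at /sitemap.xml
sitemap <- read_xml("https://itsalocke.com/sitemap.xml")

# The <loc> elements hold the URLs; local-name() sidesteps the sitemap namespace
urls <- xml_text(xml_find_all(sitemap, "//*[local-name() = 'loc']"))
head(urls)
```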

Check it out.

Using stringr To Remove HTML

I have a quick post on removing HTML markup with stringr:

This is a quick post today on removing HTML tags using the stringr package in R.

My purpose here is to take some raw data, which can include HTML markup, and prepare it for a vectorizer.  I don’t need the resulting output to look pretty; I just want to get rid of the HTML characters.
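
The sledgehammer in question looks something like this sketch: a regular expression that drops anything resembling a tag, with no attempt at real HTML parsing.

```r
library(stringr)

# Strip anything that looks like an HTML tag, then collapse leftover whitespace.
# Good enough for pre-processing before vectorization; not a real HTML parser.
remove_html <- function(x) {
  str_squish(str_replace_all(x, "<[^>]+>", " "))
}

remove_html("<p>This is <b>bold</b> text.</p>")
# [1] "This is bold text."
```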

Click through for the script.  If you need to do something nice with the text afterward, my technique is probably too much sledgehammer for niceties, but it does the trick for pre-processing before vectorization.

Loops Versus Apply: Speed Comparison

Mike Spencer compares lapply (in both its single-core and multi-core versions) against a for loop in R:

But how fast were they? Can we get faster? Thankfully R provides `system.time()` for timing code execution. In order to get faster, it makes sense to use all the processing power our machines have. The ‘parallel’ library has some great tools to help us run our jobs in parallel and take advantage of multicore processing. My favourite is `mclapply()`, because it is very, very easy to take an `lapply` and make it multicore. Note that `mclapply()` doesn’t work on Windows. The following script runs the `read_clean_write()` function in a for loop (boo, hiss), `lapply` and `mclapply`. I’ve run these as list elements to make life easier later on.
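
To give a flavor of the comparison, here is a minimal sketch with a throwaway function standing in for Mike’s read_clean_write():

```r
library(parallel)

# Hypothetical stand-in for Mike's read_clean_write() function
slow_task <- function(i) { Sys.sleep(0.01); i^2 }
n <- 200

# The for loop (boo, hiss)
t_for <- system.time({
  res <- vector("list", n)
  for (i in seq_len(n)) res[[i]] <- slow_task(i)
})

# lapply, single core
t_lapply <- system.time(lapply(seq_len(n), slow_task))

# mclapply across 4 cores; it forks, so it errors on Windows unless mc.cores = 1
t_mclapply <- system.time(mclapply(seq_len(n), slow_task, mc.cores = 4))

rbind(t_for, t_lapply, t_mclapply)
```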

It’s interesting reading, particularly because I had expected lapply to do a little bit better.  Also interesting is the relative overhead cost of mclapply in this scenario:  going from 1 core to 4 cut the time to approximately 1/3, not 1/4.

The Basics Of PCA In R

Prashant Shekhar gives us an overview of Principal Component Analysis using R:

PCA rotates the axis toward the direction of maximum variance and then takes the projection on this new axis. The direction of maximum variance is represented by the first principal component (PC1). There are multiple principal components, depending on the number of dimensions (features) in the dataset, and they are orthogonal to each other. The maximum number of principal components is the same as the number of dimensions of the data. For example, in the figure above, for two-dimensional data, there will be at most two principal components (PC1 & PC2). The first principal component explains most of the variance, followed by the second principal component, the third principal component, and so on. Dimension reduction comes from the fact that it is possible to discard the last few principal components, as they will not capture much variance in the data.
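
In base R, prcomp() covers the mechanics described above; a quick illustration on the iris measurements:

```r
# PCA on the four numeric iris columns; scaling puts variables on equal footing
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # the data projected onto PC1 and PC2
```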

PCA is a useful technique for reducing dimensionality and removing covariance.

Tidy Data Is Normalized Data

I emphasize the link between a tidy dataframe and a normalized data structure:

The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data.  Specifically, Wickham argues that tidy data should achieve third normal form.

Now, in practice, Wickham argues, we tend to need to denormalize data because analytics tools prefer having everything connected together. Even so, the way we denormalize still retains a fairly normal structure: we treat observations and variables as we would in a normalized data structure, so we don’t try to pack multiple observations into the same row, put multiple variables in the same column, reuse a column for multiple purposes, etc.
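
As a small illustration of that idea (with made-up data), untidy data packs one variable across several column headers, and tidyr::pivot_longer() restores one observation per row:

```r
library(tidyr)

# Untidy: the "year" variable is spread across column headers
untidy <- data.frame(country = c("A", "B"),
                     `1999`  = c(0.7, 2.0),
                     `2000`  = c(0.4, 2.3),
                     check.names = FALSE)

# Tidy: one observation per row, one variable per column
pivot_longer(untidy, -country, names_to = "year", values_to = "rate")
```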

I had an inkling of this early on and figured I was onto something clever until I picked up Wickham’s vignette and read that yeah, that’s exactly the intent.

Radar Charts With ggplot2

I have wrapped up my ggplot2 series, with the last post being on radar charts:

First, we need to install ggradar and load our relevant libraries. Then, I create a quick standardization function, which divides a variable by its maximum value in the vector. It doesn’t handle niceties like division by 0, but we won’t have any zero values in our data frames.

The radar_data data frame starts out simple: build up some stats by continent. Then I call the mutate_each_ function to call standardize for each variable in the vars set. mutate_each_ is deprecated and I should use something different like mutate_at, but it does still work in the current version of dplyr at least.

Finally, I call the ggradar() function. This function has a large number of parameters, but the only one you absolutely need is plot.data. I decided to change the sizes because by default, it doesn’t display well at all on Windows.
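
Pulling those steps together, a self-contained sketch (mtcars stands in for the by-continent stats, and mutate_at() replaces the deprecated mutate_each_()):

```r
library(dplyr)
library(ggradar)  # remotes::install_github("ricardo-bion/ggradar")

# Divide each variable by its max value; no divide-by-zero handling, per the post
standardize <- function(x) x / max(x)

# mtcars stands in for the stats-by-continent frame
radar_data <- mtcars %>%
  group_by(group = factor(cyl)) %>%
  summarise(mpg = mean(mpg), hp = mean(hp), wt = mean(wt)) %>%
  mutate_at(vars(-group), standardize)

ggradar(radar_data)
```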

It was a lot of fun putting this series together. I think the most important part of the series was learning just how easy ggplot2 is once you sit down and think about it in a systematic manner.

Creating Modal Dialogues In Shiny

Dean Attali announces a new shiny package:

shinyalert uses the sweetalert JavaScript library to create simple and elegant modals in Shiny. Modals can contain text, images, OK/Cancel buttons, an input to get a response from the user, and many more customizable options. A modal can also have a timer to close automatically.

Simply call shinyalert() with the desired arguments, such as a title and text, and a modal will show up. In order to be able to call shinyalert() in a Shiny app, you must first call useShinyalert() anywhere in the app’s UI.
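
A minimal app following that pattern (the button id and messages here are mine, not from the announcement):

```r
library(shiny)
library(shinyalert)

ui <- fluidPage(
  useShinyalert(),  # set up shinyalert; must appear somewhere in the UI
  actionButton("show", "Show a modal")
)

server <- function(input, output, session) {
  observeEvent(input$show, {
    shinyalert(title = "Hello", text = "A simple shinyalert modal.")
  })
}

shinyApp(ui, server)
```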

It does look nice.  Check out Dean’s GitHub repo for more information. H/T R-Bloggers

Visualizing Cholesterol Data With ggplot2

Anisa Dhana uses the National Health and Nutrition Examination Survey and visualizes results with ggplot2:

From the plots above, I find that, regardless of the different levels of diastolic and systolic blood pressure, there is no substantial correlation between cholesterol and blood pressure. However, it is better to build the correlation line with geom_smooth or to calculate the Spearman correlation, although in this post we focus only on the visualization.

Let’s build the correlation line.
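
As a sketch of that step, with simulated data standing in for the NHANES variables:

```r
library(ggplot2)

# Simulated stand-ins for the NHANES cholesterol and blood pressure measures
set.seed(42)
df <- data.frame(chol = rnorm(500, mean = 190, sd = 35),
                 sbp  = rnorm(500, mean = 120, sd = 15))

ggplot(df, aes(x = chol, y = sbp)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +  # the correlation line
  labs(x = "Total cholesterol (mg/dL)", y = "Systolic blood pressure (mm Hg)")

# Or quantify the relationship directly with Spearman's correlation
cor(df$chol, df$sbp, method = "spearman")
```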

Click through for several examples of visuals.
