Press "Enter" to skip to content

Category: R

ML Model Interactions and hstats

Michael Mayer has a new R package for us:

This post is mainly about the third approach. Its beauty is that we get information about all interactions. The downside: it is as good/bad as partial dependence functions. And: the statistics are computationally very expensive to compute (of order n^2).

Different R packages offer some of these H-statistics, including {iml}, {gbm}, {flashlight}, and {vivid}. They all have their limitations. This is why I wrote the new R package {hstats}:

Click through for an overview of the package and an example of how it works.

Comments closed

Set Intersection of Vectors in R

Steven Sanderson performs set intersection of vectors:

Welcome to another exciting blog post where we delve into the world of R programming. Today, we’ll be discussing the intersect() function, a handy tool that helps us find the common elements shared between two or more vectors in R. Whether you’re a seasoned R programmer or just starting your journey, this function is sure to become a valuable addition to your toolkit.

Click through to see how the function works.

Comments closed

Cumulative Means in R

Steven Sanderson performs a moving average:

The cumulative mean, also known as the running mean or moving average, provides us with a dynamic view of how the average value of a dataset changes as new observations are added incrementally. It is an invaluable tool in time-series analysis, trend identification, and smoothing noisy data.

Imagine you have a series of numeric values, and you want to find the average of the first observation, then the average of the first two observations, followed by the average of the first three, and so on. This iterative process generates the cumulative mean, painting a picture of how the data behaves over time.

Often times, we care about the moving average over a specific window, such as the last n periods. This particular post covers the moving average over the entire set of data.

Comments closed

Percentage by Group in R

Steven Sanderson performs a breakdown:

Calculating percentages by group is a common task in data analysis. It allows you to understand the distribution of data within different categories. In this blog post, we’ll walk you through the process of calculating percentages by group using three popular R packages: Base R, dplyr, and data.table. To keep things simple, we will use the well-known Iris dataset.

The Iris dataset contains information about different species of iris flowers and their measurements, including sepal length, sepal width, petal length, and petal width. We will focus on the ‘Species’ column and calculate the percentage of each species in the dataset.

Read on for the three approaches. I think the Tidyverse approach is the easiest to understand in this case, though all three get you to the answer.

Comments closed

Subsetting List Objects in R

Steven Sanderson makes a sub-list and checks it twice:

If you’re an aspiring data scientist or R programmer, you must be familiar with the powerful data structure called “lists.” Lists in R are collections of elements that can contain various data types such as vectors, matrices, data frames, or even other lists. They offer great flexibility and are widely used in many real-world scenarios.

In this blog post, we will explore one of the essential skills in working with lists: subsetting. Subsetting allows you to extract specific elements or portions of a list, helping you access and manipulate data efficiently. So, let’s dive into the world of list subsetting and learn some useful techniques along the way!

Read on for multiple ways of subsetting lists in base R.

Comments closed

Finding Duplicate Rows and Values in R

Steven Sanderson de-duplicates, starting with values:

In data analysis and programming, it’s common to encounter situations where you need to identify duplicate values within a dataset. Whether you’re a beginner or an experienced programmer, knowing how to find duplicate values is a fundamental skill. In this blog post, we will explore two different approaches to accomplish this task using base R functions and the dplyr package in R. By the end, you’ll have a clear understanding of how to detect and manage duplicate values in your own datasets.

From there, we get to see various ways to de-duplicate rows in R:

In data analysis and manipulation tasks, it’s common to encounter situations where we need to identify and handle duplicate rows in a dataset. In this blog post, we will explore three different approaches to finding duplicate rows in R: the base R method, the dplyr package, and the data.table package. We’ll compare their performance using the benchmark function and provide insights on when to use each approach. So, grab your coding gear, and let’s dive in!

Duplicate values is a relatively tricky one, with rows being much easier.

Comments closed

Modularizing an Existing Shiny App

Peter Baranovskiy breaks it down:

There are multiple tutorials available online on writing modular Shiny apps. So why one more? Well, when I just started with building modular apps myself, these didn’t do much for me. So I really only learned how to write modules when I had an opportunity to team up with an experienced R Shiny developer. The reason I guess is that Shiny modules is an advanced topic, and you typically get to writing modules only when you finally need to scale your apps – and keep opportunities for further scaling open. This typically means when your app goes into production. By then you probably have already developed multiple apps, and switching over to a way of thinking required to write modules may be challenging. If you don’t know what modules are, I recommend starting here and then coming back to this post. Otherwise, read on.

So, I decided to try a different approach and instead of building a simple modular app from scratch, to go in the opposite direction by breaking down a complex real-life app into modules. Here’s the app’s original non-modular code. Note a single app.R file that contains the entire app. static_assets.R includes some object definitions which I moved to a separate file for convenience. calgary_crime_data_prep.R is not part of the app; it is a data retrieval and cleaning script executed once a month with cron. Running the script each time the app launches would have made it extremely slow and would use way too much bandwidth, as the script downloads and processes 150+ Mb of data on each run.

Read on for the reasoning behind using modules, as well as Peter’s notes on the process.

Comments closed

Object Comparison in R

Steven Sanderson checks two objects:

In the realm of programming, R is a widely-used language for statistical computing and data analysis. Within R, there exists a powerful function called identical() that allows programmers to compare objects for exact equality. In this blog post, we will delve into the syntax and usage of the identical() function, providing clear explanations and practical examples along the way.

You can also take a look at the documentation for this function to see a few more examples.

Comments closed

Creating an HTTP Header Hash in R

Bob Rudis creates an R package:

HTTP Headers Hashing (HHHash) is a technique developed by Alexandre Dulaunoy to generate a fingerprint of an HTTP server based on the headers it returns. It employs one-way hashing to generate a hash value from the list of header keys returned by the server. The HHHash value is calculated by concatenating the list of headers returned, ordered by sequence, with each header value separated by a colon. The SHA256 of this concatenated list is then taken to generate the HHHash value. HHHash incorporates a version identifier to enable updates to new hashing functions.

Read on to see when it might be useful and other things you should know about the package. H/T R-Bloggers.

Comments closed