Category: R

Data Exploration in R with dplyr

Adrian Tam continues a series on R:

When you are working on a data science project, the data is often in tabular form. You can use the built-in data table to handle such data in R. You can also use the famous library dplyr instead to benefit from its rich toolset. In this post, you will learn how dplyr can help you explore and manipulate tabular data. In particular, you will learn:

  • How to handle a data frame
  • How to perform some common operations on a data frame

I like dplyr a lot for its “functional flow”—you pipe outputs of one function to be inputs of the next function, so the chain makes a lot of sense. If you want high performance, though, it’s often not the best choice—that’s usually data.table.
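
To make the "functional flow" idea concrete, here is a minimal sketch in plain Python rather than dplyr itself. The toy rows, the step functions, and the `pipe` helper are all hypothetical names invented for illustration; the point is only that each step takes the previous step's output as its input.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical toy rows standing in for a data frame.
rows = [
    {"species": "setosa", "petal_length": 1.4},
    {"species": "setosa", "petal_length": 1.6},
    {"species": "virginica", "petal_length": 5.5},
    {"species": "virginica", "petal_length": 6.1},
    {"species": "virginica", "petal_length": 0.9},
]


def keep_long_petals(data):
    """Filter step: keep rows whose petal_length is above 1.0."""
    return [row for row in data if row["petal_length"] > 1.0]


def mean_by_species(data):
    """Group-and-summarise step: mean petal_length per species."""
    groups = defaultdict(list)
    for row in data:
        groups[row["species"]].append(row["petal_length"])
    return {species: sum(vals) / len(vals) for species, vals in groups.items()}


def pipe(value, *steps):
    """Feed value through each step in order, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


result = pipe(rows, keep_long_petals, mean_by_species)
print(result)
```

Reading the `pipe(...)` call left to right gives the order of operations, which is the property that makes a chained style easy to follow.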

ggplot2 in Python Notebooks

John Mount runs R in Python with rpy2:

For an article on A/B testing that I am preparing, I asked my partner Dr. Nina Zumel if she could do me a favor and write some code to produce the diagrams. She prepared an excellent parameterized diagram generator. However being the author of the book Practical Data Science with R, she built it in R using ggplot2. This would be great, except the A/B testing article is being developed in Python, as it targets programmers familiar with Python.

As the production of the diagrams is not part of the proposed article, I decided to use the rpy2 package to integrate the R diagrams directly into the new worksheet. Alternatively, I could translate her code into Python using one of: Seaborn objects, plotnine, ggpy, or others. The large number of options is evidence of how influential Leland Wilkinson’s grammar of graphics (gg) is.

Click through to see how you can execute R code within the context of Python, similar to how you can use the reticulate package to execute Python code in the context of R.

Pairs Plots in Base R

Steven Sanderson shows how we can create a pairs plot using the pairs() function in R:

A pairs plot, also known as a scatterplot matrix, is a grid of scatterplots that displays pairwise relationships between multiple variables in a dataset. Each cell in the grid represents the relationship between two variables, and the diagonal cells display histograms or kernel density plots of individual variables. Pairs plots are incredibly versatile, helping us to identify patterns, correlations, and potential outliers in our data.

Click through for one example, how to interpret it, and how to customize the outputs.

Creating Confidence Intervals on a Linear Model in R

Steven Sanderson goes frequentist on us:

Linear regression is a fundamental statistical technique used to model the relationship between a dependent variable and one or more independent variables. While fitting a linear model is relatively straightforward in R, it’s also essential to understand the uncertainty associated with our model’s predictions. One way to visualize this uncertainty is by creating confidence intervals around the regression line. In this blog post, we’ll walk through how to perform linear regression and plot confidence intervals using base R with the popular Iris dataset.

Click through to see how, even if you’re a Bayesian who considers confidence intervals to overstate precision in reality.
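
For reference, the band usually drawn around a simple regression line is the confidence interval for the mean response. The following is the standard textbook form, not necessarily the exact computation in the linked post, and the symbols ($x_0$ for the point of interest, $\hat{\sigma}$ for the residual standard error, $n$ for the number of observations) are notation chosen here:

```latex
\hat{y}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0,
\qquad
\hat{y}(x_0) \;\pm\; t_{1-\alpha/2,\,n-2}\;\hat{\sigma}
\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}},
\qquad
\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \bigl(y_i - \hat{y}(x_i)\bigr)^2
```

The interval is narrowest at $x_0 = \bar{x}$ and widens as $x_0$ moves away from the mean, which is why the plotted band flares out at both ends. In base R, `predict()` on an `lm` fit with `interval = "confidence"` returns these bounds.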

Grouped Scatter Plots in R

Steven Sanderson builds a scatter plot:

Data visualization is a powerful tool for gaining insights from your data. Scatter plots, in particular, are excellent for visualizing relationships between two continuous variables. But what if you want to compare multiple groups within your data? In this blog post, we’ll explore how to create engaging scatter plots by group in R. We’ll walk through the process step by step, providing several examples and explaining the code blocks in simple terms. So, whether you’re a data scientist, analyst, or just curious about R, let’s dive in and discover how to make your data come to life!

Click through for several examples of plot generation.

Working with Histogram Breaks in R

Steven Sanderson divvies out buckets for a histogram:

Histograms divide data into bins, or intervals, and then count how many data points fall into each bin. The breaks parameter in R allows you to control how these bins are defined. By specifying breaks thoughtfully, you can highlight specific patterns and nuances in your data.

Click through to see how you can use the breaks parameter in a few different ways to customize your histogram. The default breaks in R are often reasonable, but trying a few different breaks can help you get a better understanding of the actual distribution of the data.
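
As a rough illustration of what explicit breaks do, here is a small Python sketch of the counting step. The function name `count_by_breaks` and the sample values are made up for this example. It assumes right-closed bins with the lowest edge included, which matches the defaults of R's `hist()` (`right = TRUE`, `include.lowest = TRUE`).

```python
from bisect import bisect_left


def count_by_breaks(values, breaks):
    """Count values per bin for sorted break points.

    Bins are (breaks[i], breaks[i + 1]], except that the first bin also
    includes its left edge. Values outside the breaks raise an error.
    """
    counts = [0] * (len(breaks) - 1)
    for v in values:
        if v < breaks[0] or v > breaks[-1]:
            raise ValueError(f"{v} is outside the range spanned by breaks")
        # bisect_left returns i such that breaks[i - 1] < v <= breaks[i].
        i = bisect_left(breaks, v)
        counts[max(i - 1, 0)] += 1
    return counts


# Hypothetical sample data.
values = [0, 1, 2, 2.5, 3, 4, 5, 6]

print(count_by_breaks(values, [0, 2, 4, 6]))  # three bins of width 2
print(count_by_breaks(values, [0, 3, 6]))     # two bins of width 3
```

The same eight values produce differently shaped histograms depending on the breaks, which is why it is worth trying more than one set.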

An Introduction to R Markdown

Adrian Tam continues a series on R:

One reason people like to use RStudio for their work is R Markdown. It makes RStudio not only an IDE for programming in R, but also a notepad in which they can put down their thoughts alongside R code and results. In this post, you will learn how to use R Markdown. Specifically, you will learn:

  • What is Markdown
  • How to use Markdown to create a technical document in RStudio

Click through to learn more. I’d also suggest diving into the docs for knitr.

Multivariate Histograms in R

Steven Sanderson wants multiple breakdowns:

Histograms are powerful tools for visualizing the distribution of a single variable, but what if you want to compare the distributions of two variables side by side? In this blog post, we’ll explore how to create a histogram of two variables in R, a popular programming language for data analysis and visualization.

We’ll cover various scenarios, from basic histograms to more advanced techniques, and explain the code step by step in simple terms. So, grab your favorite dataset or generate some random data, and let’s dive into the world of dual-variable histograms!

Click through for several techniques.

Multi-Plot Graphs in R

Steven Sanderson needs more than one line:

Data visualization is a crucial aspect of data analysis. In R, the flexibility and power of its plotting capabilities allow you to create compelling visualizations. One common scenario is the need to display multiple plots on the same graph. In this blog post, we’ll explore three different approaches to achieve this using the same dataset. We’ll use the set.seed(123) and generate data with x and y equal to cumsum(rnorm(25)) for consistency across examples.

Click through for three common techniques.
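
The data described in the excerpt is a seeded random walk: a cumulative sum of 25 standard normal draws. A rough Python equivalent is sketched below. The helper name `random_walk` and the second seed are assumptions made here, and because Python's generator differs from R's, the values will not match what `set.seed(123)` produces in R.

```python
import random
from itertools import accumulate


def random_walk(n, seed):
    """Cumulative sum of n standard normal draws from a seeded generator."""
    rng = random.Random(seed)
    return list(accumulate(rng.gauss(0.0, 1.0) for _ in range(n)))


# Two series to draw on the same graph; both seeds are arbitrary.
x = random_walk(25, seed=123)
y = random_walk(25, seed=456)

print(len(x), len(y))
```

Fixing the seed is what keeps the data identical across the three plotting approaches, so any visual difference comes from the technique rather than from the numbers.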

Reading Parquet Files with DuckDB and R

Michaël reads a Parquet file:

Querying a remote parquet file via HTTP with DuckDB.

The French statistical service (INSEE) has made available its first parquet file on data.gouv.fr in June.

It’s a 470 MB file (from a 1.8 GB CSV) with 16·10⁶ rows, showing for each address in France which polling station it belongs to.

Click through for the code and results. The only thing which surprised me at all was that the performance was so fast for a remote file, unless I’m misunderstanding something. For a local file, I’d expect 16 million rows to complete in under 2 seconds for heavy aggregation on two columns in Parquet. H/T R-Bloggers.
