Category: R

Multivariate Histograms in R

Steven Sanderson wants multiple breakdowns:

Histograms are powerful tools for visualizing the distribution of a single variable, but what if you want to compare the distributions of two variables side by side? In this blog post, we’ll explore how to create a histogram of two variables in R, a popular programming language for data analysis and visualization.

We’ll cover various scenarios, from basic histograms to more advanced techniques, and explain the code step by step in simple terms. So, grab your favorite dataset or generate some random data, and let’s dive into the world of dual-variable histograms!

Click through for several techniques.
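As a rough idea of what one of those techniques can look like, here is a minimal base R sketch that overlays two histograms on shared bins. The data are simulated stand-ins and the colors are arbitrary; this is not necessarily how the linked post does it.

```r
# Simulated data; x and y are hypothetical stand-ins for two real variables.
set.seed(42)
x <- rnorm(500, mean = 0, sd = 1)
y <- rnorm(500, mean = 2, sd = 1.5)

# Shared breaks so the bars of both histograms line up.
breaks <- pretty(range(c(x, y)), n = 30)

# Compute the histograms without drawing them yet.
hx <- hist(x, breaks = breaks, plot = FALSE)
hy <- hist(y, breaks = breaks, plot = FALSE)

# Semi-transparent fills keep the overlap visible.
col_x <- rgb(0, 0, 1, 0.4)
col_y <- rgb(1, 0, 0, 0.4)

# Draw the first histogram, then overlay the second with add = TRUE.
plot(hx, col = col_x, xlim = range(breaks),
     ylim = c(0, max(hx$counts, hy$counts)),
     main = "Two variables, shared bins", xlab = "Value")
plot(hy, col = col_y, add = TRUE)
legend("topright", legend = c("x", "y"), fill = c(col_x, col_y))
```

Computing the breaks once from the combined range is the important part: with separate default bins, the two sets of bars would not be comparable.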

Multi-Plot Graphs in R

Steven Sanderson needs more than one line:

Data visualization is a crucial aspect of data analysis. In R, the flexibility and power of its plotting capabilities allow you to create compelling visualizations. One common scenario is the need to display multiple plots on the same graph. In this blog post, we’ll explore three different approaches to achieve this using the same dataset. We’ll call set.seed(123) and generate data with x and y each equal to cumsum(rnorm(25)) for consistency across examples.

Click through for three common techniques.
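For a sense of the options, here is a base R sketch built on the data setup described in the excerpt. The three approaches shown (lines(), par(mfrow = ...), and matplot()) are my assumption about what "common techniques" covers and may not match the post's exact choices.

```r
# Data as described in the excerpt: two random walks of length 25.
set.seed(123)
x <- cumsum(rnorm(25))
y <- cumsum(rnorm(25))

# Approach 1: draw one series, then add the second with lines().
plot(x, type = "l", col = "blue", ylim = range(c(x, y)),
     xlab = "Index", ylab = "Value", main = "Overlay with lines()")
lines(y, col = "red")
legend("topleft", legend = c("x", "y"), col = c("blue", "red"), lty = 1)

# Approach 2: separate panels on one device with par(mfrow = ...).
old_par <- par(mfrow = c(1, 2))
plot(x, type = "l", col = "blue", main = "x")
plot(y, type = "l", col = "red", main = "y")
par(old_par)  # restore the previous layout

# Approach 3: matplot() draws every column of a matrix in one call.
matplot(cbind(x, y), type = "l", lty = 1, col = c("blue", "red"),
        xlab = "Index", ylab = "Value", main = "matplot()")
```

Setting ylim from both series in the first approach matters, since plot() otherwise sizes the axis for x alone and y can run off the chart.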

Reading Parquet Files with DuckDB and R

Michaël read a Parquet file:

Querying a remote parquet file via HTTP with DuckDB.

The French statistical service (INSEE) made its first parquet file available on data.gouv.fr in June.

It’s a 470 MB file (from a 1.8 GB CSV) with 16·10⁶ rows, showing for each address in France which polling station it belongs to.

Click through for the code and results. The only thing which surprised me at all was that the performance was so fast for a remote file, unless I’m misunderstanding something. For a local Parquet file, I’d expect a heavy aggregation on two columns over 16 million rows to complete in under 2 seconds. H/T R-Bloggers.
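The general pattern looks something like the sketch below, which assumes the DBI and duckdb packages are installed. To keep it runnable offline, it writes a small made-up table to a temporary Parquet file and queries that; the table and its column names (address_id, station_id) are hypothetical. For a remote file, my understanding is that you would pass an https URL to read_parquet() instead, after loading DuckDB's httpfs extension.

```r
library(DBI)
library(duckdb)

# In-memory DuckDB database.
con <- dbConnect(duckdb())

# Temporary Parquet file; forward slashes keep the SQL string portable.
path <- normalizePath(tempfile(fileext = ".parquet"),
                      winslash = "/", mustWork = FALSE)

# Write 1,000 made-up rows to Parquet (hypothetical columns).
dbExecute(con, paste0(
  "COPY (SELECT i AS address_id, i % 5 AS station_id ",
  "FROM range(1000) AS t(i)) TO '", path, "' (FORMAT PARQUET)"
))

# Query the Parquet file directly, without importing it into a table first.
res <- dbGetQuery(con, paste0(
  "SELECT station_id, count(*) AS n ",
  "FROM read_parquet('", path, "') ",
  "GROUP BY station_id ORDER BY station_id"
))
print(res)

dbDisconnect(con, shutdown = TRUE)
```

Because Parquet is columnar, an aggregation like this only needs to read the columns it touches, which goes a long way toward explaining the speed.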

Plotting SVM Decision Boundaries in R

Steven Sanderson goes right up to the edge:

Support Vector Machines (SVM) are a powerful tool in the world of machine learning and classification. They excel in finding the optimal decision boundary between different classes of data. However, understanding and visualizing these decision boundaries can be a bit tricky. In this blog post, we’ll explore how to plot an SVM object using the e1071 library in R, making it easier to grasp the magic happening under the hood.

Read on to see how you can perform this analysis as well.
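A minimal sketch of the idea, assuming the e1071 package is installed. I use the built-in iris data and two predictors so the boundary can be drawn in two dimensions; the dataset and kernel choice are mine, not necessarily the post's.

```r
library(e1071)

# Two predictors plus the class label, so the boundary is a 2D picture.
df <- iris[, c("Petal.Length", "Petal.Width", "Species")]

# Fit a classification SVM with a radial kernel.
fit <- svm(Species ~ ., data = df, kernel = "radial")

# plot() on an svm object shades the predicted class regions and marks
# the support vectors; with only two predictors no formula is needed.
plot(fit, df)

# Training accuracy, as a quick sanity check on the fit.
acc <- mean(predict(fit, df) == df$Species)
print(acc)
```

With more than two predictors, plot() needs a formula naming the two dimensions to display, and the remaining variables are held at fixed values.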

Plotting a Subset of Data in R

Steven Sanderson doesn’t need all of those data points:

Data visualization is a powerful tool for gaining insights from your data. In R, you have a plethora of libraries and functions at your disposal to create stunning and informative plots. One common task is to plot a subset of your data, which allows you to focus on specific aspects or trends within your dataset. In this blog post, we’ll explore various techniques to plot subsets of data in R, and I’ll explain each step in simple terms. Don’t worry if you’re new to R – by the end of this post, you’ll be equipped to create customized plots with ease!

Click through for several techniques for subsetting data, as well as reasons why you might want to do it.
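One simple version of this, sketched in base R with the built-in mtcars data (my choice of dataset and filter, not the post's): plot everything in grey for context, then highlight the subset on top.

```r
# Keep only the four-cylinder cars.
four_cyl <- subset(mtcars, cyl == 4)

# Full dataset in grey for context.
plot(mpg ~ wt, data = mtcars, col = "grey70", pch = 16,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Four-cylinder cars highlighted")

# Overlay the subset in a contrasting color.
points(mpg ~ wt, data = four_cyl, col = "firebrick", pch = 16)
legend("topright", legend = c("All cars", "4 cylinders"),
       col = c("grey70", "firebrick"), pch = 16)
```

Keeping the full data in the background preserves the axis range, so the subset is shown in context rather than on a rescaled chart.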

Statistical Tests in R

Adrian Tam tries out a couple of tests:

R as a data analytics platform is expected to have a lot of support for various statistical tests. In this post, you are going to see how you can run statistical tests using the built-in functions in R. Specifically, you are going to learn:

  • What a t-test is and how to run one in R
  • What an F-test is and how to run one in R

Statistical testing is one of the things R does better than any other language. R has support for an enormous number of statistical functions, either built into the base language or available as packages.
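Both tests are one-liners in base R. Here is a small sketch on simulated data; I am assuming the F-test in question is the two-sample test for equal variances, which is what var.test() performs.

```r
# Two simulated samples; the means and standard deviations are arbitrary.
set.seed(1)
a <- rnorm(40, mean = 10, sd = 2)
b <- rnorm(40, mean = 11, sd = 2)

# Two-sample t-test; by default this is Welch's version, which does not
# assume equal variances.
tt <- t.test(a, b)

# F-test comparing the two sample variances.
ft <- var.test(a, b)

print(tt)
print(ft)

# Both return "htest" objects, so the pieces are easy to extract.
tt$p.value
ft$statistic  # ratio of the sample variances, var(a) / var(b)
```

The F-test assumes normally distributed data and is sensitive to departures from that, so it is worth checking the distributions before leaning on the result.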

Finding Omitted Variables in Logistic Regression

John Mount picks up on a prior post:

For this note, let’s work out how to directly try and overcome the omitted variable bias by solving for the hidden or unobserved detailed data. We will work our example in R. We will derive some deep results out of a simple set-up. We show how to “un-marginalize” or “un-summarize” data.

This is an interesting dive into a common problem, and something which we can easily work around in linear regression, but not in logistic regression.
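To see the difference between the two cases, here is a small simulation of my own. It only illustrates the bias and is not the "un-marginalizing" method from the post. The coefficients (1 for x, 2 for z) and the sample size are arbitrary assumptions.

```r
set.seed(2023)
n <- 20000
x <- rnorm(n)
z <- rnorm(n)  # independent of x; this is the variable we will omit

# Linear outcome: omitting an independent z leaves the x slope intact.
y_lin <- 1 * x + 2 * z + rnorm(n)
b_lin_short <- coef(lm(y_lin ~ x))[["x"]]

# Binary outcome generated from the same linear predictor.
y_bin <- rbinom(n, size = 1, prob = plogis(1 * x + 2 * z))

# Full logistic model recovers a coefficient near the true value of 1.
b_log_full <- coef(glm(y_bin ~ x + z, family = binomial()))[["x"]]

# Omitting z shrinks the x coefficient toward zero, even though z is
# independent of x.
b_log_short <- coef(glm(y_bin ~ x, family = binomial()))[["x"]]

round(c(linear_short = b_lin_short,
        logistic_full = b_log_full,
        logistic_short = b_log_short), 3)
```

In the linear model, an omitted variable that is independent of x just ends up in the noise term. In the logistic model, the same omission attenuates the coefficient on x, which is why the problem needs a different kind of fix.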

Building a Weierstrass Function in R

Tomaz Kastrun won’t let you take a derivative:

Coming from the simple sine function (remember the Fourier series), German mathematician Karl Weierstrass became the first to publish an example of a continuous, nowhere differentiable function. The Weierstrass function (originally defined as a Fourier series) was the first published example to challenge the idea that a continuous function must be differentiable except on a set of isolated points. This is an example of a fractal in a function (known as a fractal function) and also one of the pathological functions (those that run counter to intuition).

Click through for an example of this in R.
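A minimal sketch of the classic form, W(x) = Σ aᵏ cos(bᵏ π x), in base R. The parameter values and the number of terms are my assumptions; Weierstrass's original conditions were 0 < a < 1, b a positive odd integer, and ab > 1 + 3π/2, which a = 0.5 and b = 13 satisfy. Any finite sum like this one is still smooth, so the plot only approximates the real thing.

```r
# Truncated Weierstrass sum: sum over k = 0 .. n_terms - 1 of
# a^k * cos(b^k * pi * x). Defaults are illustrative assumptions.
weierstrass <- function(x, a = 0.5, b = 13, n_terms = 12) {
  k <- 0:(n_terms - 1)
  vapply(x, function(xi) sum(a^k * cos(b^k * pi * xi)), numeric(1))
}

# Evaluate on a fine grid and plot.
xs <- seq(-2, 2, length.out = 4000)
w <- weierstrass(xs)
plot(xs, w, type = "l", xlab = "x", ylab = "W(x)",
     main = "Truncated Weierstrass function (a = 0.5, b = 13)")

# Zooming in shows the same jaggedness at a smaller scale.
xz <- seq(0.4, 0.6, length.out = 4000)
plot(xz, weierstrass(xz), type = "l", xlab = "x", ylab = "W(x)",
     main = "Zoomed in")
```

I kept the number of terms modest on purpose: bᵏ grows quickly, and past a certain point the cosine arguments exceed what double precision can represent accurately.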

Appropriate Uses of Jitter in Graphs

Steven Sanderson shakes things up:

As an R programmer, one of the most useful functions to know is the jitter function. The jitter function is used to add random noise to a numeric vector, which can be helpful when visualizing data in a scatterplot. By using the jitter function, we can get a better picture of the true underlying relationship between two variables in a dataset.

Read on to get an idea of how to use jitter, though I recommend making it very clear to chart viewers that you are, in fact, using jitter, as it can be easy to misinterpret the jitter as actual value locations.
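Here is a small base R sketch of both the technique and the labeling. The data are simulated, and the jitter amount of 0.2 is an arbitrary choice of mine.

```r
# Discrete data: many points share the same coordinates and overplot.
set.seed(10)
x <- sample(1:5, 200, replace = TRUE)
y <- x + sample(-1:1, 200, replace = TRUE)

# jitter() with amount = 0.2 adds uniform noise in [-0.2, 0.2].
xj <- jitter(x, amount = 0.2)
yj <- jitter(y, amount = 0.2)

old_par <- par(mfrow = c(1, 2))
plot(x, y, pch = 16, main = "Raw (overplotted)")
# Say so in the title, so viewers do not read the noise as real values.
plot(xj, yj, pch = 16, main = "Jittered (±0.2 noise added)",
     xlab = "x", ylab = "y")
par(old_par)
```

In the raw panel, 200 points collapse into a handful of visible dots; the jittered panel shows how many observations sit at each location, at the cost of slightly misplacing every one of them.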

Kernel Density Plots in R

Steven Sanderson explains one common type of plot in R:

Kernel Density Plots are a type of plot that displays the distribution of values in a dataset using one continuous curve. They are similar to histograms, but they are even better at displaying the shape of a distribution since they aren’t affected by the number of bins used in the histogram. In this blog post, we will discuss what Kernel Density Plots are in simple terms, what they are useful for, and show several examples using both base R and ggplot2.

Read on to learn more, including how to generate these in base R, ggplot2, and with the tidy_density package.
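The base R version takes only a few lines. This sketch uses simulated bimodal data of my own and overlays the density curve on a histogram for comparison.

```r
# Simulated bimodal sample.
set.seed(7)
v <- c(rnorm(300, mean = 0, sd = 1), rnorm(200, mean = 4, sd = 0.8))

# Kernel density estimate with the defaults: Gaussian kernel and an
# automatically chosen bandwidth.
d <- density(v)

# Histogram on the density scale (freq = FALSE) so the curve is comparable.
hist(v, freq = FALSE, breaks = 30, col = "grey90", border = "white",
     main = "Histogram with kernel density estimate", xlab = "Value")
lines(d, lwd = 2, col = "steelblue")
rug(v)  # tick marks for the individual observations

# The estimate should integrate to roughly 1.
area <- sum(d$y) * diff(d$x[1:2])
print(area)
```

Density plots do not depend on bin count, but they do depend on bandwidth: too small and the curve is noisy, too large and the two modes blur together. The bw and adjust arguments to density() control this.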
