Press "Enter" to skip to content

Category: Data Science

Plotting The Effects of Noise on R^2

Tomaz Kastrun messes with R^2:

So, an R-squared of 0.59 might show how well the data fit to the model (hence goodness of fit) and also explains about 59% of the variation in our dependent variable.

Given this logic, we prefer our regression models to have a high R-squared. Yes? Right! And by useless test, with adding random noise to a function, what happens next?

I like Tomaz’s scenario here and think he does a good job demonstrating the outcome. I do, however, struggle with the characterization of “making R^2 useless.” When the error term approaches an enormous value relative to the regressable components, that R^2 is telling you that something else is dominating the relationship between the independent variables and dependent variable. And this is correct: that error term does dominate. I suppose the problem here is philosophical: we call it an error term but what it signifies is “information we don’t understand about the relationship between these variables.” Yes, in this toy example, it was randomly-generated noise. But in a real dataset, it’s not random; it’s inexplicable, at least given the information you know at that time and the mechanisms you use to analyze the relationship.

Comments closed

Power Regression in R

Steven Sanderson’s power level is over 9000:

In the realm of statistics, power regression stands out as a versatile tool for exploring the relationship between two variables, where one variable is the power of the other. This type of regression is particularly useful when there’s an inherent nonlinear relationship between the variables, often characterized by an exponential or inverse relationship.

Read on to learn more about the definition of power regression and how to perform it in R using a technique called “swole linear regression.” Or at least that’s what I think the technique should be called. Which is probably why I’m not in charge of naming things.

Comments closed

Operating on Time Series Data in R

Dario Radečić understands that time is a flat circle:

If there’s one type of data no company has a shortage of, it has to be time series data. Yet, many beginner and intermediate R developers struggle to grasp their heads around basic R time series concepts, such as manipulating datetime values, visualizing time data over time, and handling missing date values.

Lucky for you, that will all be a thing of the past in a couple of minutes. This article brings you the basic introduction to the world of R time series analysis. We’ll cover many concepts, from key characteristics of time series datasets, loading such data in R, visualizing it, and even doing some basic operations such as smoothing the curve and visualizing a trendline.

We have a lot of work to do, so let’s jump straight in!

Click through for a high-level overview. H/T R-Bloggers.

Comments closed

Local Regression (LOESS) in R

Steven Sanderson takes us through a powerful regression technique:

LOESS, which stands for LOcal regrESSion, is a versatile and powerful technique for fitting a curve to a set of data points. Unlike traditional linear regression, LOESS adapts to the local behavior of the data, making it perfect for capturing intricate patterns in noisy datasets.

Click through for examples. LOESS works best with quadratic data, like in Steven’s last example image. The downside to it as a technique is that you can find spurious movement that may seem interesting but is just following the noise.

Comments closed

Exponential Regression in R

Steven Sanderson understands the power of compound interest:

Before we jump into the code, let’s quickly grasp the concept of exponential regression. In simple terms, it’s a statistical method used to model relationships where the rate of change of a variable is proportional to its current state. Think of scenarios like population growth, viral spread, or even financial investments.

Read on to see how you can perform a regression in this kind of scenario.

Comments closed

Quadratic Regression in R

Steven Sanderson needs more than a line:

In the realm of data analysis, quadratic regression emerges as a powerful tool for uncovering the hidden patterns within datasets that exhibit non-linear relationships. Unlike its linear counterpart, quadratic regression ventures beyond straight lines, gracefully capturing curved relationships between variables. This makes it an essential technique for understanding a wide range of phenomena, from predicting stock prices to modeling population growth.

Embark on a journey into the world of quadratic regression using the versatile R programming language. We’ll explore the steps involved in fitting a quadratic model, interpreting its parameters, and visualizing the results. Along the way, you’ll gain hands-on experience with this valuable technique, enabling you to tackle your own data analysis challenges with confidence.

Read on to see how you can model a quadratic relationship between one independent variable (or multiple independent variables) and the dependent variable in lm().

Comments closed

New Features in healthyR.ts 0.3

Steven Sanderson lays out some updates:

One of the standout additions is the introduction of util_log_ts(). This function seems like a game-changer, providing a streamlined way to log time series data. This is incredibly useful, especially when dealing with extensive datasets, making the whole process more efficient and user-friendly. This is a helper function for auto_stationarize().

There’s a lot in this update and the blog post also includes several examples of automating stationarity and ARIMA.

Comments closed

Creating Prediction Intervals in R

Steven Sanderson builds a prediction interval:

Prediction intervals are a powerful tool for understanding the uncertainty of your predictions. They allow you to specify a range of values within which you are confident that the true value will fall. This can be useful for many tasks, such as setting realistic goals, making informed decisions, and communicating your findings to others.

In this blog post, we will show you how to create a prediction interval in R using the mtcars dataset. The mtcars dataset is a built-in dataset in R that contains information about fuel economy, weight, displacement, and other characteristics of 32 cars.

Click through to see an example based on linear regression.

Comments closed

Comparing Permutation SHAP and Kernel SHAP

Michael Mayer lays some groundwork:

SHAP is the predominant way to interpret black-box ML models, especially for tree-based models with the blazingly fast TreeSHAP algorithm.

For general models, two slower SHAP algorithms exist:

  1. Permutation SHAP (Štrumbelj and Kononenko, 2010)
  2. Kernel SHAP (Lundberg and Lee, 2017)

Read on to understand more about these two forms of SHAP, as well as how they compare in two datasets of differing levels of difficulty.

Comments closed