Press "Enter" to skip to content

Category: Data Science

Quantile Regression using Random Forests

Norm Matloff answers a reader question:

In my December 22 blog, I first introduced the classic parametric quantile regression (QR) concept. I then showed how one could use the qeML package to perform quantile regression nonparametrically, using the package’s qeKNN function for a k-Nearest Neighbors approach. A reader then asked if this could be applied to random forests (RFs). The answer is yes, and this will be the topic of the current post.

Read on to learn more about how to do this, including some of the challenges you’ll face along the way. H/T R-Bloggers.

Comments closed

Reversion to the Mean

Holger von Jouanne-Diedrich explains an important statistical concept we all too often forget:

In the realm of business and leadership, one statistical phenomenon often goes unrecognized yet significantly influences our understanding of performance and success. This is the concept of reversion to the mean (also called regression to the mean). This seemingly simple statistical occurrence can profoundly impact how we perceive management strategies, leadership effectiveness, and even the fate of those gracing the covers of prominent magazines. To understand what is going on, read on!

Read on for a video in German and an article in English, with some bonus R code to sell the story.

Comments closed

Notes on Linear Markov Chains

John Mount has some thoughts for us:

I want to collect some “great things to know about linear Markov chains.”

For this note we are working with a Markov chain on states that are the integers 0 through k (k > 0). A Markov chain is an iterative random process with time tracked as an increasing integer t, and the next state of the chain depending only on the current (soon to be previous) state. For our linear Markov chain the only possible next states from state i are: i (called a “self loop” when present), i+1 (called up or right), and i-1 (called down or left). In no case does the chain progress below 0 or above k.

Click through for notes on two variants of this sort of linear Markov chain, as well as a pair of appendices containing derivation notes and Python code.

Comments closed

Plotting Time Series in R

Steven Sanderson builds some charts:

Our Flight Plan:

  1. Loading Up with Data: Grabbing our trusty dataset, AirPassengers.
  2. Taking Off with Base R: Creating a basic time series plot using base R functions.
  3. Soaring with ggplot2: Crafting a visually stunning time series plot using the ggplot2 library.
  4. Navigating Date Formatting: Customizing axis labels with scale_x_date() for clarity.
  5. Landing with Your Own Exploration: Encouraging you to take the controls and create your own time series plots!

Click through to see each of these steps in action.

Comments closed

Creating a Time Series in R

Steven Sanderson says it’s time:

The ts() function in R is a fundamental tool for handling time series data. It takes four main arguments:

  1. data: A vector or matrix of time series values.
  2. start: The time of the first observation.
  3. end: The time of the last observation.
  4. frequency: The number of observations per unit of time.

Read on for an example of how this all works, as well as a function in the TidyDensity package to convert data into the R time series format.

Comments closed

Exploratory Data Analysis with F# and Plotly

Matt Eland is speaking my language (F#):

One of the most common tasks with data roles is the need to perform exploratory data analysis (EDA).

With EDA a data scientist, data analyst, or other data-oriented programmer can:

  • Understand the value distributions of their data
  • Identify outliers and data anomalies
  • Visualize correlations, trends, and relationships between multiple variables

Exploratory data analysis usually involves:

  1. Loading the data into a DataFrame
  2. Performing descriptive statistics to identify the raw shape of the data
  3. Visualizing variables of interest on their own or with other variables.

In this article I’ll walk you through the process of loading data from a sample dataset into a Microsoft.Data.Analysis DataFrame (the kind featured in ML.NET). Next, we’ll look at the descriptive statistics the DataFrame class provides and then explore the process of creating some simple visualizations with Plotly.NET.

Read on for the scenario and analysis.

Comments closed

An Introduction to Poisson Regression

Steven Sanderson talks about a discrete form of regression:

Hey data enthusiasts! Today, we’re diving into the fascinating world of count data and its trusty sidekick, Poisson regression. Buckle up, because we’re about to explore how this statistical powerhouse helps us understand the factors influencing, you guessed it, counts.

Scenario: Imagine you’re an education researcher, eager to understand how a student’s GPA might influence their job offer count after graduation. But hold on, job offers aren’t continuous – they’re discrete, ranging from 0 to a handful. That’s where Poisson regression comes in!

I have an unhealthy love for Poisson techniques, so I highly recommend checking this out.

Comments closed

Explaining Odds Ratios

Steven Sanderson explains the concept of an odds ratio:

Imagine a loan officer flipping a coin to decide whether to approve your loan. Odds ratios tell you how much more likely one factor (like your income) makes the “heads” (approval) side appear compared to another (like your student status).

In logistic regression, odds ratios compare the odds of an event (loan default, in our case) for two groups defined by a specific variable. They’re like multipliers: greater than 1 means something increases the chances of default, while less than 1 means it decreases them.

As for why, we use odds ratios because it’s hard to track and interpret changes in probabilities directly, at least when you’re thinking small numbers.

Comments closed

Fighting Heteroskedasticity in Regression Problems

Steven Sanderson deals with my favorite failure of BLUE (mainly because I love the name):

Tired of your least-squares regression model giving wonky results because some data points shout louder than others? Meet Weighted Least Squares (WLS), the superhero of regression, ready to tackle unequal variance (heteroscedasticity) and give your model the justice it deserves! Today, we’ll dive into the world of WLS in R, using base functions for maximum transparency. Buckle up, data warriors!

Read on to see how Weighted Least Squares helps in data analysis when you have heteroskedasticity.

Comments closed