Data Science – Page 13

Creating a Time Series in R

Published 2023-12-28 by Kevin Feasel

The ts() function in R is a fundamental tool for handling time series data. It takes four main arguments:

data: A vector or matrix of time series values.

start: The time of the first observation.

end: The time of the last observation.

frequency: The number of observations per unit of time.

Read on for an example of how this all works, as well as a function in the TidyDensity package to convert data into the R time series format.

Comments closed

Exploratory Data Analysis with F# and Plotly

Published 2023-12-26 by Kevin Feasel

Matt Eland is speaking my language (F#):

One of the most common tasks with data roles is the need to perform exploratory data analysis (EDA).

With EDA a data scientist, data analyst, or other data-oriented programmer can:

Understand the value distributions of their data

Identify outliers and data anomalies

Visualize correlations, trends, and relationships between multiple variables

Exploratory data analysis usually involves:

Loading the data into a DataFrame

Performing descriptive statistics to identify the raw shape of the data

Visualizing variables of interest on their own or with other variables.

In this article I’ll walk you through the process of loading data from a sample dataset into a Microsoft.Data.Analysis DataFrame (the kind featured in ML.NET). Next, we’ll look at the descriptive statistics the DataFrame class provides and then explore the process of creating some simple visualizations with Plotly.NET.

Read on for the scenario and analysis.

Comments closed

A Demonstration of Why Not to Z-Standardize Values for Logistic Regression

Published 2023-12-21 by Kevin Feasel

Sebastian Sauer takes us through a demo:

In this post, we’ll investigate the consequence of z-standardizing the predictor variables, and in addition the outcome variable in a simple logistic regression setting.

Do some coefficients change as a result of standardizing the values?

Click through for the example and what z-standardization does to the model.

Comments closed

An Introduction to Poisson Regression

Published 2023-12-20 by Kevin Feasel

Steven Sanderson talks about a discrete form of regression:

Hey data enthusiasts! Today, we’re diving into the fascinating world of count data and its trusty sidekick, Poisson regression. Buckle up, because we’re about to explore how this statistical powerhouse helps us understand the factors influencing, you guessed it, counts.

Scenario: Imagine you’re an education researcher, eager to understand how a student’s GPA might influence their job offer count after graduation. But hold on, job offers aren’t continuous – they’re discrete, ranging from 0 to a handful. That’s where Poisson regression comes in!

I have an unhealthy love for Poisson techniques, so I highly recommend checking this out.

Comments closed

Explaining Odds Ratios

Published 2023-12-19 by Kevin Feasel

Steven Sanderson explains the concept of an odds ratio:

Imagine a loan officer flipping a coin to decide whether to approve your loan. Odds ratios tell you how much more likely one factor (like your income) makes the “heads” (approval) side appear compared to another (like your student status).

In logistic regression, odds ratios compare the odds of an event (loan default, in our case) for two groups defined by a specific variable. They’re like multipliers: greater than 1 means something increases the chances of default, while less than 1 means it decreases them.

As for why, we use odds ratios because it’s hard to track and interpret changes in probabilities directly, at least when you’re thinking small numbers.

Comments closed

Fighting Heteroskedasticity in Regression Problems

Published 2023-12-18 by Kevin Feasel

Steven Sanderson deals with my favorite failure of BLUE (mainly because I love the name):

Tired of your least-squares regression model giving wonky results because some data points shout louder than others? Meet Weighted Least Squares (WLS), the superhero of regression, ready to tackle unequal variance (heteroscedasticity) and give your model the justice it deserves! Today, we’ll dive into the world of WLS in R, using base functions for maximum transparency. Buckle up, data warriors!

Read on to see how Weighted Least Squares helps in data analysis when you have heteroskedasticity.

Comments closed

ML Models and Data Warehouses in Microsoft Fabric

Published 2023-12-15 by Kevin Feasel

Tomaz Kastrun continues a series on Microsoft Fabric. First up is creating ML models:

Protip: Both experiments and the ML model version look similar, and you can intuitively switch between both of them. But do not get confused, as the ML Model version applies the best-selected model from the experiment and can be used for inference.

Then we switch context to data warehousing:

Today we will start exploring the Fabric Data Warehouse.

With the data lake-centric logic, the data warehouse in Fabric is built on a distributed processing engine, that enables automated scaling. The SaaS experience creates a segway to easier analysis and reporting, and at the same time gives the ability to run heavy workloads against open data format, simply by using transact SQL (T-SQL). Microsoft OneLake gives all the services to hold a single copy of data and can be consumed in a data warehouse, datalake or SQL Analytics.

Comments closed

Testing the Reproducibility of Random Numbers in STAN

Published 2023-12-13 by Kevin Feasel

Sebastian Sauer performs some tests:

Bayes models (using MCMC) build on drawing random numbers. By their very nature, random numbers are random. Unless they are not. As you may know, the random number fuctions in computers are purely deterministic.

However, in practice, some inpredictable behavior may still show up. The reason being simply that two computational environment must – in theory – being exactly identical in order to reproduce the same results. At least identical in every bit that influence random number generation.

Click through for the testbed and code.

Comments closed

Stepwise and Piecewise Regression in R

Published 2023-12-12 by Kevin Feasel

Steven Sanderson takes us through two regression techniques. First up is stepwise regression:

Stepwise regression is a powerful technique used to build predictive models by iteratively adding or removing variables based on statistical criteria. In R, this can be achieved using functions like step() or manually with forward and backward selection.

Piecewise regression follows:

Piecewise regression is a powerful technique that allows us to model distinct segments of a dataset with different linear relationships. It’s like fitting multiple straight lines to capture the nuances of different regions in your data. So, grab your virtual lab coat, and let’s get started.

Read on for explanations of both techniques, as well as some visuals and potential pitfalls you might run into along the way.

Comments closed

Spline Regression

Published 2023-12-08 by Kevin Feasel

Steven Sanderson performs a different form of regression:

Spline regression is particularly useful when the relationship between the independent and dependent variables is not adequately captured by a linear model. It involves fitting a piecewise continuous curve (spline) to the data. Let’s dive into the process using R.

Click through and you’ll be reticulating splines just like it’s Sim City 2000.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Data Science