Data Science – Page 30

Time Series Estimation with Facebook’s Prophet

Published 2021-07-12 by Kevin Feasel

Dan Lantos looks at the Prophet library:

This article (part of a short series) aims to introduce the Prophet library, discuss it at a high level and run through a basic example of forecasting the FTSE 100 index. Future articles will discuss exactly how Prophet achieves its results, how to interpret the output and how to improve the model.
Please see this article (by my talented colleague Gavita) for an introduction to time-series forecasting algorithms.

Click through for part one in an ongoing series.

Comments closed

Reasons to Use Tidymodels

Published 2021-07-08 by Kevin Feasel

Roel Hogervorst explains when we may or may not want to use tidymodels versus rolling our own models in R:

When not
– you are always using GLM models. (they are very flexible!) it makes no sense to me to go for the extra {parsnip} layer if you are always using the same models. You could still consider using recipes to feature engineer.
– If you are familiar with the kind of data and what models will work on that data. Basically you are an expert on this field and have worked on it for many years. There is no need to experiment.

Read on for concrete examples of when it does make sense. H/T R-Bloggers.

Comments closed

A Summary of Time Series Algorithms

Published 2021-06-25 by Kevin Feasel

Gavita Regunath and Dan Lantos give an overview of time series algorithms:

Time series forecasting is a data science task that is critical to a variety of activities within any business organisation. Time series forecasting is a useful tool that can help to understand how historical data influences the future. This is done by looking at past data, defining the patterns, and producing short or long-term predictions.

Click through for an overview, as well as ten examples of algorithms you can use for handling time series data.

Comments closed

Decile Analysis and Logistic Regression

Published 2021-06-21 by Kevin Feasel

Ridhima Kumar (re-)introduces us to decile analysis:

Decile analysis was once a popularly used technique, however the convention of teaching and bucketing machine learning problems into either ‘classification’ or ‘Regression’ types, lead people to forget Decile analysis type analyses. I am pretty sure, most freshly minted data scientists would not have even heard of Decile analysis. So, coming back to what is Decile Analysis.
Decile Analysis is used to categorize dataset from highest to lowest values or vice versa. (Based on predicted probabilities)
As obvious from the name, the analysis involves dividing the dataset into ten equal groups. Each group should have the same no. of observations/customers.
It ranks customers in the order from most likely to respond to least likely to respond.

Read on to learn the steps and how this ties with the fact that logistic regression is regression.

Comments closed

Building QQ plots in R

Published 2021-06-16 by Kevin Feasel

The folks at finnstats explain the notion of a Quantile-Quantile plot and show how to create one in R:

QQ-plots in R, first need to understand the Q-Q plot. The Q-Q plot is a graphical tool to help us examine if a set of data plausibly came from some theoretical distribution such as a Normal or not.
Suppose, if we are executing a statistical analysis the test comes under parametric methods assumes variable is Normally distributed, we can make use of a Q-Q plot to check that assumption.
It’s just a visual verification, not full proof, so we can make use of some other statistical test also. But Q-Qplot allows us to see at-a-glance if our assumption is valid or not.

Click through to learn more. H/T R-bloggers.

Comments closed

Plotting Correlation Analyses in R

Published 2021-05-17 by Kevin Feasel

Finnstats shows a few techniques for plotting correlation in R:

Correlation analysis, correlation is a term that is a measure of the strength of a relationship between two variables.
Pearson’s Product-Moment Correlation
One of the most common measures of correlation is Pearson’s product-moment correlation, which is commonly referred to simply as the correlation, or just the letter r.
Correlation shows the strength of a relationship between two variables and is expressed numerically by the correlation coefficient.

Click through for examples from several packages. H/T R-Bloggers.

Comments closed

Fitting Excel Macros into Data Science Pipelines

Published 2021-05-12 by Kevin Feasel

Bryan Shalloway has a process for us:

While I no longer use it regularly for the purposes of analysis, I will always have a soft spot in my heart for excel. Furthermore, using a “correct” set of data science tools often requires a bridge. Integrating a rigorous component into a messy spreadsheet based pipeline can be an initial step towards the pipeline or team or organization starting on a path of continuous improvement in their processes. Also, spreadsheets are foundational to many (probably most) BizOps teams and therefore are sometimes unavoidable…
In this post I will walk through a short example and some considerations for when you might decide (perhaps against your preferences) to integrate your work with extant spreadsheets or shadow “pipelines” within your organization.

Click through for Bryan’s thoughts on the topic.

Comments closed

Hot, Cool, and Large Numbers

Published 2021-05-11 by Kevin Feasel

Holger von Jouanne-Diedrich hits the casino:

The longest streak in roulette purportedly happened in 1943 in the US when the colour red won 32 consecutive times in a row! A quick calculation shows that the probability of this happening seems to be beyond crazy:
0.5^32[1] 2.328306e-10
So, what is going on here? For once streaks and clustering happen quite naturally in random sequences: if you got something like “red, black, red, black, red, black” and so on I would worry if there was any randomness involved at all (read more about this here: Learning Statistics: Randomness is a strange beast). The point is that any sequence that is defined beforehand is as probable as any other (see also my post last week: The Solution to my Viral Coin Tossing Poll). Yet streaks catch our eye, they stick out.

There’s one critical assumption in this post, which is that the game is fair, in that each event has an equal probability of happening. But as a Bayesian, if a roulette table hits red 32 times in a row, it certainly opens the door to the idea that maybe the odds on that table with that dealer aren’t quite equal between red and black.

Comments closed

Understanding Confidence & Credible Interval Widths

Published 2021-05-10 by Kevin Feasel

John Cook takes us through the notion of confidence intervals and credible intervals:

Suppose you do N trials of something that can succeed or fail. After your experiment you want to present a point estimate and a confidence interval. Or if you’re a Bayesian, you want to present a posterior mean and a credible interval. The numerical results hardly differ, though the two interpretations differ.
If you got half successes, you will report a confidence interval centered around 0.5. The more unbalanced your results were, the smaller your confidence interval will be. That is, the confidence interval will be smallest if you had no successes and widest if you had half successes.
What can we say about how the width of your confidence varies as a function of your point estimate p?

Read on to learn that answer.

Comments closed

Functional Data Analysis in R

Published 2021-05-06 by Kevin Feasel

Joseph Ricker gives us a gentle introduction to a not-so-gentle topic:

This plot might depict 80 measurements for a participant in a clinical trial where each data point represents the change in the level of some protein level. Or it could represent any series of longitudinal data where the measurements are take at irregular intervals. The curve looks like a time series with obvious correlations among the points, but there are not enough measurements to model the data with the usual time series methods. In a scenario like this, you might find Functional Data Analysis (FDA) to be a viable alternative to the usual multi-level, mixed model approach.
This post is meant to be a “gentle” introduction to doing FDA with R for someone who is totally new to the subject. I’ll show some “first steps” code, but most of the post will be about providing background and motivation for looking into FDA. I will also point out some of the available resources that a newcommer to FDA should find helpful.

Read on to learn more.

Comments closed

Category: Data Science