Data Science – Page 32

Plotting Correlation Analyses in R

Published 2021-05-17 by Kevin Feasel

Finnstats shows a few techniques for plotting correlation in R:

Correlation is a measure of the strength of a relationship between two variables.
Pearson’s Product-Moment Correlation
One of the most common measures of correlation is Pearson’s product-moment correlation, which is commonly referred to simply as the correlation, or just the letter r.
Correlation shows the strength of a relationship between two variables and is expressed numerically by the correlation coefficient.

Click through for examples from several packages. H/T R-Bloggers.
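
As a rough illustration of the coefficient described in the excerpt, here is a minimal Python sketch that computes Pearson's r from its definition: the sum of cross-products of deviations, divided by the square root of the product of the sums of squared deviations. The function name `pearson_r` and the sample data are made up for this example; the linked post itself works in R with plotting packages.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from the means
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / math.sqrt(ss_x * ss_y)

# Made-up data, for illustration only
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
r = pearson_r(hours, score)
print(round(r, 3))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.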


Fitting Excel Macros into Data Science Pipelines

Published 2021-05-12 by Kevin Feasel

Bryan Shalloway has a process for us:

While I no longer use it regularly for the purposes of analysis, I will always have a soft spot in my heart for Excel. Furthermore, using a “correct” set of data science tools often requires a bridge. Integrating a rigorous component into a messy spreadsheet-based pipeline can be an initial step towards the pipeline or team or organization starting on a path of continuous improvement in their processes. Also, spreadsheets are foundational to many (probably most) BizOps teams and therefore are sometimes unavoidable…
In this post I will walk through a short example and some considerations for when you might decide (perhaps against your preferences) to integrate your work with extant spreadsheets or shadow “pipelines” within your organization.

Click through for Bryan’s thoughts on the topic.


Hot, Cool, and Large Numbers

Published 2021-05-11 by Kevin Feasel

Holger von Jouanne-Diedrich hits the casino:

The longest streak in roulette purportedly happened in 1943 in the US when the colour red won 32 times in a row! A quick calculation shows that the probability of this happening seems to be beyond crazy:
0.5^32
[1] 2.328306e-10
So, what is going on here? For one, streaks and clustering happen quite naturally in random sequences: if you got something like “red, black, red, black, red, black” and so on I would worry if there was any randomness involved at all (read more about this here: Learning Statistics: Randomness is a strange beast). The point is that any sequence that is defined beforehand is as probable as any other (see also my post last week: The Solution to my Viral Coin Tossing Poll). Yet streaks catch our eye; they stick out.

There’s one critical assumption in this post, which is that the game is fair, in that each event has an equal probability of happening. But as a Bayesian, if a roulette table hits red 32 times in a row, it certainly opens the door to the idea that maybe the odds on that table with that dealer aren’t quite equal between red and black.
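
To make both points concrete, here is a small Python sketch. It reproduces the excerpt's calculation, repeats it for wheels with green pockets (18 red pockets out of 37 on a European wheel, 38 on an American one), and then shows a simple Beta-Binomial update of the belief about P(red). The prior strength (500 pseudo-reds and 500 pseudo-blacks) is an arbitrary assumption chosen for illustration, as are the function names.

```python
def streak_probability(p_red, streak_length):
    """Probability that a pre-specified run of reds occurs on consecutive spins."""
    return p_red ** streak_length

# The excerpt's simplification: red and black equally likely
fair = streak_probability(0.5, 32)

# Real wheels also have green pockets, so P(red) is slightly below 0.5
european = streak_probability(18 / 37, 32)
american = streak_probability(18 / 38, 32)

def posterior_mean_red(prior_red, prior_black, reds, blacks):
    """Posterior mean of P(red) under a Beta(prior_red, prior_black) prior."""
    return (prior_red + reds) / (prior_red + prior_black + reds + blacks)

# Hypothetical prior: as confident in a 50/50 table as if we had already
# seen 500 reds and 500 blacks. Then we observe 32 reds and no blacks.
updated = posterior_mean_red(500, 500, 32, 0)

print(fair, european, american)
print(round(updated, 4))
```

Even with a fairly strong prior belief in a fair table, the posterior mean shifts above 0.5 after the streak, which is the sense in which the streak "opens the door" to doubting the table.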


Understanding Confidence & Credible Interval Widths

Published 2021-05-10 by Kevin Feasel

John Cook takes us through the notion of confidence intervals and credible intervals:

Suppose you do N trials of something that can succeed or fail. After your experiment you want to present a point estimate and a confidence interval. Or if you’re a Bayesian, you want to present a posterior mean and a credible interval. The numerical results hardly differ, though the two interpretations differ.
If you got half successes, you will report a confidence interval centered around 0.5. The more unbalanced your results were, the smaller your confidence interval will be. That is, the confidence interval will be smallest if you had no successes and widest if you had half successes.
What can we say about how the width of your confidence interval varies as a function of your point estimate p?

Read on to learn that answer.
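
For a feel of the shape before clicking through, here is a hedged Python sketch using the textbook Wald (normal-approximation) interval, whose width is 2·z·sqrt(p(1−p)/N). This is an assumption made for simplicity and is not necessarily the interval John analyzes; the Wald interval in particular collapses to zero width at p = 0, which is an artifact of the approximation and not a claim of certainty.

```python
import math
from statistics import NormalDist

def wald_width(p_hat, n, confidence=0.95):
    """Width of the Wald confidence interval for a binomial proportion."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)

n = 100  # number of trials, chosen arbitrarily
widths = {p: wald_width(p, n) for p in (0.0, 0.1, 0.25, 0.5)}

for p, w in widths.items():
    print(f"p = {p:.2f}  width = {w:.3f}")
```

The width is proportional to sqrt(p(1−p)), so it is largest at p = 0.5 and shrinks as the results become more unbalanced, matching the excerpt's description.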


Functional Data Analysis in R

Published 2021-05-06 by Kevin Feasel

Joseph Ricker gives us a gentle introduction to a not-so-gentle topic:

This plot might depict 80 measurements for a participant in a clinical trial where each data point represents the change in the level of some protein. Or it could represent any series of longitudinal data where the measurements are taken at irregular intervals. The curve looks like a time series with obvious correlations among the points, but there are not enough measurements to model the data with the usual time series methods. In a scenario like this, you might find Functional Data Analysis (FDA) to be a viable alternative to the usual multi-level, mixed model approach.
This post is meant to be a “gentle” introduction to doing FDA with R for someone who is totally new to the subject. I’ll show some “first steps” code, but most of the post will be about providing background and motivation for looking into FDA. I will also point out some of the available resources that a newcomer to FDA should find helpful.

Read on to learn more.


Random Sequences and Probabilities

Published 2021-04-29 by Kevin Feasel

Holger von Jouanne-Diedrich explains the results of a poll:

Some time ago I conducted a poll on LinkedIn that quickly went viral. I asked which of three different coin tossing sequences were more likely and I received exactly 1,592 votes! Nearly 48,000 people viewed it and more than 80 comments are under the post (you need a LinkedIn account to fully see it here: LinkedIn Coin Tossing Poll).
In this post I will give the solution with some background explanation, so read on!

Read on to understand why, when flipping a fair coin, you’re just as likely to see the sequence H,H,H,H,H,H as you are to see H,T,H,T,H,T.
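
The claim is easy to check by brute force. The Python sketch below enumerates all 64 sequences of six tosses and computes each one's exact probability, assuming a fair coin and independent tosses; the helper name `sequence_probability` is made up for this example.

```python
from fractions import Fraction
from itertools import product

def sequence_probability(sequence, p_heads=Fraction(1, 2)):
    """Exact probability of one specific sequence of independent tosses."""
    prob = Fraction(1)
    for toss in sequence:
        prob *= p_heads if toss == "H" else 1 - p_heads
    return prob

# Every possible sequence of six tosses, e.g. "HHHHHH", "HTHTHT", ...
all_sequences = ["".join(s) for s in product("HT", repeat=6)]
probs = {s: sequence_probability(s) for s in all_sequences}

print(probs["HHHHHH"], probs["HTHTHT"])
```

Every specific sequence has the same probability, 1/64. What differs is how many sequences share a given summary, such as "three heads and three tails", which is why patterned outcomes feel rarer than they are.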


Check Those Feature Distributions

Published 2021-04-16 by Kevin Feasel

Antoine Rebecq shares a warning:

I was recently working on a cool dataset that looked unusually friendly. It was tidy, neat, interesting… the kind of things that you rarely encounter in the wild! My goal was to build a super simple predictor for one of the features. However, I kept getting poor results and at first couldn’t figure out what was happening.

There’s some good, practical advice in there, so check it out. H/T R-Bloggers.


Geospatial Fraud Detection

Published 2021-04-15 by Kevin Feasel

Antoine Amend uses Databricks to identify financial fraud in a geographical area:

As part of this real-world solution, we are releasing a new open source geospatial library, GEOSCAN, to detect geospatial behaviors at massive scale, track customers’ patterns over time and detect anomalous card transactions. Finally, we demonstrate how organizations can surface anomalies from an analytics environment to an online data store (ODS) with tight SLA requirements following a Lambda-like infrastructure underpinned by Delta Lake, Apache Spark and MLflow.

Click through for the article, as well as three notebooks.


Simulating Prediction Intervals

Published 2021-04-12 by Kevin Feasel

Bryan Shalloway continues a series:

Part 1 of my series of posts on building prediction intervals used data held-out from model training to evaluate the characteristics of prediction intervals. In this post I will use hold-out data to estimate the width of the prediction intervals directly. Doing so can provide more reasonable and flexible intervals compared to analytic approaches.

Click through for the article, and be sure to check out part 1 if you haven’t already.
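
As a rough idea of what estimating interval width from hold-out data can look like, here is a minimal Python sketch of one generic approach: take the empirical quantiles of the hold-out residuals and add them to a new point prediction. This is not necessarily Bryan's method (his post works in R and goes further); the function names and data are made up, and the sketch assumes residuals are additive and similarly distributed across predictions.

```python
def empirical_quantile(values, q):
    """Linear-interpolation quantile of a list, for 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

def residual_interval(holdout_actual, holdout_pred, new_pred, coverage=0.90):
    """Prediction interval built from quantiles of hold-out residuals."""
    residuals = [a - p for a, p in zip(holdout_actual, holdout_pred)]
    tail = (1 - coverage) / 2
    low_offset = empirical_quantile(residuals, tail)
    high_offset = empirical_quantile(residuals, 1 - tail)
    return new_pred + low_offset, new_pred + high_offset

# Made-up hold-out data, for illustration only
holdout_actual = [102, 95, 110, 120, 87, 101, 99, 130, 92, 105]
holdout_pred = [100, 100, 105, 115, 90, 100, 100, 120, 95, 100]

lower, upper = residual_interval(holdout_actual, holdout_pred, new_pred=110)
print(lower, upper)
```

Because the offsets come from observed errors, the interval need not be symmetric around the prediction, which is part of the flexibility the excerpt mentions.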


Working with Prediction Intervals

Published 2021-03-31 by Kevin Feasel

Bryan Shalloway explains how generating prediction intervals is different from making point predictions:

Before using the model for predictive inference, one should have reviewed overall performance on a holdout dataset to ensure the model is sufficiently accurate for the business context. For example, for our problem, is an average error of ~12% and 90% prediction intervals of +/- ~25% of Sale_Price useful? If the answer is “no,” that suggests the need for more effort in improving the accuracy of the model (e.g. trying other transformations, features, model types). For our examples we are assuming the answer is ‘yes,’ our model is accurate enough (so it is appropriate to move on and focus on prediction intervals).

Click through for the article.
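
The two hold-out checks mentioned in the excerpt, average percentage error and how often the intervals contain the actual value, can be sketched in a few lines of Python. The data below is made up, the intervals are simply set to ±25% of each prediction to echo the excerpt's example, and the function names are illustrative and not taken from the post.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual| across observations."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def interval_coverage(actual, lower, upper):
    """Share of actual values that fall inside their prediction interval."""
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)

# Made-up hold-out data, for illustration only
actual = [200, 150, 320, 275, 180]
predicted = [210, 140, 300, 290, 175]

# Hypothetical intervals: plus or minus 25% of each point prediction
lower = [p * 0.75 for p in predicted]
upper = [p * 1.25 for p in predicted]

mape = mean_absolute_percentage_error(actual, predicted)
coverage = interval_coverage(actual, lower, upper)
print(round(mape, 4), coverage)
```

If a nominal 90% interval covers far more or far less than 90% of hold-out observations, that is a sign the intervals are too wide or too narrow for the data.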

