Data Science – Page 35

Projecting Defensive Back Trajectories with Sagemaker

Published 2020-10-08 by Kevin Feasel

Lin Lee Cheong, et al., relay some interesting research:

NFL’s Next Gen Stats (NGS) powered by AWS accurately captures player and ball data in real time for every play and every NFL game—over 300 million data points per season—through the extensive use of sensors in players’ pads and the ball. With this rich set of tracking data, NGS uses AWS machine learning (ML) technology to uncover deeper insights and develop a better understanding of various aspects and trends of the game. To date, NGS metrics have focused on helping fans better appreciate and understand the offense and defense in gameplay through the application of advanced analytics, particularly in the passing game. Thanks to tracking data, it’s possible to quantify the difficulty of passes, model expected yards after catch, and determine the value of various play outcomes. A logical next step with this analytical information is to evaluate quarterback decision-making, such as whether the quarterback has considered all eligible receivers and evaluated tradeoffs accurately.
To effectively model quarterback decision-making, we considered a few key metrics—mainly the probability of different events occurring on a pass, and the value of said events. A pass can result in three outcomes: completion, incompletion, or interception. NGS has already created models that provide probabilities of these outcomes, but these events rely on information that’s available at only two points during the play: when the ball is thrown (termed as pass-forward), and when the ball arrives to a receiver (pass-arrived). Because of this, creating accurate probabilities requires modeling the trajectory of players between those two points in time.
For these probabilities, the quarterback’s decision is heavily influenced by the quality of defensive coverage on various receivers, because a receiver with a closely covered defender has a lower likelihood of pass completion compared to a receiver who is wide open due to blown coverage. Furthermore, defenders are inherently reactive to how the play progresses. Defenses move in completely different ways depending on which receiver is targeted on the pass. This means that a trajectory model for defenders has to similarly be reactive to the specified targeted receiver in a believable manner.

Click through for details on the study.

Probability Distributions in Real Life

Published 2020-10-07 by Kevin Feasel

Stephanie Glen gives us examples of where specific probability distributions appear naturally:

If you’re in the beginning stages of your data science credential journey, you’re either about to take (or have taken) a probability class. As part of that class, you’re introduced to several different probability distributions, like the binomial distribution, geometric distribution and uniform distribution. You might be tempted to skip over some elementary topics and just scrape by with a bare pass. Because, let’s face it–the way probability is taught (with dice rolls and cards) is far removed from the glamor of data science. You may be wondering
When am I ever going to calculate the probability of five die rolls in a row in real life?

Click through for the answer and for a chart that provides different scenarios for real-world probability distributions.

Reasons Data Science Projects Fail

Published 2020-10-05 by Kevin Feasel

Ryohei Fujimaki summarizes some of the reasons why data science projects can fail:

According to Gartner analyst Nick Heudecker, over 85% of data science projects fail. A report from Dimensional Research indicated that only 4% of companies have succeeded in deploying ML models to a production environment.
Even more critical, the economic downturn caused by the COVID-19 pandemic has placed increased pressure on data science and BI teams to deliver more with less. In this down market, organizations are reassessing which AI/ML models they should develop, how to optimize resources and how to best use valuable budget dollars for maximum impact. In this type of environment, AI/ML project failure is simply not acceptable.

That 85% sounds suspiciously like the percentage of failed business intelligence and data warehouse projects, as well as the percentage of failed big data projects. It’s close enough that it makes me want to come up with some overarching idea that projects based on the consolidation of multiple independent data systems across several business units are liable to fail about 5/6 of the time.

Time Series Forecasting in R

Published 2020-10-02 by Kevin Feasel

Selcuk Disci contrasts a couple of methods for time series forecasting:

It is always hard to find a proper model to forecast time series data. One of the reasons is that models that use time-series data are often exposed to serial correlation. In this article, we will compare k nearest neighbor (KNN) regression, which is a supervised machine learning method, with a more classical stochastic process, autoregressive integrated moving average (ARIMA).
We will use the monthly prices of refined gold futures (XAUTRY) for one gram in Turkish Lira traded on BIST (Istanbul Stock Exchange) for forecasting. We created the data frame starting from 2013. You can download the relevant Excel file from here.
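
To give a feel for the KNN side of that comparison, here is a minimal sketch in plain Python. It is not the article's R code: it assumes one common way of applying KNN to a series, which is to find the k past windows most similar to the most recent window and average the values that followed them. The function names, the `lags` and `k` defaults, and the toy series are all invented for illustration.

```python
from math import sqrt

def knn_forecast(series, lags=3, k=2):
    """One-step forecast: average what followed the k past windows
    closest (Euclidean distance) to the most recent window."""
    target = series[-lags:]
    candidates = []
    # Each candidate is a past window plus the value that came right after it.
    # The range stops early so the target window itself is never a candidate.
    for start in range(len(series) - lags):
        window = series[start:start + lags]
        following = series[start + lags]
        dist = sqrt(sum((a - b) ** 2 for a, b in zip(window, target)))
        candidates.append((dist, following))
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return sum(value for _, value in nearest) / len(nearest)

def knn_forecast_steps(series, steps, lags=3, k=2):
    """Multi-step forecast by feeding each prediction back into the series."""
    history = list(series)
    predictions = []
    for _ in range(steps):
        nxt = knn_forecast(history, lags=lags, k=k)
        predictions.append(nxt)
        history.append(nxt)
    return predictions

# Toy, perfectly periodic series (hypothetical data, not gold prices).
toy_series = [1, 2, 3, 4] * 6
print(knn_forecast_steps(toy_series, steps=4))
```

On a perfectly repeating series, the nearest windows are exact matches, so the forecast simply continues the pattern. Real price data will not be this kind, which is where the comparison against ARIMA becomes interesting.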

Click through for the demonstration. H/T R-Bloggers.

What Makes for a Good Estimator?

Published 2020-10-01 by Kevin Feasel

Jasmine Nettiksimmons and Molly Davies explain what estimators are:

What makes a good estimator? What is an estimator? Why should I care? There is an entire branch of statistics called Estimation Theory that concerns itself with these questions and we have no intention of doing it justice in a single blog post. However, modern effect estimation has come a long way in recent years and we’re excited to share some of the methods we’ve been using in an upcoming post. This will serve as a gentle introduction to the topic and a foundation for understanding what makes some of these modern estimators so exciting.
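
One common yardstick for an estimator, though not necessarily the one the authors emphasize, is bias: whether the estimate is right on average. The sketch below is an assumption-laden illustration in plain Python, comparing two estimators of variance (dividing by n versus n - 1) over many simulated samples. The sample size, true standard deviation, seed, and function names are all made up for the example.

```python
import random

def variance(sample, ddof):
    """Sum of squared deviations divided by (n - ddof)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - ddof)

def simulate(trials=20000, n=5, sigma=2.0, seed=0):
    """Average both variance estimators over many small normal samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        biased_total += variance(sample, ddof=0)    # divide by n
        unbiased_total += variance(sample, ddof=1)  # divide by n - 1
    return biased_total / trials, unbiased_total / trials

biased_mean, unbiased_mean = simulate()
print("true variance:        ", 2.0 ** 2)
print("mean of n estimator:  ", round(biased_mean, 3))
print("mean of n-1 estimator:", round(unbiased_mean, 3))
```

With samples of size 5, the divide-by-n estimator averages out to roughly (n - 1)/n of the true variance, while the n - 1 version lands close to the true value.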

Read on for a very nice introduction to the topic.

Correlation and Predictive Power Score in Python

Published 2020-09-18 by Kevin Feasel

Abhinav Choudhary looks at two methods for understanding the relationship between variables:

dataframes while working in python which is supported by the pandas library. Pandas comes with a function corr() which can be used to find the relationship among the various columns of the data frame.
Syntax: DataFrame.corr()
Returns: DataFrame with values between -1 and 1
For details and parameters of the function, check out Link
Let’s try this in action.
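
As a rough idea of what that call produces: by default, DataFrame.corr() computes pairwise Pearson correlation. The sketch below computes the same coefficient by hand in plain Python, so it runs without pandas installed. The column names and values are hypothetical, and `pearson` and `corr_matrix` are illustrative helpers, not pandas functions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def corr_matrix(cols):
    """Pairwise correlations between every pair of named columns."""
    names = list(cols)
    return {a: {b: pearson(cols[a], cols[b]) for b in names} for a in names}

# Hypothetical columns, invented for illustration.
columns = {
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 64, 70],
    "errors": [9, 7, 6, 4, 1],
}

matrix = corr_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Every column correlates perfectly with itself, the matrix is symmetric, and each entry sits between -1 and 1, matching the "Returns" note above.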

Read on to see how it works, how to visualize results, and where Predictive Power Score can be a better option.

The Power of AUC

Published 2020-09-18 by Kevin Feasel

John Mount takes a deeper look at Area Under the Curve:

I am finishing up a work-note that has some really neat implications as to why working with AUC is more powerful than one might think.
I think I am far enough along to share the consequences here. This started as some, now reappraised, thoughts on the fallacy of thinking that knowing the AUC (area under the curve) means you know the shape of the ROC plot (receiver operating characteristic plot). I now think for many practical applications the AUC number carries a lot more information about the ROC shape than one might expect.

Read on for the explanation.

Principal Component Analysis in Azure ML

Published 2020-09-01 by Kevin Feasel

Dinesh Asanka walks us through Principal Component Analysis as an Azure ML Studio data transformation technique:

We will be discussing one of the most common data reduction techniques, Principal Component Analysis, in Azure Machine Learning in this article. After discussing the basic cleaning techniques and feature selection techniques in previous articles, we will now be looking at a data reduction technique.
A data reduction mechanism can be used to reduce the representation of high-dimensional data. By using a data reduction technique, you can reduce the dimensionality, which will improve the manageability and visualization of data. Further, you can achieve similar accuracies.
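
Outside of Azure ML Studio, the core idea can be sketched in a few lines of plain Python. This is an illustration only, not what the Studio module does internally: it centers the data, builds the covariance matrix, and uses power iteration (an assumption chosen to avoid external libraries) to find just the first principal component. The dataset and function names are invented.

```python
from math import sqrt

def center(rows):
    """Subtract each column's mean from that column."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: dominant eigenvector and eigenvalue of cov."""
    d = len(cov)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return v, eigenvalue

# Hypothetical data: the second column is roughly twice the first,
# and the third barely varies.
rows = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.4],
    [3.0, 6.2, 0.6],
    [4.0, 7.8, 0.5],
    [5.0, 10.1, 0.4],
]

centered = center(rows)
cov = covariance(centered)
component, eigenvalue = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = eigenvalue / total_variance

# Project each three-column row onto the single component.
scores = [sum(a * b for a, b in zip(r, component)) for r in centered]
print("component:", [round(x, 3) for x in component])
print("explained variance ratio:", round(explained_ratio, 4))
print("1-D scores:", [round(s, 3) for s in scores])
```

Because two of the three columns move together, a single component captures nearly all of the variance, which is the sense in which reduced data can still support similar accuracy.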

Read on for the demo.

Explaining the ROC Plot

Published 2020-08-25 by Kevin Feasel

Nina Zumel takes us through what each element of a ROC curve means:

In our data science teaching, we present the ROC plot (and the area under the curve of the plot, or AUC) as a useful tool for evaluating score-based classifier models, as well as for comparing multiple such models. The ROC is informative and useful, but it’s also perhaps overly concise for a beginner. This leads to a lot of questions from the students: what does the ROC tell us about a model? Why is a bigger AUC better? What does it all mean?
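
For readers who want to poke at the mechanics first, here is a small sketch in plain Python, with made-up labels and scores. It builds the ROC points by sweeping the threshold from high to low, integrates them with the trapezoid rule, and then checks the result against one well-known reading of AUC: the probability that a randomly chosen positive example outscores a randomly chosen negative one. The function names are illustrative and not taken from any library.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs from a threshold sweep."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Items tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_pairwise(labels, scores):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier output: 1 = positive class, higher score = more confident.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

points = roc_points(labels, scores)
print("ROC points:", points)
print("AUC (trapezoid):", auc_trapezoid(points))
print("AUC (pairwise): ", auc_pairwise(labels, scores))
```

The two AUC numbers agree, which is one way to see why bigger is better: a higher AUC means the model ranks more positive/negative pairs in the right order.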

Read on for the answer.

Fun with Benford’s Law

Published 2020-08-19 by Kevin Feasel

Nagdev Amruthnath covers a topic which brings me joy:

Benford’s Law is one of the most underrated yet widely used techniques across various applications. United States IRS neither confirms nor denies their use of Benford’s law to detect any number of manipulations in income tax filing. Across the Atlantic, the EU is very open and proudly claims its use of Benford’s law. Today, this is widely used in accounting to detect any fraud. Nigrini, a professor at the University of Cape Town, also used this law to identify financial discrepancies in Enron’s financial statement. In another case, Jennifer Golbeck, a professor at the University of Maryland, was able to identify bot accounts on Twitter using Benford’s law. Xiaoyu Wang from the University of Winnipeg even published a report on how to use Benford’s law on images. In the rest of this article, we will talk about Benford’s law and how it can be applied using R.
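
The law itself fits in one line: the expected share of numbers whose leading digit is d is log10(1 + 1/d). The article works in R; the sketch below is a plain Python illustration instead, and it compares that expected distribution with the leading digits of the first 1,000 powers of 2, a sequence commonly used to demonstrate the law. The function names are invented for the example.

```python
from collections import Counter
from math import log10

def benford_expected():
    """Expected frequency of each leading digit 1..9 under Benford's law."""
    return {d: log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(n):
    """First decimal digit of a nonzero integer."""
    return int(str(abs(n))[0])

def first_digit_freq(values):
    """Observed frequency of each leading digit 1..9, ignoring zeros."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

expected = benford_expected()
observed = first_digit_freq([2 ** k for k in range(1, 1001)])

print("digit  expected  observed")
for d in range(1, 10):
    print(f"{d:>5}  {expected[d]:8.3f}  {observed[d]:8.3f}")
```

In a fraud check, the same comparison would be run against reported figures, with large gaps between the two columns flagging data worth a closer look.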

The applications to images and music were new to me. Very cool. H/T R-Bloggers.

Category: Data Science