Data Science – Page 32

According to Gartner analyst Nick Heudecker, over 85% of data science projects fail. A report from Dimensional Research indicated that only 4% of companies have succeeded in deploying ML models to production environment.
Even more critical, the economic downturn caused by the COVID-19 pandemic has placed increased pressure on data science and BI teams to deliver more with less. In this down market, organizations are reassessing which AI/ML models they should develop, how to optimize resources and how to best use valuable budget dollars for maximum impact. In this type of environment, AI/ML project failure is simply not acceptable.

That 85% sounds suspiciously like the percentage of failed business intelligence and data warehouse projects, as well as the percentage of failed big data projects. It’s close enough that it makes me want to come up with some overarching idea that projects based on the consolidation of multiple independent data systems across several business units are liable to fail about 5/6 of the time.

Comments closed

Time Series Forecasting in R

Published 2020-10-02 by Kevin Feasel

Selcuk Disci contrasts a couple of methods for time series forecasting:

It is always hard to find a proper model to forecast time series data. One of the reasons is that models that use time-series data often expose to serial correlation. In this article, we will compare k nearest neighbor (KNN) regression which is a supervised machine learning method, with a more classical and stochastic process, autoregressive integrated moving average (ARIMA).
We will use the monthly prices of refined gold futures(XAUTRY) for one gram in Turkish Lira traded on BIST(Istanbul Stock Exchange) for forecasting. We created the data frame starting from 2013. You can download the relevant excel file from here.

Click through for the demonstration. H/T R-Bloggers.

Comments closed

What Makes for a Good Estimator?

Published 2020-10-01 by Kevin Feasel

Jasmine Nettiksimmons and Molly Davies explain what estimators are:

What makes a good estimator? What is an estimator? Why should I care? There is an entire branch of statistics called Estimation Theory that concerns itself with these questions and we have no intention of doing it justice in a single blog post. However, modern effect estimation has come a long way in recent years and we’re excited to share some of the methods we’ve been using in an upcoming post. This will serve as a gentle introduction to the topic and a foundation for understanding what makes some of these modern estimators so exciting.

Read on for a very nice introduction to the topic.

Comments closed

Correlation and Predictive Power Score in Python

Published 2020-09-18 by Kevin Feasel

Abhinav Choudhary looks at two methods for understanding the relationship between variables:

dataframes while working in python which is supported by the pandas library. Pandas come with a function corr() which can be used in order to find relation amongst the various columns of the data frame.
Syntax :DataFrame.corr()
Returns:dataframe with value between -1 and 1
For details and parameter about the function check out Link
Let’s try this in action.

Read on to see how it works, how to visualize results, and where Predictive Power Score can be a better option.

Comments closed

The Power of AUC

Published 2020-09-18 by Kevin Feasel

John Mount takes a deeper look at Area Under the Curve:

I am finishing up a work-note that has some really neat implications as to why working with AUC is more powerful than one might think.
I think I am far enough along to share the consequences here. This started as some, now reappraised, thoughts on the fallacy of thinking knowing the AUC (area under the curve) means you know the shape of the ROC plot (receiver operating characteristic plot]. I now think for many practical applications the AUC number carries a lot more information about the ROC shape than one might expect.

Read on for the explanation.

Comments closed

Principal Component Analysis in Azure ML

Published 2020-09-01 by Kevin Feasel

Dinesh Asanka walks us through Principal Component Analysis as an Azure ML Studio data transformation technique:

We will be discussing one of the most common Data Reduction Technique named Principal Component Analysis in Azure Machine Learning in this article. After discussing the basic cleaning techniques, feature selection techniques in previous articles, now we will be looking at a data reduction technique in this article.
Data Reduction mechanism can be used to reduce the representation of the large dimensional data. By using a data reduction technique, you can reduce the dimensionality that will improve the manageability and visualability of data. Further, you can achieve similar accuracies.

Read on for the demo.

Comments closed

Explaining the ROC Plot

Published 2020-08-25 by Kevin Feasel

Nina Zumel takes us through what each element of a ROC curve means:

In our data science teaching, we present the ROC plot (and the area under the curve of the plot, or AUC) as a useful tool for evaluating score-based classifier models, as well as for comparing multiple such models. The ROC is informative and useful, but it’s also perhaps overly concise for a beginner. This leads to a lot of questions from the students: what does the ROC tell us about a model? Why is a bigger AUC better? What does it all mean?

Read on for the answer.

Comments closed

Fun with Benford’s Law

Published 2020-08-19 by Kevin Feasel

Nagdev Amruthnath covers a topic which brings me joy:

Benford’s Law is one of the most underrated and widely used techniques that are commonly used in various applications. United States IRS neither confirms nor denies their use of Benford’s law to detect any number of manipulations in income tax filing. Across the Atlantic, the EU is very open and proudly claims its use of Benford’s law. Today, this is widely used in accounting to detect any fraud. Nigrini, a professor at the University of Cape Town, also used this law to identify financial discrepancies in Enron’s financial statement. In another case, Jennifer Golbeck, a professor at the University of Maryland, was able to identify bot accounts on twitter using Benford’s law. Xiaoyu Wang from the University of Winnipeg even published a report on how to use Benford’s law on images. In the rest of this article, we will take about Benford’s law and how it can be applied using R.

The applications to images and music were new to me. Very cool. H/T R-Bloggers

Comments closed

Covariance and Multicollinearity

Published 2020-08-14 by Kevin Feasel

Mattan Ben-Shachar gives us an intuitive understanding of multicollinearity and how it can affect an analysis:

The common and almost default approach is to fix age to a constant. This is really what our model does in the first place: the coefficient of height represents the expected change in weight while age is fixed and not allowed to vary. What constant? A natural candidate (and indeed emmeans’ default) is the mean. In our case, the mean age is 14.9 years. So the expected values produced above are for three 14.9 year olds with different heights. But is this data plausible? If I told you I saw a person who was 120cm tall, would you also assume they were 14.9 years old?
No, you would not. And that is exactly what covariance and multicollinearity mean – that some combinations of predictors are more likely than others.

I liked the explanation Mattan provides us. Also be sure to read the warnings near the end of the post around other things to try. H/T R-bloggers

Comments closed

Classification Problems and Classification Rules

Published 2020-08-14 by Kevin Feasel

John Mount warns against simply returning a class in a classification problem:

This statement is a bit of word-play which I will need to unroll a bit. However, the concrete advice is that you often get better results using models that return a continuous score for classification problems. You should make that numeric score available to downstream business logic instead of making a class choice at model prediction time. Informally the word “classifier” to informally mean “scoring procedure for classes” is not that harmful. Losing a numeric score is harmful.

Read the whole thing, as John lays out a good argument.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Data Science

Reasons Data Science Projects Fail

Time Series Forecasting in R

What Makes for a Good Estimator?

Correlation and Predictive Power Score in Python

The Power of AUC

Principal Component Analysis in Azure ML

Explaining the ROC Plot

Fun with Benford’s Law

Covariance and Multicollinearity

Classification Problems and Classification Rules