Data Science – Page 31

The Intuition Behind Averaging

Published 2020-12-29 by Kevin Feasel

In this diagram, there are a bunch of numbers and a single question mark. Behind the question, is also a number. The known numbers are the same as in our friend v above.
Our task is as follows:
– Make a guess on what that mystery number could be. And,
– If we can’t get it right, then reduce, as much as possible, the error we incur on our guess.

This is a well-written explanation of an important concept. H/T R-Bloggers

Comments closed

Stochastic Processes in R

Published 2020-12-02 by Kevin Feasel

David Robinson takes us through simulation of a random walk in R:

What’s fun about this problem is that it’s an example of a random walk: a stochastic process made up of a sequence of random steps (in this case, left or right). What makes this a fun variation is that it’s a random walk in a circle- passing 5 to the left is the same as passing 15 to the right. I wasn’t previously familiar with a random walk in a circle, so I approached it through simulation to learn about its properties.

Click through for a simulation. Or 50,000 of them.

Comments closed

Understanding Decision Trees

Published 2020-12-01 by Kevin Feasel

Ram Tavva walks us through the algorithm to design decision trees:

A decision tree is made up of several nodes:
1.Root Node: A Root Node represents the entire data and the starting point of the tree. From the above example the
First Node where we are checking the first condition, whether the movie belongs to Hollywood or not that is the
Rood node from which the entire tree grows
2.Leaf Node: A Leaf Node is the end node of the tree, which can’t split into further nodes.
From the above example ‘watch movie’ and ‘Don’t watch ‘are leaf nodes.
3.Parent/Child Nodes: A Node that splits into a further node will be the parent node for the successor nodes. The
nodes which are obtained from the previous node will be child nodes for the above node.

Read on for an example of implementation in R.

Comments closed

The Effects of Undersampling and Oversampling on Predicted Probability

Published 2020-12-01 by Kevin Feasel

Bryan Shalloway has an interesting article for us:

In classification problems, under and over sampling techniques shift the distribution of predicted probabilities towards the minority class. If your problem requires accurate probabilities you will need to adjust your predictions in some way during post-processing (or at another step) to account for this.

Bryan has a clear example showing this problem in action.

Comments closed

Defining Simple Regression

Published 2020-11-24 by Kevin Feasel

Vincent Granville has a new algorithm for us:

The model-free, data-driven technique discussed here is so basic that it can easily be implemented in Excel, and we actually provide an Excel implementation. It is surprising that this technique does not pre-date standard linear regression, and is rarely if ever used by statisticians and data scientists. It is related to kriging and nearest neighbor interpolation, and apparently first mentioned in 1965 by Harvard scientists working on GIS (geographic information systems). It was referred back then as Shepard’s method or inverse distance weighting, and used for multivariate interpolation on non-regular grids (see here and here). We call this technique simple regression.

Read on to learn more about simple regression, including how to implement it and how it performs.

Comments closed

Understanding Skewness and Kurtosis

Published 2020-11-10 by Kevin Feasel

George Pipis explains two key concepts of a distribution:

Most commonly a distribution is described by its mean and variance which are the first and second moments respectively. Another less common measures are the skewness (third moment) and the kurtosis (fourth moment). Today, we will try to give a brief explanation of these measures and we will show how we can calculate them in R.

Click through for an explanation. H/T R-Bloggers

Comments closed

GIS Capabilities in R

Published 2020-11-05 by Kevin Feasel

Lionel Hertzog shows off spatial capabilities in R:

All of these operations follow the same logic, st_operation(A, B) checks for each combinations of the geometries in A and B whether A operation B is true or false. For instance st_within(A, B) checks whether the geometries in A are within B, this is similar to st_contains(B, A), the difference between the two being the shape of the returned object. If A has n geometries and B has m, st_contains(B, A) returns a list of length m where each elements contains the row IDs (numbers between 1 and n) of the geometries in A satisfying the operation. By using sparse=FALSE the functions returns matrices, like st_within(A, B, sparse=FALSE) returns a n x m matrix, st_within(B, A, sparse=FALSE) returns a m x n matrix. Note that running st_operation(A, A) checks the operation between all geometries of the object, so returning a n x n matrix.

Click through for part 1 of the series.

Comments closed

A Myth Surrounding Confidence Intervals

Published 2020-11-05 by Kevin Feasel

Sebastian Sauer takes us through confidence intervals:

Some one claims:
“A CI is the interval where 95% of all possible sample means fall into”.
Let’s check that.

Read the whole thing. Confidence intervals have their value, but this is one case where the frequentist model of statistics can trip you up between hypothetical and real, changing an accurate statement into a myth.

Comments closed

Outlier Identification Using Spark 3.0

Published 2020-10-27 by Kevin Feasel

Tori Tompkins takes us through principles of anomaly detection in Apache Spark 3.0:

To calculate Median Absolute Deviation (MAD) you need to calculate the difference between the value and the median. In simpler terms, you will need to calculate the median of the entire dataset, the difference between each value and this median, then take another median of all the differences.
In Spark you can use a SQL expression ‘percentile()’ to calculate any medians or quartiles in a dataframe. ‘percentile()’ expects a column and an array of percentiles to calculate (for median we can provide ‘array(0.5)’ because we want the 50% value ie median) and will return an array of results.
Like standard deviation, to use MAD to identify the outliers it needs to be a certain number of MAD’s away. This number is also referred to as the threshold and is defaulted to 3.

Read on for three measures and their implementations in PySpark.

Comments closed

Against Logistic Regression

Published 2020-10-22 by Kevin Feasel

Charles Thibault explains why you might not want to use logistic regression at all:

I wrote a post on my LinkedIn about why you should NEVER run a Logistic Regression. (Unless you really have to).

Click through for the kernel of the argument or check out the video.

Comments closed

Category: Data Science