Data Science – Page 40

Both results show that evaluating two tests on the same family of data will lead to a ~10% chance that a researcher will claim a “significant” result if they look for either test to reject the null. Any claim there is a maximum 5% false positive rate would be mistaken. As an exercise, verify that doing the same on \(m=4\) tests will lead to an ~18% chance!
A bad testing platform would be one that claims a maximum 5% false positive rate when any one of multiple tests on the same family of data show significance at the 5% level. Clearly, if a researcher is going to claim that the FWER is no more than \(\alpha\), then they must control for the FWER and carefully consider how individual tests reject the null.

This is worth taking some time to read carefully. H/T R-Bloggers

Comments closed

Significance, Confidence Level, and Confidence Interval

Published 2019-10-02 by Kevin Feasel

Stephanie Glen disambiguates three commonly confused but quite different terms:

In a nutshell, here are the definitions for all three.
1. Significance level: In a hypothesis test, the significance level, alpha, is the probability of making the wrong decision when the null hypothesis is true.
2. Confidence level: The probability that if a poll/test/survey were repeated over and over again, the results obtained would be the same. A confidence level = 1 – alpha.
3. Confidence interval: A range of results from a poll, experiment, or survey that would be expected to contain the population parameter of interest. For example, an average response. Confidence intervals are constructed using significance levels / confidence levels.

Read on for several examples and more elaboration.

Comments closed

Principal Component Analysis in Python

Published 2019-10-02 by Kevin Feasel

Abhinav Choudhary shows us how to implement Principal Component Analysis in Python:

Principal Component Analysis (PCA) is an unsupervised statistical technique used to examine the interrelation among a set of variables in order to identify the underlying structure of those variables. In simple words, suppose you have 30 features column in a data frame so it will help to reduce the number of features making a new feature which is the combined effect of all the feature of the data frame. It is also known as factor analysis.

PCA is quite useful in practice, though it has the unfortunate side effect of making it harder to interpret which factors are driving your solution.

Comments closed

Fun with Residual Plots

Published 2019-09-30 by Kevin Feasel

Nina Zumel explains why, when plotting residuals, you always put predictions on the X axis and residuals on the Y axis:

One reason that the proper residual graph (for a well fit model) should smooth out to the line y=0 is known as reversion to mediocrity, or regression to the mean.
Imagine that you have an ideal process that always produces a single value y. You don’t actually observe this “true value”; instead, what you observe is y plus (IID, zero mean) noise. You can build a “model” for this process that predicts the mean of the observations, in this case the value 0.1033149. Then you can calculate the residuals of your “model” in the usual way.

This post went in a direction I wasn’t expecting, and it was all the better for it.

Comments closed

Topic Modeling

Published 2019-09-30 by Kevin Feasel

Federico Pascual has an article on topic modeling and topic classification:

Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents. It’s known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.
Since topic modeling doesn’t require training, it’s a quick and easy way to start analyzing your data. However, you can’t guarantee you’ll receive accurate results, which is why many businesses opt to invest time training a topic classification model.

The article is long but worth the read, with examples in Python and additional notes for R.

Comments closed

PDF Scraping with R

Published 2019-09-26 by Kevin Feasel

Jennifer Cooper shows how you can use R to scrape data from a text-based PDF:

Here’s a diagram of the workflow I used:
1. Start with PDF
2. Use tabulizer to extract tables
3. Clean up data into “tidy” format using tidyverse (mainly dplyr)
4. Visualize trends with ggplot2

Read on for more detail on each step in the process. H/T R-Bloggers.

Comments closed

A Primer on Survey Analysis

Published 2019-09-23 by Kevin Feasel

Federico Pascual has a long primer on survey analysis:

When it comes to customer feedback, you’ll find that not all the information you get is useful to your company. This feedback can be categorized into non-insightful and insightful data. The former refers to data you had already spotted as problematic, while insightful information either helps you confirm your hypotheses or notice new issues or opportunities.
Let’s imagine your company carries out a customer satisfaction survey, and 60% of the respondents claim that the pricing of your product or service is too high. You can use that valuable data to make a decision. That’s why this data is also called actionable insights because they either lead to action, validation, or rethinking of a specific strategy you have already implemented.

Survey design and implementation can be pretty difficult. This article does a good job pushing you away from some of the pitfalls around it.

Comments closed

Linear Regression in Power BI

Published 2019-09-23 by Kevin Feasel

Joseph Yeates shows how to implement linear regression in Power BI:

The goal of a simple linear model is to fit a line onto this plot to summarize the shape of the data using the equation above.
The “a” value is the slope of the fitted line (rise over run) and the “b” value is the intercept on the y-axis (when x is equal to zero).
In the gapminder example, the life expectancy column was assigned as the “y” variable, as it is the outcome that we are interested in predicting or understanding. The year1950 column was assigned as the “x” variable, as it is what we are using to try and measure the change in life expectancy.

This is a little more complicated than adding a regression line to a scatterplot (the “normal” way to do linear regression with Power BI) but this method lets you work with the outputs in a way that the normal method doesn’t.

Comments closed

Python and R Data Reshaping

Published 2019-09-10 by Kevin Feasel

John Mount takes us through a couple of data shaping packages:

The advantages of data_algebra and cdata are:
– The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
– The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
– The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Let’s re-work a small R cdata example, using the Python package data_algebra.

Click through for the example.

Comments closed

When to Use Different ML Algorithms

Published 2019-09-09 by Kevin Feasel

Stefan Franczuk explains the different categories of machine learning algorithms available in Talend:

Clustering is the task of grouping together a set of objects in such a way, that objects in the same group are more similar to each other than to those in other groups. Clustering is really useful for identify separate groups and therefore is used to solve use cases such as “who are my premium customers?”.

Understanding when to use which algorithm is important. You don’t want to build out the world’s best regression if your benefactors are asking for a classifier.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Data Science

Multiple Hypothesis Testing with R

Significance, Confidence Level, and Confidence Interval

Principal Component Analysis in Python

Fun with Residual Plots

Topic Modeling

PDF Scraping with R

A Primer on Survey Analysis

Linear Regression in Power BI

Python and R Data Reshaping

When to Use Different ML Algorithms