# Category: Data Science

Training data are used to fit each model. Training a model involves using an algorithm to determine model parameters (e.g., weights) or other logic to map inputs (independent variables) to a target (dependent variable). Model fitting can also include input variable (feature) selection. Models are trained by minimizing an error function.

For illustration purposes, let’s say we have a very simple ordinary least squares regression model with one input (independent variable, x) and one output (dependent variable, y). Perhaps our input variable is how many hours of training a dog or cat has received, and the output variable is the combined total of how many fingers or limbs we will lose in a single encounter with the animal.
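The one-input least squares fit described above can be sketched in a few lines. This is a minimal illustration, not the article's own code: the data values and the function names `fit_simple_ols` and `mean_squared_error` are made up for the example, and it uses only the standard library.

```python
def fit_simple_ols(xs, ys):
    """Fit y = slope * x + intercept by minimizing the sum of squared errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form solution for one input: slope = Sxy / Sxx
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """The error function that the fit minimizes, averaged over the rows."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: hours of training (x) vs. fingers or limbs lost (y)
hours = [0, 1, 2, 3, 4, 5]
lost = [5.1, 3.9, 3.2, 1.8, 1.1, 0.2]

slope, intercept = fit_simple_ols(hours, lost)
fitted_mse = mean_squared_error(hours, lost, slope, intercept)
# Any other line should do no better on the training data
nudged_mse = mean_squared_error(hours, lost, slope + 0.1, intercept)
print(slope, intercept, fitted_mse, nudged_mse)
```

Nudging the slope away from the fitted value raises the error, which is what "trained by minimizing an error function" means in this simple case.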

Read on for some good notes, including the difference between mean squared error and average squared error.

Both results show that evaluating two tests on the same family of data will lead to a ~10% chance that a researcher will claim a “significant” result if they look for either test to reject the null. Any claim there is a maximum 5% false positive rate would be mistaken. As an exercise, verify that doing the same on $$m=4$$ tests will lead to an ~18% chance!

A bad testing platform would be one that claims a maximum 5% false positive rate when any one of multiple tests on the same family of data shows significance at the 5% level. Clearly, if a researcher is going to claim that the FWER is no more than $$\alpha$$, then they must control for the FWER and carefully consider how individual tests reject the null.
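The arithmetic behind those percentages can be checked directly. The sketch below assumes the tests are independent, so the chance of at least one false positive across m tests is 1 - (1 - alpha)^m. The Bonferroni adjustment shown is one common way to control the FWER and is not necessarily the approach the article takes; the function names are illustrative.

```python
def family_wise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m


alpha = 0.05
for m in (1, 2, 4):
    print(m, round(family_wise_error_rate(alpha, m), 4))
# 1 0.05
# 2 0.0975
# 4 0.1855

# Testing each of 4 hypotheses at alpha / 4 keeps the family-wise rate under alpha
corrected = family_wise_error_rate(bonferroni_alpha(alpha, 4), 4)
print(corrected < alpha)
```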

This is worth taking some time to read carefully. H/T R-Bloggers.

In a nutshell, here are the definitions for all three.

1. Significance level: In a hypothesis test, the significance level, alpha, is the probability of making the wrong decision when the null hypothesis is true.
2. Confidence level: The probability that if a poll/test/survey were repeated over and over again, the results obtained would be the same. The confidence level equals 1 – alpha.
3. Confidence interval: A range of results from a poll, experiment, or survey that would be expected to contain the population parameter of interest, such as an average response. Confidence intervals are constructed using significance levels / confidence levels.
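To see how the three definitions fit together, here is a minimal sketch that builds a confidence interval for a mean from a chosen significance level. It assumes a normal approximation (a t distribution would be more appropriate for a sample this small), and the survey scores and the function name are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, alpha=0.05):
    """Return (confidence_level, (low, high)) for the sample mean.

    Assumption: normal approximation with the sample standard deviation.
    """
    confidence_level = 1 - alpha
    # Two-sided critical value, e.g. about 1.96 when alpha is 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center = mean(sample)
    half_width = z * stdev(sample) / math.sqrt(len(sample))
    return confidence_level, (center - half_width, center + half_width)


# Hypothetical survey responses on a 1-10 scale
scores = [7, 8, 6, 9, 7, 8, 5, 7, 8, 6, 7, 9]
level, (low, high) = mean_confidence_interval(scores, alpha=0.05)
print(level, low, high)
```

Lowering alpha raises the confidence level and widens the interval, which is the trade-off the definitions describe.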

Read on for several examples and more elaboration.

Principal Component Analysis (PCA) is an unsupervised statistical technique used to examine the interrelation among a set of variables in order to identify the underlying structure of those variables. In simple words, suppose you have 30 feature columns in a data frame: PCA helps reduce the number of features by creating new features, each of which combines the effects of the original features. It is closely related to, but distinct from, factor analysis.
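As a rough sketch of the "combined feature" idea, the code below finds the first principal component of a small synthetic data set using power iteration on the covariance matrix. It sticks to the standard library to stay self-contained; in practice a library such as scikit-learn would be the usual choice. The data, the function name, and the use of power iteration are all assumptions for illustration, not the article's method.

```python
import math
import random


def first_principal_component(rows, iterations=200):
    """Return (unit_vector, explained_variance_ratio) for the top component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Synthetic data: three features driven by one hidden factor plus small noise
random.seed(0)
rows = []
for _ in range(200):
    t = random.gauss(0, 1)
    rows.append([t + random.gauss(0, 0.1),
                 2 * t + random.gauss(0, 0.1),
                 -t + random.gauss(0, 0.1)])

component, ratio = first_principal_component(rows)
print(component, ratio)
```

Because all three columns follow the same hidden factor, one new feature captures nearly all of the variance. The weights in `component` show how each original feature contributes, and having to read those weights is part of what makes PCA results harder to interpret.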

PCA is quite useful in practice, though it has the unfortunate side effect of making it harder to interpret which factors are driving your solution.

One reason that the proper residual graph (for a well fit model) should smooth out to the line y=0 is known as reversion to mediocrity, or regression to the mean.

Imagine that you have an ideal process that always produces a single value y. You don’t actually observe this “true value”; instead, what you observe is y plus (IID, zero mean) noise. You can build a “model” for this process that predicts the mean of the observations, in this case the value 0.1033149. Then you can calculate the residuals of your “model” in the usual way.
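The thought experiment can be simulated in a few lines. This is a sketch under stated assumptions: the true value, noise scale, sample size, and seed are arbitrary, and residuals are taken as observed minus predicted.

```python
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a_mean, b_mean = mean(a), mean(b)
    cov = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    var_a = sum((x - a_mean) ** 2 for x in a)
    var_b = sum((y - b_mean) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


random.seed(2024)
true_value = 0.1  # hypothetical "ideal process" output, never observed directly
observations = [true_value + random.gauss(0, 1) for _ in range(1000)]

# The "model" predicts the mean of the observations for every row
prediction = mean(observations)
residuals = [obs - prediction for obs in observations]

# Against the (constant) prediction, the residuals center on zero
print(mean(residuals))
# Against the observed values, the residuals line up perfectly: large
# observations have large positive residuals purely because of noise
print(pearson(observations, residuals))
```

Residuals plotted against predictions scatter around y=0, while residuals plotted against observed values show a trend even though the model is as good as it can be.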

This post went in a direction I wasn’t expecting, and it was all the better for it.

Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents. It’s known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.

Since topic modeling doesn’t require training, it’s a quick and easy way to start analyzing your data. However, you can’t guarantee you’ll receive accurate results, which is why many businesses opt to invest time training a topic classification model.

The article is long but worth the read, with examples in Python and additional notes for R.

Here’s a diagram of the workflow I used:
2. Use tabulizer to extract tables
3. Clean up data into “tidy” format using tidyverse (mainly dplyr)
4. Visualize trends with ggplot2

Read on for more detail on each step in the process. H/T R-Bloggers.

When it comes to customer feedback, you’ll find that not all the information you get is useful to your company. This feedback can be categorized into non-insightful and insightful data. The former refers to data you had already spotted as problematic, while insightful information either helps you confirm your hypotheses or notice new issues or opportunities.

Let’s imagine your company carries out a customer satisfaction survey, and 60% of the respondents claim that the pricing of your product or service is too high. You can use that valuable data to make a decision. That’s why this kind of data is also called actionable insight: it leads to action, validation, or the rethinking of a specific strategy you have already implemented.

Survey design and implementation can be pretty difficult. This article does a good job pushing you away from some of the pitfalls around it.

The goal of a simple linear model is to fit a line onto this plot to summarize the shape of the data using an equation of the form y = ax + b.

The “a” value is the slope of the fitted line (rise over run) and the “b” value is the intercept on the y-axis (when x is equal to zero).

In the gapminder example, the life expectancy column was assigned as the “y” variable, as it is the outcome that we are interested in predicting or understanding. The year1950 column was assigned as the “x” variable, as it is what we are using to try and measure the change in life expectancy.

This is a little more complicated than adding a regression line to a scatterplot (the “normal” way to do linear regression with Power BI), but this method lets you work with the outputs in a way that the normal method doesn’t.