Category: Data Science

Building A Model: Lumping And Splitting

Anna Schneider and Alex Smolyanskaya explain some of the tradeoffs between lumping groups together and splitting them out when it comes to algorithm selection:

At Stitch Fix, individual personalization is in our DNA: every client is unique and every piece of clothing we send is chosen to be just right. When we buy merchandise, we could choose to lump clients together; algorithms trained on lumped data would steer us toward that little black dress or those comfy leggings that delight a core, modal group of clients. Yet when we split clients into narrower segments and focus on the tails of the distribution, the algorithms have the chance to also tease out that sleek pinstripe blazer or that pair of distressed teal jeans that aren’t right for everyone, but just right for someone. As long as we don’t split our clients so finely that we’re in danger of overfitting, and as long as humans can still understand the algorithm’s recommendations, splitting is the way to go.

In other cases lumping can provide action-oriented clarity for human decision-makers. For example, we might lump clients into larger groups when reporting on business growth for a crisp understanding of holistic business health, even if our models forecast that growth at the level of finer client splits.

Read on and check out their useful chart for figuring out whether lumping or splitting is the better idea for you.

Comments closed

Precision And Recall

Brian Lee Yung Rowe makes the important point that model accuracy is not always the ultimate measure:

Now, AI companies are obliged to tell you how great their model is. They may say something like “our model is 95% accurate”. Zowee! But what does this mean exactly? In terms of binary classification it means that the model chose the correct class 95% of the time. This seems pretty good, so what’s the problem?

Suppose I create an AI that guesses the gender of a technical employee at Facebook. As of 2017, 19% of STEM roles are held by women. Behind the scenes, my model is really simple: it just chooses male every time (bonus question: is this AI?). Because of the data, my model will be 81% accurate. Now 95% doesn’t seem all that impressive. This dataset is known to be unbalanced, because the classes are not proportional. A better dataset would have about 50% women and 50% men. So asking if a dataset is balanced helps to identify some tricks that make models appear smarter than they are.

With wildly unbalanced data (like diagnosing rare diseases), measures like positive predictive value are far more important than overall accuracy.
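To make that concrete, here is a small Python sketch of the always-guess-male model from the quote. The 19/81 split comes from the numbers above; everything else is invented for illustration:

```python
# 100 hypothetical employees: 19 women (class 1), 81 men (class 0).
y_true = [1] * 19 + [0] * 81
y_pred = [0] * 100  # the "model" guesses man every single time

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: of the 19 women, how many did we find?
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy)  # 0.81 -- looks respectable
print(recall)    # 0.0 -- the model never identifies a single woman
```

Note that positive predictive value is not even defined here: the model makes no positive predictions at all, which is its own red flag.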

Comments closed

A Basic Explanation Of Associative Rule Learning

Akshansh Jain has some notes on associative rules:

Support tells us how frequent an item, or an itemset, is in all of the data. It basically tells us how popular an itemset is in the given dataset. For example, in the above-given dataset, if we look at Learning Spark, we can calculate its support by taking the number of transactions in which it has occurred and dividing it by the total number of transactions.

Support{Learning Spark} = 4/5
Support{Programming in Scala} = 2/5
Support{Learning Spark, Programming in Scala} = 1/5

Support tells us how important or interesting an itemset is, based on its number of occurrences. This is an important measure, as in real data there are millions and billions of records, and working on every itemset is pointless, as in millions of purchases if a user buys Programming in Scala and a cooking book, it would be of no interest to us.
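The quoted support values are easy to verify in a few lines of Python. The five transactions below are invented to match the fractions above, since the original table is not reproduced here:

```python
# Five hypothetical transactions consistent with the quoted support values.
transactions = [
    {"Learning Spark", "Programming in Scala"},
    {"Learning Spark"},
    {"Learning Spark", "Cooking for Geeks"},
    {"Programming in Scala"},
    {"Learning Spark"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"Learning Spark"}))                          # 4/5 = 0.8
print(support({"Programming in Scala"}))                    # 2/5 = 0.4
print(support({"Learning Spark", "Programming in Scala"}))  # 1/5 = 0.2
```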

Read the whole thing.

Comments closed

Visualizing Logistic Regression In Action

Sebastian Sauer shows using ggplot2 visuals what happens when there are interaction effects in a logistic regression:

Of course, probabilities greater than 1 do not make sense. That’s the reason why we prefer a “bent” graph, such as the s-type ogive in logistic regression. Let’s plot that instead.

First, we need to get the survival probabilities:

d %>% mutate(pred_prob = predict(glm1, type = "response")) -> d

Notice that type = "response" gives you the probabilities of survival (i.e., of the modeled event).
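Under the hood, type = "response" runs the linear predictor through the logistic function, which is what bends the line and keeps every probability strictly between 0 and 1. A Python sketch with made-up coefficients (not Sauer's fitted glm1 model) illustrates this:

```python
import math

def sigmoid(x):
    """The inverse-logit link used by a binomial GLM."""
    return 1 / (1 + math.exp(-x))

# Hypothetical intercept and slope, for illustration only.
intercept, slope = -1.5, 0.8

for predictor in (-5, 0, 5):
    p = sigmoid(intercept + slope * predictor)
    print(round(p, 3))  # always strictly between 0 and 1
```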

Read the whole thing.

Comments closed

Understanding Confusion Matrices

Eli Bendersky explains what a confusion matrix tells us:

Now comes our first batch of definitions.

  • True positive (TP): Positive test result matches reality — the person is actually sick and tested positive.
  • False positive (FP): Positive test result doesn’t match reality — the test is positive but the person is not actually sick.
  • True negative (TN): Negative test result matches reality — the person is not sick and tested negative.
  • False negative (FN): Negative test result doesn’t match reality — the test is negative but the person is actually sick.

Folks get confused with these often, so here’s a useful heuristic: positive vs. negative reflects the test outcome; true vs. false reflects whether the test got it right or got it wrong.
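The four cells are simple tallies over paired labels, as a short Python sketch (with made-up test results) shows:

```python
# 1 = sick / positive test, 0 = healthy / negative test (assumed encoding).
actual    = [1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # sick, test caught it
fp = sum(a == 0 and p == 1 for a, p in pairs)  # healthy, test cried wolf
tn = sum(a == 0 and p == 0 for a, p in pairs)  # healthy, test agreed
fn = sum(a == 1 and p == 0 for a, p in pairs)  # sick, test missed it

print(tp, fp, tn, fn)  # 2 1 3 2 -- the four cells of the confusion matrix
```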

It’s a nice read.  The next step, after understanding these, is figuring out in which circumstances we want to weigh some of these measures more than others.

Comments closed

Why Does Empirical Variance Use n-1 Instead Of n?

Sebastian Sauer gives us a simulation showing why we use n-1 instead of n as the denominator when calculating the variance of a sample:

Our results show that the variance of the sample is smaller than the empirical variance; however, even the empirical variance is a little too small compared with the population variance (which is 1). Note that the sample size was n=10 in each draw of the simulation. With increasing sample size, both should get closer to the “real” (population) variance (although the bias is negligible for the empirical variance). Let’s check that.

This is an R-heavy post and does a great job of showing that the n-1 correction is necessary, and it ends with recommended reading if you want to understand the why.
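For readers who want the gist without the R, here is a comparable simulation in plain Python: draw many samples of size 10 from a population with variance 1 and average each estimator across draws.

```python
import random

random.seed(42)  # reproducible draws
n, draws = 10, 20_000
biased_sum = corrected_sum = 0.0

for _ in range(draws):
    sample = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    biased_sum += ss / n           # divide by n: biased low
    corrected_sum += ss / (n - 1)  # divide by n - 1: (nearly) unbiased

biased_avg = biased_sum / draws
corrected_avg = corrected_sum / draws
print(round(biased_avg, 3))     # hovers near 0.9, i.e. (n - 1) / n
print(round(corrected_avg, 3))  # hovers near the true variance of 1
```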

Comments closed

The Data Exploration Process

Stacia Varga takes a step back from analyzing NHL data to explore it a little more:

As I mentioned in my last post, I am currently in an exploratory phase with my data analytics project. Although I would love to dive in and do some cool predictive analytics or machine learning projects, I really need to continue learning as much about my data as possible before diving into more advanced techniques.

My data exploration process has the following four steps:

  1. Assess the data that I have at a high level

  2. Determine how this data is relevant to the analytics project I want to undertake

  3. Get a general overview of the data characteristics by calculating simple statistics

  4. Understand the “middles” and the “ends” of your numeric data points

There’s some good stuff in here.  I particularly appreciate Stacia’s consideration of data exploration as an iterative process.

Comments closed

Multi-Class Text Classification In Python

Susan Li has a series on multi-class text classification in Python.  First up is analysis with PySpark:

Our task is to classify San Francisco Crime Description into 33 pre-defined categories. The data can be downloaded from Kaggle.

When a new crime description comes in, we want to assign it to one of the 33 categories. The classifier makes the assumption that each new crime description is assigned to one and only one category. This is a multi-class text classification problem.

  • Input: Descript
  • Example: “STOLEN AUTOMOBILE”
  • Output: Category
  • Example: VEHICLE THEFT

To solve this problem, we will use a variety of feature extraction techniques along with different supervised machine learning algorithms in Spark. Let’s get started!

Then, she looks at multi-class text classification with scikit-learn:

The classifiers and learning algorithms cannot directly process the text documents in their original form, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.

One common approach for extracting features from the text is to use the bag of words model: a model where for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.

Specifically, for each term in our dataset, we will calculate a measure called Term Frequency-Inverse Document Frequency, abbreviated to tf-idf.
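As a rough sketch of the idea, here is one plain tf-idf variant in Python; scikit-learn's TfidfVectorizer applies smoothing and normalization on top of this, and the toy documents below are invented:

```python
import math
from collections import Counter

docs = [
    "stolen automobile",
    "automobile vandalism reported",
    "petty theft reported",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens):
    """Term frequency within one document, weighted down by how many documents share the term."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(term in d for d in tokenized)
    idf = math.log(len(tokenized) / df)
    return tf * idf

# "stolen" is rarer across documents than "automobile", so within the
# first document it earns the higher weight.
print(tf_idf("stolen", tokenized[0]) > tf_idf("automobile", tokenized[0]))  # True
```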

This is a nice pair of articles on the topic.  Natural Language Processing (and dealing with text in general) is one place where Python is well ahead of R in terms of functionality and ease of use.

Comments closed

The Microsoft Team Data Science Process Lifecycle Versus CRISP-DM

Melody Zacharias compares Microsoft’s Team Data Science Process lifecycle with the CRISP-DM process:

As I pointed out in my previous blog, the TDSP lifecycle is made up of five iterative stages:

  1. Business Understanding
  2. Data Acquisition and Understanding
  3. Modeling
  4. Deployment
  5. Customer Acceptance

This is not very different from the six major phases used by the Cross Industry Standard Process for Data Mining (“CRISP-DM”).

This is part of a series on data science that Melody is putting together, so check it out.

Comments closed

Exploratory Analysis With Hockey Data In Power BI

Stacia Varga digs into her hockey data set a bit more:

Once I know whether a variable is numerical or categorical, I can compute statistics appropriately. I’ll be delving into additional types of statistics later, but the very first, simplest statistics that I want to review are:

  • Counts for a categorical variable
  • Minimum and maximum values in addition to mean and median for a numerical value

To handle my initial analysis of the categorical variables, I can add new measures to the model to compute the count using a DAX formula like this, since each row in the games table is unique:

Game Count = countrows(games)
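For comparison, the same first-pass statistics can be computed in plain Python over a hypothetical (made-up) games table:

```python
import statistics

# Hypothetical rows standing in for the games table.
games = [
    {"team": "Knights", "goals": 3},
    {"team": "Sharks", "goals": 1},
    {"team": "Knights", "goals": 4},
    {"team": "Sharks", "goals": 2},
]

game_count = len(games)  # the countrows(games) equivalent
goals = [g["goals"] for g in games]

print(game_count)                # 4
print(min(goals), max(goals))    # 1 4
print(statistics.mean(goals))    # 2.5
print(statistics.median(goals))  # 2.5
```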

It’s interesting seeing Stacia use Power BI for exploratory analysis.  My personal preference would definitely be to dump the data into R, but there’s more than one way to analyze a data set.

Comments closed