Category: Data Science

The Line is NOT the Data
One of the worst things we can do as data analysts is to interpret a regression line as the most important thing on a visual. The important thing here is the per-state set of data points, but our eyes are drawn to the line. The line mentally replaces the data, but in doing so, we lose the noise. And boy, is there a lot of noise.

This was my first point, but I think it’s the most important one to keep in mind: just because we draw a line and there’s a best fit doesn’t mean that fit is actually any good. And if the fit isn’t any good, the line is…optimistic with regard to how informative it is.

Comments closed

Computing a Z Score with R

Published 2020-02-18 by Kevin Feasel

Anisa Dhana shows us a quick example of how to calculate Z score with R:

In short, the z-score is a measure that shows how much away (below or above) of the mean is a specific value (individual) in a given dataset. In the example below, I am going to measure the z value of body mass index (BMI) in a dataset from NHANES.

Because R is a set-oriented, functional programming language, the answer is quite simple.

Comments closed

An Overview of Generative Adversarial Networks

Published 2020-02-17 by Kevin Feasel

Mohammad Waseem takes us through an overview of Generative Adversarial Networks:

Generative models are nothing but those models that use an Unsupervised Learning approach. In a generative model, there are samples in the data i.e input variables X, but it lacks the output variable Y. We use only the input variables to train the generative model and it recognizes patterns from the input variables to generate an output that is unknown and based on the training data only.
In Supervised Learning, we are more aligned towards creating predictive models from the input variables, this type of modeling is known as discriminative modeling. In a classification problem, the model has to discriminate as to which class the example belongs to. On the other hand, unsupervised models are used to create or generate new examples in the input distribution.
To define generative models in layman’s terms we can say, generative models, are able to generate new examples from the sample that are not only similar to other examples but are indistinguishable as well.

Click through for the overview.

Comments closed

Monitoring for Distribution Changes

Published 2020-02-13 by Kevin Feasel

Nina Zumel explains how we can track if something has changed by monitoring its distribution:

A client recently came to us with a question: what’s a good way to monitor data or model output for changes? That is, how can you tell if new data is distributed differently from previous data, or if the distribution of scores returned by a model have changed? This client, like many others who have faced the same problem, simply checked whether the mean and standard deviation of the data had changed more than some amount, where the threshold value they checked against was selected in a more or less ad-hoc manner. But they were curious whether there was some other, perhaps more principled way, to check for a change in distribution.

The answer is, of course, that there is. Click through to see a few of the techniques.

Comments closed

Benford’s Law in Power BI

Published 2020-02-11 by Kevin Feasel

Imke Feldmann shows how you can build up a Benford distribution in DAX:

The green columns show how often each number should be the first digit in numbers that should follow the Benford-distribution. In black you’ll see the actual distribution of first digits within my table. Lastly, the red line shows the percentual absolute deviations between actual and Benford values.
In this example, there is a relatively high occurrence of numbers starting with 4 and 5. So this could be a sign for fraudulent manipulations.

In the example, eyeballing it says things look pretty good. It’s interesting to see just how many things fit a Benford distribution, including populations, budgets (when you have enough line items), expenses, etc. Not everything does, however—high and low temperatures tend not to, either in Fahrenheit or Celsius.

Comments closed

Machine Learning through Counterfactuals

Published 2020-02-04 by Kevin Feasel

Amit Sharma announces a new library:

Consider a person who applies for a loan with a financial company, but their application is rejected by a machine learning algorithm used to determine who receives a loan from the company. How would you explain the decision made by the algorithm to this person? One option is to provide them with a list of features that contributed to the algorithm’s decision, such as income and credit score. Many of the current explanation methods provide this information by either analyzing the algorithm’s properties or approximating it with a simpler, interpretable model.
However, these explanations do not help this person decide what to do next to increase their chances of getting the loan in the future. In particular, changing the most important features for prediction may not actually change the decision, and in some cases, important features may be impossible to change, such as age. A similar argument applies when algorithms are used to support decision-makers in scenarios such as screening job applicants, deciding health insurance, or disbursing government aid.

This has the potential to be a great library. One of the issues with machine learning as it stands today is that you can get an answer, but to understand how to change the answer requires having a human understand the model. This looks like a good first step. It’s only available in Python.

Comments closed

Explaining Black Box Models with LIME

Published 2020-01-27 by Kevin Feasel

Holger von Jouanne-Diedrich takes us through the intuition of LIME:

There is a new hot area of research to make black-box models interpretable, called Explainable Artificial Intelligence (XAI), if you want to gain some intuition on one such approach (called LIME), read on!
Before we dive right into it it is important to point out when and why you would need interpretability of an AI. While it might be a desirable goal in itself it is not necessary in many fields, at least not for users of an AI, e.g. with text translation, character and speech recognition it is not that important why they do what they do but simply that they work.
In other areas, like medical applications (determining whether tissue is malignant), financial applications (granting a loan to a customer) or applications in the criminal-justice system (gauging the risk of recidivism) it is of the utmost importance (and sometimes even required by law) to know why the machine arrived at its conclusions.
One approach to make AI models explainable is called LIME for Local Interpretable Model-Agnostic Explanations. There is already a lot in this name!

LIME is not trivial to use and it can be very slow, but it is a great way to visualize models.

Comments closed

Generating Synthetic Data with R

Published 2020-01-24 by Kevin Feasel

Sidharth Macherla uses the conjurer package in R to generate synthetic data:

If you are building data science applications and need some data to demonstrate the prototype to a potential client, you will most likely need synthetic data. In this article, we discuss the steps to generating synthetic data using the R package ‘conjurer’.

One of the toughest problems of generating data is making it look realistic enough. It’s one level of difficulty to build “steady-state” data, but if you want data to follow a combination of trend and random walk…that’s when things get dicey. H/T R-Bloggers

Comments closed

Concepts in Support Vector Machines

Published 2020-01-23 by Kevin Feasel

Abhijit Telang takes us through the calculations involved in Support Vector Machines and then gives us an example in R:

So, let’s take that out and we are back to old, classical vector algebra. It’s like a person with a bunch of sticks to figure out which one to lay where in a 2-D plane to separate one class of objects from another, provided class definitions are already known.
The problem is which particular shape and length must be chosen to show maximum contrast between classes.
We need to arrive at a function definition, in such a way that the value a given function takes changes drastically (e.g. from a large positive value to a large negative value).

SVM is often great for two-class classification problems, and different variants also work well for multi-class problems.

Comments closed

Against Citizen Data Scientists

Published 2020-01-22 by Kevin Feasel

Bill Schmarzo doesn’t like the idea of “citizen data scientists” very much:

“Hello,” he says. “My name is Dr. Payne and I am your Citizen Dentist for today.”
Citizen Dentist?! You repeat the question out loud for him to hear, want an answer to this looney statement. “What is a Citizen Dentist?”
Get this. He replies, “I’m a person who performs dental work, but my proficiency and expertise is outside of the field of dentistry.”

Bill’s alternative is “Citizens of Data Science.” Click through to see what that means and how it differs.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31