Category: Data Science

Generative models take an unsupervised learning approach. The data contains samples of the input variables X but no output variable Y. We train the generative model on the input variables alone, and it learns the patterns in them well enough to generate new output based only on the training data.
In supervised learning, we are more focused on building predictive models from the input variables; this type of modeling is known as discriminative modeling. In a classification problem, the model has to discriminate which class an example belongs to. Unsupervised models, on the other hand, are used to create or generate new examples from the input distribution.
In layman’s terms, generative models can generate new examples that are not only similar to the existing examples but indistinguishable from them.

Click through for the overview.
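
As a rough illustration of the idea (not taken from the linked overview), here is about the simplest generative model there is: fit a normal distribution to unlabeled inputs X, then sample new values from it. The data values and function names are made up for the sketch.

```python
import random
import statistics

def fit_gaussian(samples):
    """Estimate mean and standard deviation from unlabeled samples X."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, stdev, n, seed=0):
    """Draw n new examples from the fitted distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical training data: inputs only, no output variable Y.
X = [4.9, 5.1, 5.0, 4.8, 5.2, 5.3, 4.7, 5.0]

mu, sigma = fit_gaussian(X)
new_examples = generate(mu, sigma, 5)
print(new_examples)
```

Real generative models learn far richer distributions, but the shape is the same: learn from X alone, then sample.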

Comments closed

Monitoring for Distribution Changes

Published 2020-02-13 by Kevin Feasel

Nina Zumel explains how we can track if something has changed by monitoring its distribution:

A client recently came to us with a question: what’s a good way to monitor data or model output for changes? That is, how can you tell if new data is distributed differently from previous data, or if the distribution of scores returned by a model has changed? This client, like many others who have faced the same problem, simply checked whether the mean and standard deviation of the data had changed more than some amount, where the threshold value they checked against was selected in a more or less ad-hoc manner. But they were curious whether there was some other, perhaps more principled way, to check for a change in distribution.

The answer is, of course, that there is. Click through to see a few of the techniques.
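
One candidate for a more principled check, offered here as an assumption rather than as the method in the linked post, is a permutation test on the two-sample Kolmogorov-Smirnov statistic. The sketch below uses only the standard library; the sample sizes, seeds, and number of permutations are arbitrary choices.

```python
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d

def permutation_p_value(a, b, n_perm=200, seed=0):
    """Share of random relabelings whose KS statistic is at least as large as observed."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Simulated data: a baseline sample and a new sample whose mean has shifted.
rng = random.Random(1)
baseline = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(200)]

p_shifted = permutation_p_value(baseline, shifted)
print(p_shifted)
```

A small p-value says the new sample is unlikely to have come from the same distribution as the baseline, without having to hand-pick a threshold on the mean or standard deviation.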

Benford’s Law in Power BI

Published 2020-02-11 by Kevin Feasel

Imke Feldmann shows how you can build up a Benford distribution in DAX:

The green columns show how often each number should be the first digit in numbers that should follow the Benford distribution. In black you’ll see the actual distribution of first digits within my table. Lastly, the red line shows the absolute percentage deviations between actual and Benford values.
In this example, there is a relatively high occurrence of numbers starting with 4 and 5. So this could be a sign of fraudulent manipulation.

In the example, eyeballing it says things look pretty good. It’s interesting to see just how many things fit a Benford distribution, including populations, budgets (when you have enough line items), expenses, etc. Not everything does, however—high and low temperatures tend not to, either in Fahrenheit or Celsius.
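
The linked post does this in DAX; the same comparison can be sketched in Python. Benford's law gives the expected share of leading digit d as log10(1 + 1/d). The helper names and the sample data (the first 100 powers of 2, a sequence known to track Benford closely) are choices made for this sketch.

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit d under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(value):
    """Leading significant digit of a nonzero number."""
    return next(int(ch) for ch in repr(abs(value)) if ch in "123456789")

def first_digit_shares(values):
    """Observed share of each leading digit, ignoring zeros."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Sample data chosen for the sketch: 2**0 through 2**99.
values = [2 ** n for n in range(100)]

expected = benford_expected()
actual = first_digit_shares(values)
deviation = {d: abs(actual[d] - expected[d]) for d in range(1, 10)}

for d in range(1, 10):
    print(d, round(expected[d], 3), round(actual[d], 3), round(deviation[d], 3))
```

Large deviations for particular digits are what the red line in the report is meant to surface.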

Machine Learning through Counterfactuals

Published 2020-02-04 by Kevin Feasel

Amit Sharma announces a new library:

Consider a person who applies for a loan with a financial company, but their application is rejected by a machine learning algorithm used to determine who receives a loan from the company. How would you explain the decision made by the algorithm to this person? One option is to provide them with a list of features that contributed to the algorithm’s decision, such as income and credit score. Many of the current explanation methods provide this information by either analyzing the algorithm’s properties or approximating it with a simpler, interpretable model.
However, these explanations do not help this person decide what to do next to increase their chances of getting the loan in the future. In particular, changing the most important features for prediction may not actually change the decision, and in some cases, important features may be impossible to change, such as age. A similar argument applies when algorithms are used to support decision-makers in scenarios such as screening job applicants, deciding health insurance, or disbursing government aid.

This has the potential to be a great library. One of the issues with machine learning as it stands today is that you can get an answer, but understanding how to change that answer requires a human who understands the model. This looks like a good first step. It’s only available in Python.
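
To make the idea of a counterfactual concrete, here is a toy brute-force search. It does not use the library's API; the scoring rule, feature names, step sizes, and cost function are all invented for illustration. The search looks for the cheapest change to mutable features that flips a rejection into an approval, leaving age untouched.

```python
from itertools import product

def approve(applicant):
    """Hypothetical stand-in for a black-box loan model."""
    score = (5 * applicant["income_k"]
             + applicant["credit_score"]
             - 2 * applicant["age"])
    return score >= 900

def nearest_counterfactual(applicant):
    """Cheapest change to income and credit score that gets approved.

    Age is treated as immutable, so it is never varied.
    """
    income_steps = range(0, 55, 5)    # extra income, in thousands
    credit_steps = range(0, 220, 20)  # extra credit-score points
    best, best_cost = None, float("inf")
    for d_income, d_credit in product(income_steps, credit_steps):
        candidate = dict(
            applicant,
            income_k=applicant["income_k"] + d_income,
            credit_score=applicant["credit_score"] + d_credit,
        )
        # Assumed effort scale: 10k of income costs as much as 100 score points.
        cost = d_income / 10 + d_credit / 100
        if approve(candidate) and cost < best_cost:
            best, best_cost = candidate, cost
    return best

applicant = {"income_k": 50, "credit_score": 600, "age": 40}
counterfactual = nearest_counterfactual(applicant)
print(approve(applicant), counterfactual)
```

The output is an actionable statement ("raise your credit score by this much") rather than a list of feature importances.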

Explaining Black Box Models with LIME

Published 2020-01-27 by Kevin Feasel

Holger von Jouanne-Diedrich takes us through the intuition of LIME:

There is a new hot area of research to make black-box models interpretable, called Explainable Artificial Intelligence (XAI). If you want to gain some intuition on one such approach (called LIME), read on!
Before we dive right into it, it is important to point out when and why you would need interpretability of an AI. While it might be a desirable goal in itself, it is not necessary in many fields, at least not for users of an AI. With text translation or character and speech recognition, for example, it is not that important why they do what they do but simply that they work.
In other areas, like medical applications (determining whether tissue is malignant), financial applications (granting a loan to a customer) or applications in the criminal-justice system (gauging the risk of recidivism) it is of the utmost importance (and sometimes even required by law) to know why the machine arrived at its conclusions.
One approach to make AI models explainable is called LIME for Local Interpretable Model-Agnostic Explanations. There is already a lot in this name!

LIME is not trivial to use and it can be very slow, but it is a great way to visualize models.
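
The core intuition can be sketched without the library: perturb the input around one point, weight each perturbation by how close it is to that point, and fit a simple weighted linear model to the black box's answers. The version below is a deliberately reduced, one-feature, deterministic simplification (real LIME samples perturbations randomly and handles many features); the function names and kernel settings are assumptions.

```python
import math

def black_box(x):
    """Hypothetical opaque model; any callable would do."""
    return x ** 2

def local_surrogate(f, x0, width=0.5, n=41, kernel_width=0.25):
    """Fit a proximity-weighted straight line to f near x0."""
    # Evenly spaced perturbations in [x0 - width, x0 + width].
    offsets = [width * (2 * i / (n - 1) - 1) for i in range(n)]
    xs = [x0 + d for d in offsets]
    ys = [f(x) for x in xs]
    # Gaussian kernel: nearby perturbations count more.
    ws = [math.exp(-(d / kernel_width) ** 2) for d in offsets]

    total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    return slope, y_bar - slope * x_bar

slope, intercept = local_surrogate(black_box, x0=3.0)
print(slope, intercept)
```

The slope is the local explanation: near x0, this is how strongly the model's output responds to the feature.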

Generating Synthetic Data with R

Published 2020-01-24 by Kevin Feasel

Sidharth Macherla uses the conjurer package in R to generate synthetic data:

If you are building data science applications and need some data to demonstrate the prototype to a potential client, you will most likely need synthetic data. In this article, we discuss the steps to generating synthetic data using the R package ‘conjurer’.

One of the toughest problems of generating data is making it look realistic enough. It’s one level of difficulty to build “steady-state” data, but if you want data to follow a combination of trend and random walk…that’s when things get dicey. H/T R-Bloggers
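
For a sense of what "trend plus random walk" means in practice, here is a minimal Python sketch. It is not how the conjurer package works; the function name and default parameters are invented for the example.

```python
import random

def synthetic_series(n, start=100.0, trend=0.5, step_sd=2.0, seed=42):
    """Linear trend plus a Gaussian random walk, reproducible via seed."""
    rng = random.Random(seed)
    walk = 0.0
    series = []
    for t in range(n):
        walk += rng.gauss(0.0, step_sd)  # random-walk component accumulates
        series.append(start + trend * t + walk)
    return series

daily_sales = synthetic_series(30)
print([round(v, 1) for v in daily_sales[:5]])
```

Setting step_sd to zero leaves only the trend, which is a quick way to check that the two components are doing what you expect.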

Concepts in Support Vector Machines

Published 2020-01-23 by Kevin Feasel

Abhijit Telang takes us through the calculations involved in Support Vector Machines and then gives us an example in R:

So, let’s take that out and we are back to old, classical vector algebra. It’s like a person with a bunch of sticks to figure out which one to lay where in a 2-D plane to separate one class of objects from another, provided class definitions are already known.
The problem is which particular shape and length must be chosen to show maximum contrast between classes.
We need to arrive at a function definition, in such a way that the value a given function takes changes drastically (e.g. from a large positive value to a large negative value).

SVM is often great for two-class classification problems, and different variants also work well for multi-class problems.
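
The function that flips sign across the boundary is, in the linear case, f(x) = w·x + b, and the margin between the two classes has width 2/||w||. The sketch below (in Python rather than the R of the linked post) only evaluates such a function; the weights are made up rather than learned, and the training step that finds the maximum-margin w and b is omitted.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Class label +1 or -1 from the sign of the decision value."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and -1."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical separating line x1 + x2 = 3 in a 2-D plane.
w, b = (1.0, 1.0), -3.0

print(classify(w, b, (3.0, 3.0)), classify(w, b, (0.0, 0.0)), margin_width(w))
```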

Against Citizen Data Scientists

Published 2020-01-22 by Kevin Feasel

Bill Schmarzo doesn’t like the idea of “citizen data scientists” very much:

“Hello,” he says. “My name is Dr. Payne and I am your Citizen Dentist for today.”
Citizen Dentist?! You repeat the question out loud for him to hear, wanting an answer to this looney statement. “What is a Citizen Dentist?”
Get this. He replies, “I’m a person who performs dental work, but my proficiency and expertise is outside of the field of dentistry.”

Bill’s alternative is “Citizens of Data Science.” Click through to see what that means and how it differs.

Using Koalas on Azure Databricks

Published 2020-01-21 by Kevin Feasel

Ginger Grant shows how you can install the koalas library on an Azure Databricks cluster:

Unfortunately if you are using an ML workspace, this will not work and you will get the error message org.apache.spark.SparkException: Library utilities are not available on Databricks Runtime for Machine Learning. The Koalas github documentation says “In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning”. What this means is if you want to use it now
Most of the time I want to install on the whole cluster as I segment libraries by cluster. This way if I want those libraries I just connect to the cluster that has them. Now the easiest way to install a library is to open up a running Databricks cluster (start it if it is not running) then go to the Libraries tab at the top of the screen.

Click through for a demo of what you need to do.

Choosing Categorical Features with Python

Published 2020-01-20 by Kevin Feasel

Mesfin Gebeyaw shows how to use Multiple Correspondence Analysis to filter categorical variables for an analysis:

A general guide to interpreting the multiple correspondence analysis plot shown above for business insights would be to note how close the input categorical features are to the target variable, customer churn, and to each other. For instance, senior citizens, customers with fiber optic internet service, those with month-to-month contracts, and single customers or customers with no dependents are associated with a short tenure with the company and a high risk of churn. On the other hand, customers with contracts longer than a year, those with DSL internet service, younger customers, and customers with multiple lines are associated with a long tenure with the company and a higher tendency to stay with the company.

Read the whole thing.
