
Category: Data Science

Evaluating Regression Models in Azure ML

Dan Fitton continues a series on model evaluation with Azure Machine Learning:

The initial go-to metric for understanding a regression model is the R squared (or R2) value, also known as the coefficient of determination. R squared measures how well the model is fitted to the data – the goodness of fit. It indicates how much of the variation of y (the target) is explained by the variation in x (the features).
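
As a rough illustration of the definition in the quote, here is a minimal Python sketch that computes R squared as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample numbers are invented for this example; they do not come from Dan's post or from Azure ML.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Variation left unexplained by the model's predictions.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total variation of the target around its mean.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

score = r_squared(y_true, y_pred)

# A model that always predicts the mean explains none of the variation.
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])
```

Here `score` is close to 1 because the predictions track the targets closely, while `baseline` is 0 because predicting the mean explains nothing.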

The measures are bog standard if you’ve worked with regressions before, and Dan does a good job explaining them.


Python Cross-Validation

John Mount has some advice if you’re doing cross-validation in Python:

Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don’t use the default cross validation settings. The default can be a deterministic, and even ordered, split, which is not in general what one wants or expects from a statistical point of view. From a software engineering point of view the defaults may be sensible: since they don’t touch the pseudo-random number generator, they are repeatable, deterministic, and side-effect free.
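
To make the concern concrete, here is a small standard-library sketch of what an ordered split does to data that happens to be sorted by class. The helper `k_fold_indices` is hypothetical and written for this example; it is not John's code or any library's API. In scikit-learn, the analogous switch is the `shuffle` parameter of `KFold` (with `random_state` for repeatability).

```python
import random


def k_fold_indices(n_samples, n_splits, shuffle=False, seed=None):
    """Split range(n_samples) into n_splits test folds of near-equal size."""
    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator keeps the shuffled split repeatable.
        random.Random(seed).shuffle(indices)
    folds = []
    start = 0
    for i in range(n_splits):
        # The first (n_samples % n_splits) folds get one extra element.
        size = n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


# Hypothetical data set sorted by class: six 0s followed by six 1s.
labels = [0] * 6 + [1] * 6

ordered = k_fold_indices(len(labels), 3)
shuffled = k_fold_indices(len(labels), 3, shuffle=True, seed=0)

# With the ordered split, the first test fold contains only class 0.
first_fold_labels = [labels[i] for i in ordered[0]]
```

With sorted input, the ordered split hands the model a test fold from a single class, which is the kind of surprise the tip warns about. Shuffling with a fixed seed avoids that while staying repeatable.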

This issue falls under “read the manual”, but it is always frustrating when the defaults are not sufficiently generous.

Click through to see the problem and how you can fix it.


The Hype Cycle for Artificial Intelligence

William Vorhies takes a look at Gartner’s hype cycle for AI (among other things):

Suppose you’re a business leader trying to make an intelligent decision about prioritizing your AI adoption plans. It’s likely that, like many of us, the first thing you’d reach for would be one of Gartner’s many hype cycle or magic quadrant analyses.

What you might not know is that you now need an expert just to guide you through the expert literature.  There has been such a proliferation of hype cycles and magic quadrants that you could easily be looking in the wrong place.

The hype cycle is definitely opinion-based, but I think it’s a useful look at the relative maturity of different segments of an industry or technology cluster. Do read the whole thing, though, as these things aren’t perfect.


Converting Odds to Probabilities with R

Jonas Christoffer Lindstrom has a new package:

Now you might think that converting decimal odds to probabilities should be easy: you can just use the definition above and take the inverse of the odds to recover the probability. But it is not that simple, since in practice using this simple formula will give you improper probabilities. They will not sum to 1, as they should, but be slightly larger. This gives the bookmakers an edge, and the probabilities (which aren’t real probabilities) cannot be considered fair, so different methods for correcting this exist.
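
A quick Python sketch shows the issue. The odds below are invented for illustration, and the correction shown is plain proportional normalization, which is only one of the possible methods and not necessarily what the package does by default.

```python
# Hypothetical decimal odds for a three-outcome event (home, draw, away).
decimal_odds = [2.5, 3.2, 2.9]

# Naive conversion: the inverse of each decimal odds value.
raw = [1 / o for o in decimal_odds]

# The amount by which the naive probabilities exceed 1 (the bookmaker's margin).
overround = sum(raw) - 1

# Simplest correction: rescale so the probabilities sum to 1.
fair = [p / sum(raw) for p in raw]
```

The `raw` values sum to more than 1, and `overround` is the excess. Dividing by the total gives proper probabilities, under the assumption that the margin is spread proportionally across outcomes.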

Read on to learn more about the problem and a few solutions. H/T R-Bloggers.


Multi-Armed Bandits

Alex Slivkins has a new book:

If you’ve ever been in a casino, you may have found yourself asking one very pertinent question: On which slot machine am I going to hit the jackpot? Standing in front of a bank of identical-looking machines, you have only instinct to go on. It isn’t until you start putting your money into these one-armed bandits, as they’re also known, that you get a sense of which are hot and which are not, and when you find one that’s paying out regularly, you might stick with it in hopes of winning big. Though seemingly specific to the Las Vegas Strip, this scenario boils down to an exploration-exploitation tradeoff: make a decision based on what you already know and miss out on a potentially bigger reward or spend time and resources continuing to gather information.
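
The exploration-exploitation tradeoff can be sketched with an epsilon-greedy strategy, one common and simple bandit approach. This is an illustration and not taken from the book; the function name, payout probabilities, and parameter values are all assumptions made for the example.

```python
import random


def epsilon_greedy(payout_probs, n_pulls=5000, epsilon=0.1, seed=42):
    """Play a simulated bank of slot machines with an epsilon-greedy policy."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    counts = [0] * n_arms    # how often each machine was played
    values = [0.0] * n_arms  # running average reward per machine
    total_reward = 0
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            # Explore: try a machine at random to gather information.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: play the machine that looks best so far.
            arm = max(range(n_arms), key=lambda a: values[a])
        reward = 1 if rng.random() < payout_probs[arm] else 0
        counts[arm] += 1
        # Incremental update of the average reward for this machine.
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return counts, values, total_reward


# Hypothetical machines with payout probabilities the player cannot see.
counts, values, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
```

A higher `epsilon` spends more pulls gathering information; a lower one commits sooner to whichever machine currently looks best.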

Read on for some info about the book. Near the end, Alex gives a link to where you can buy it, as well as where you can get a PDF copy for free.


Security Changes in ML Services

Dennes Torres goes over some of the security changes with Machine Learning Services in SQL Server 2019:

I have a confession to make. Why, in my last article about shortest_path in SQL Server 2019, did I use Gephi to illustrate the relationships, instead of using a script in R for the same purpose and demonstrating Machine Learning Services as well?

The initial plan was to use an R script; however, the R script which works perfectly in SQL Server 2017 doesn’t work in SQL Server 2019.

The change is a positive one from the standpoint of security, but it also makes life more difficult. I found this particularly tricky when installing TensorFlow and Keras in R via ML Services.


Fun with Regressions and the Zero Line

I have a post covering some important things to keep in mind when reviewing a regression:

The Line is NOT the Data

One of the worst things we can do as data analysts is to interpret a regression line as the most important thing on a visual. The important thing here is the per-state set of data points, but our eyes are drawn to the line. The line mentally replaces the data, but in doing so, we lose the noise. And boy, is there a lot of noise.

This was my first point, but I think it’s the most important one to keep in mind: just because we draw a line and there’s a best fit doesn’t mean that fit is actually any good. And if the fit isn’t any good, the line is…optimistic with regard to how informative it is.
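
Here is a small Python sketch of that point, using made-up numbers in place of the per-state data from the post. Least squares will always hand back a line, even when the line explains very little.

```python
# Hypothetical noisy data, invented for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 1, 7, 2, 8, 3, 9, 4]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for a single predictor.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# How much of the variation in y does the line account for?
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
```

The fit produces a clear upward slope, yet `r2` comes out below 0.1, so the line accounts for only a small fraction of the variation. Drawn on a chart, that line would look far more informative than it is.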


An Overview of Generative Adversarial Networks

Mohammad Waseem takes us through an overview of Generative Adversarial Networks:

Generative models are those models that use an Unsupervised Learning approach. In a generative model, the data contains samples of the input variables X but lacks the output variable Y. We use only the input variables to train the generative model, and it recognizes patterns in those inputs to generate an output that is unknown and based on the training data only.

In Supervised Learning, we are more aligned towards creating predictive models from the input variables; this type of modeling is known as discriminative modeling. In a classification problem, the model has to discriminate as to which class the example belongs to. On the other hand, unsupervised models are used to create or generate new examples in the input distribution.

To define generative models in layman’s terms, we can say that generative models are able to generate new examples from the sample that are not only similar to other examples but indistinguishable from them as well.

Click through for the overview.
