Category: Data Science

Today many data science (DS) organizations are accelerating the agile analytics development process using Databricks notebooks. Fully leveraging the distributed computing power of Apache Spark™, these organizations are able to interact easily with data at multi-terabytes scale, from exploration to fast prototype and all the way to productionize sophisticated machine learning (ML) models. As fast iteration is achieved at high velocity, what has become increasingly evident is that it is non-trivial to manage the DS life cycle for efficiency, reproducibility, and high-quality. The challenge multiplies in large enterprises where data volume grows exponentially, the expectation of ROI is high on getting business value from data, and cross-functional collaborations are common.
In this blog, we introduce a joint work with Iterable that hardens the DS process with best practices from software development. This approach automates building, testing, and deployment of DS workflow from inside Databricks notebooks and integrates fully with MLflow and Databricks CLI. It enables proper version control and comprehensive logging of important metrics, including functional and integration tests, model performance metrics, and data lineage. All of these are achieved without the need to maintain a separate build server.

Read on to see how.

Comments closed

Cluster-Based Image Analysis and Reduction

Published 2020-01-16 by Kevin Feasel

Sebastian Sauer takes an image and reduces it to a group of colors:

This post is a remake of this casestudy: https://fallstudien.netlify.com/fallstudie_bildanalyse/bildanalyse
brought to you by Karsten Lübke.
The main purpose is to replace the base R command that Karsten used with a more tidyverse-friendly style. I think that’s easier (for me).
We will compute a cluster analysis to find the typical RGB color per cluster.

Click through for quite a bit of R code and a couple interesting turns.

Comments closed

Differences Between Kaggle and Real Life

Published 2020-01-13 by Kevin Feasel

Sergii Makarevych explains the differences between a Kaggle competition and a business-world data science project:

There are some very important differences between a Kaggle competition and real-life project which beginner Data Scientists should know about. Kaggle creates a fantastic competition spirit. Its leaderboard drives people to deliver better and better solutions pushing accuracy to the limit. Kaggle’s Notebooks and Discussions make it easy to share knowledge and learn. However real-life projects are somewhat different. I hope this article will be helpful for people who consider moving into Data Science starting with Kaggle competitions. I remember I was a little bit overwhelmed when on my first real-life project all the models, that typically worked well on Kaggle, miserably failed. I wish I was prepared for this.

It’s a sensible list of differences. Kaggle emphasizes one part of the data science process, but businesses end up needing the whole thing.

Comments closed

Evaluating Classification Models

Published 2020-01-03 by Kevin Feasel

Dan Fitton takes us through some of the useful techniques and measures for evaluating classification models:

The confusion matrix is perhaps the most important thing to look at when evaluating a classification model. It contains a large amount of insight for such a small sized table. Despite its name, the confusion matrix is actually quite simple. It is a matrix that visualises the count of actual class instances against predicted class instances. This allows you to quickly see the amount of correct and incorrect predictions for each category, and whether any bias exists, and if so, where it is.

The example is specifically around Azure ML, but applies across the board. I think people get a little bit too hung up on accuracy and forget about important measures like positive and negative predictive value.

Comments closed

NFL Kicker Quality

Published 2019-12-11 by Kevin Feasel

Jacob Long has an outstanding pair of posts on evaluating kickers in the NFL. FIrst up is the analysis itself:

Justin Tucker is so great that, quite frankly, it doesn’t matter which metric you use. PAA, FG% – eFG%, or just plain old FG%, he’s unlike anyone else in the past 10 years. Given the well-documented trend of increasing kicker accuracy in the NFL, I think Tucker has a solid claim on being the greatest kicker of all time.
Even with fewer seasons than many of his competitors, his PAA are double all the others who kicked in the past 10 years. He had a slightly more difficult than average set of attempts but made a higher percentage of his attempts than anyone who has had more than 22 tries. Good luck trying to find any defect in Tucker’s record.

Jacob then covers the method in detail:

Pasteur and Cunningham-Rhoads — I’ll refer to them as PC-R for short — gathered more data than most predecessors, particularly in terms of auxiliary environmental info. They have wind, temperature, and presence/absence of precipitation. They show fairly convincingly that while modeling kick distance is the most important thing, these other factors are important as well. PC-R also find the cardinal direction of every NFL stadium (i.e., does it run north-south, east-west, etc.) and use this information along with wind direction data to assess the presence of cross-winds, which are perhaps the trickiest for kickers to deal with. They can’t know about headwinds/tailwinds because as far as they (and I) can tell, nobody bothers to record which end zone teams defend at the game’s coin toss, so we don’t know without looking at video which direction the kick is going. They ultimately combine the total wind and the cross wind, suggesting they have some meaningful measurement error that makes them not accurately capture all the cross-winds. Using their logistic regressions that factor for these several factors, they calculate an eFG% and use it and its derivatives to rank the kickers.

Those wind factors make certain stadiums like New Era Field (where Buffalo plays) tricky: it’s fun to see two flags right next to each other pointing in opposite directions, or the flags on the field goal posts pointing hard right, then switching to hard left, then switching back to hard right over the course of a field goal try. H/T R-Bloggers

Comments closed

Time Series Anomaly Detection with Power BI

Published 2019-12-09 by Kevin Feasel

Leila Etaati takes us through time series anomaly detection with Cognitive Services and Power Query:

I am excited about this blog post, this is based on the New service in Cognitive Service name “Anomaly Detection” which is now in Preview.
I recorded a video about how it works in cognitive service https://youtu.be/7ZOtZDbn6gM.
However, I am going to talk about how to use it in Power BI. In this post first, a brief introduction to the anomaly detection will be presented, then how it can be used inside Power BI will be discussed.

It sounds like there are still some rough edges, but they already have the makings of an interesting service.

Comments closed

Automated ML Pipelines with SAS

Published 2019-12-04 by Kevin Feasel

Sophia Rowland shows off SAS’s auto-ML action:

The dsAutoMl action does it all. It will explore your data, generate features, select features, create models, and autotune the hyper-parameters of those models. This action includes the four policies we have seen in my first two blogs: explorationPolicy, screenPolicy, transformationPolicy, and selectionPolicy. Please review my previous blogs if you need a refresher on the data exploration and cleaning process or feature generation and selection process. The dsAutoMl action builds on our prior discussions through model generation and autotuning. A data scientist can choose to build several models such as decision trees, random forests, gradient boosting models, and neural networks. In addition, the data scientist can control which objective function to optimize for and the number of K-folds to use. The output of the dsAutoMl action includes information about the features generated, information on the model pipelines generated, and an analytic store file for generating the features with new data.

This is an area where several companies are investing a lot of money, trying to simplify the process of training models.

Comments closed

The Joy of Decision Trees

Published 2019-11-26 by Kevin Feasel

Tom Jordan explains how a simple set of “if” statements forms the basis of some powerful data science algorithms:

While it is true there are techniques in machine learning that required advanced maths knowledge, some of the most widely used approaches make use of knowledge given to every child at secondary school. The line of best fit, drawn by many a student in Year 8 Chemistry, can also be known by its alter-ego, linear regression, and see applications all over machine learning. Neural networks, central to some of the most cutting-edge applications, are formed of simple mathematical models consisting of some addition and multiplication.
A personal favourite technique, and the subject of this blog, is the humble decision tree, taught in schools all over the country. This blog will take a high-level look at the theory around decision trees, an extension using random forests, and the real-world applications of these techniques.

Read on for more.

Comments closed

Text Processing Tools and Methods

Published 2019-11-20 by Kevin Feasel

Ines Roldos takes us through several tools and techniques used in text processing:

Text processing is the process of analyzing and manipulating textual information. This includes extracting smaller bits of information from text (aka text extraction), assign values or tags depending on its content (aka text classification), or performing calculations that depend on the textual information.
Since we naturally communicate in words, not numbers, companies receive a lot of raw text data via emails, chat conversations, social media, and other channels. This unstructured data is filled with insights and opinions about different topics, products, and services, but companies first need to organize, sort, and measure textual data to get access to this valuable information. One way to process text data is manually, which has been the most popular method – up until now.

We’re still in the early days of text processing, but there have been some nice improvements over the past decade.

Comments closed

Cross-Validation Versus Regularization

Published 2019-11-18 by Kevin Feasel

Nina Zumel takes us through a pair of techniques for avoiding overfitting:

Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation?
The short answer: no, you shouldn’t. But as, we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.”

Definitely worth the read.

Comments closed