
Category: Data Science

Machine Learning From Kafka

Kai Waehner has a post covering a recent talk he did on using Kafka as a data source for neural networks:

This talk shows how to build Machine Learning models at extreme scale and how to productionize the built models in mission-critical real time applications by leveraging open source components in the public cloud. The session discusses the relation between TensorFlow and the Apache Kafka ecosystem – and why this is a great fit for machine learning at extreme scale.

The Machine Learning architecture includes: Kafka Connect for continuous high volume data ingestion into the public cloud, TensorFlow leveraging Deep Learning algorithms to build an analytic model on powerful GPUs, Kafka Streams for model deployment and inference in real time, and KSQL for real time analytics of predictions, alerts and model accuracy.

Sensor analytics for predictive alerting in real time is used as a real-world example from Internet of Things scenarios. A live demo shows the out-of-the-box integration and dynamic scalability of these components on Google Cloud.

Check out the slide deck as well for more details.


Overfitting With Polynomial Regression

Vincent Granville shows us a few problems with polynomial regression:

Even if the function to be estimated is very smooth, due to machine precision, only the first three or four coefficients can be accurately computed. With infinite precision, all coefficients would be correctly computed without over-fitting. We first explore this problem from a mathematical point of view in the next section, then provide recommendations for practical model implementations in the last section.

This is also a good read for professionals with a math background interested in learning more about data science, as we start with some simple math, then discuss how it relates to data science. Also, this is an original article, not something you will learn in college classes or data camps, and it even features the solution to a linear regression involving an infinite number of variables.

Granville’s point that overfitting is a relatively small concern is rather interesting. But the advice to avoid polynomial regression is generally pretty solid.
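One way to get a feel for the machine-precision problem is to look at the condition number of the design matrix as the polynomial degree grows. The sketch below is my own illustration in base R, not Granville's derivation; the 50 evenly spaced design points on [0, 1] and the chosen degrees are assumptions made purely for the example.

```r
# Hypothetical design points; any grid on a fixed interval shows the same trend.
x <- seq(0, 1, length.out = 50)
degrees <- c(2, 5, 10, 15)

# Condition number of the raw polynomial design matrix for each degree.
cond <- sapply(degrees, function(d) {
  X <- outer(x, 0:d, "^")      # columns 1, x, x^2, ..., x^d
  kappa(X, exact = TRUE)       # ratio of largest to smallest singular value
})

data.frame(degree = degrees, condition_number = cond)
```

As a rough rule of thumb, the larger the condition number, the more digits are lost when solving for the coefficients, so the higher-order terms are the first to become unreliable.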


Relationships Between Numerical Features

Stacia Varga continues her exploratory data analysis series using hockey data:

Let’s start with something easy and understandable to analyze. I put age on the horizontal axis and weight on the vertical axis. It’s a common practice to put an explanatory variable on the horizontal axis and a response variable on the vertical axis. In other words, I’m looking to see how an increase in age (explanation) affects – or not – weight (response) for all the hockey players in the current season, regardless of team.

If I put age on the horizontal axis – does this explain weight? Sort of – the combinations of age and weight have some groupings. It almost appears that there is a greater number of younger, heavier players than older, heavier players, but it’s hard to tell here how the age/weight combinations are distributed because I can’t see all the individual points.

Read the whole thing, while keeping in mind that correlation does not imply causation.


Toward Interpretable Machine Learning

Christoph Molnar shows off a couple of R packages which help interpret ML models:

Machine learning models repeatedly outperform interpretable, parametric models like the linear regression model. The gains in performance have a price: The models operate as black boxes which are not interpretable.

Fortunately, there are many methods that can make machine learning models interpretable. The R package iml provides tools for analysing any black box machine learning model:

  • Feature importance: Which were the most important features?
  • Feature effects: How does a feature influence the prediction? (Partial dependence plots and individual conditional expectation curves)
  • Explanations for single predictions: How did the feature values of a single data point affect its prediction? (LIME and Shapley value)
  • Surrogate trees: Can we approximate the underlying black box model with a short decision tree?

The iml package works for any classification and regression machine learning model: random forests, linear models, neural networks, xgboost, etc.

This is a must-read if you’re getting into model-building. H/T R-Bloggers


Tidy Anomaly Detection With Anomalize

Abdul Majed Raja walks us through an example using the anomalize package:

One of the important things to do with time series data before starting on forecasting or modelling is time series decomposition, where the data is decomposed into seasonal, trend and remainder components. anomalize has a function, time_decompose(), to do this. Once the components are decomposed, anomalize can detect and flag anomalies in the remainder component, which can then be visualized with plot_anomaly_decomposition().

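library(dplyr)      # provides the %>% pipe
library(anomalize)

# btc_ts is assumed to be the time-indexed price table built earlier in the original post
btc_ts %>%
  time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  plot_anomaly_decomposition()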

As you can see from the above code, the decomposition uses the ‘stl’ method, which is the common method of time series decomposition. If you have been using Twitter’s AnomalyDetection, the same can be implemented in anomalize by combining time_decompose(method = "twitter") with anomalize(method = "gesd"). The ‘stl’ method of decomposition can also be combined with anomalize(method = "iqr") for IQR-based anomaly detection.
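To make the decompose-then-flag idea concrete without any packages, here is a minimal base R sketch. It is a stand-in for what anomalize does, not its implementation: the synthetic monthly series, the injected spike at position 60, and the conventional 1.5 × IQR fence are all assumptions for illustration, and anomalize's own thresholds may differ.

```r
set.seed(42)

# Synthetic monthly series: trend + yearly seasonality + noise (all hypothetical).
n <- 120
t <- seq_len(n)
x <- 10 + 0.05 * t + 2 * sin(2 * pi * t / 12) + rnorm(n, sd = 0.3)
x[60] <- x[60] + 8                     # inject one obvious anomaly

# Step 1: STL decomposition into seasonal, trend and remainder components.
fit <- stl(ts(x, frequency = 12), s.window = "periodic")
remainder <- as.numeric(fit$time.series[, "remainder"])

# Step 2: flag remainder values outside the IQR fences (Tukey's rule, k = 1.5).
q <- unname(quantile(remainder, c(0.25, 0.75)))
k <- 1.5
lower <- q[1] - k * (q[2] - q[1])
upper <- q[2] + k * (q[2] - q[1])
is_anomaly <- remainder < lower | remainder > upper

which(is_anomaly)                      # positions flagged as anomalous
```

Working on the remainder rather than the raw series is the point: the trend and seasonal swings are removed first, so only values that the regular structure cannot explain are candidates for flagging.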

Read on to see what else you can do with anomalize.


Uploading Data Sets To Azure ML From R

Leila Etaati continues her series on the Azure ML R package by showing how to upload a data set:

There is a function in the AzureML package named “workspace” that creates a reference to an AzureML Studio workspace from the authentication token and workspace ID.

To work with other AzureML functions, you need to pass this object to them.

For instance, to explore all the experiments in Azure ML, there is a function named “experiments” that takes the “ws” object as input to connect to the desired Azure ML environment, along with a filter.

Click through for more.


The Theory Behind ARIMA

Bidyut Ghosh explains how the ARIMA forecasting method works:

The earlier models of time series are based on the assumptions that the time series variable is stationary (at least in the weak sense).

But in practice, most time series variables are non-stationary in nature, and they are integrated series.

This implies that you need to take either the first or second difference of a non-stationary time series to convert it into a stationary one.
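The differencing step is easy to check in base R. This is a small sketch of my own, not code from Bidyut's post: the white-noise shocks and series names are made up for the example. A series built by summing stationary shocks once is integrated of order 1, and summing twice gives order 2; differencing the same number of times recovers the shocks.

```r
set.seed(1)
shocks <- rnorm(200)                   # stationary white noise

# Integrated of order 1: a random walk, the running sum of the shocks.
walk_i1 <- cumsum(shocks)
# Integrated of order 2: the running sum of that random walk.
walk_i2 <- cumsum(walk_i1)

# Each difference undoes one level of integration.
d1 <- diff(walk_i1)                    # first difference
d2 <- diff(walk_i2, differences = 2)   # second difference

all.equal(d1, shocks[-1])              # TRUE
all.equal(d2, shocks[-(1:2)])          # TRUE
```

The number of differences needed is the middle term, d, in the order = c(p, d, q) argument of R's arima() function.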

Bidyut ends with a little bit of implementation in R, but I’d guess that’ll be the focus of part 2.


Building A Model: Lumping And Splitting

Anna Schneider and Alex Smolyanskaya explain some of the tradeoffs between lumping groups together and splitting them out when it comes to algorithm selection:

At Stitch Fix, individual personalization is in our DNA: every client is unique and every piece of clothing we send is chosen to be just right. When we buy merchandise, we could choose to lump clients together; algorithms trained on lumped data would steer us toward that little black dress or those comfy leggings that delight a core, modal group of clients. Yet when we split clients into narrower segments and focus on the tails of the distribution, the algorithms have the chance to also tease out that sleek pinstripe blazer or that pair of distressed teal jeans that aren’t right for everyone, but just right for someone. As long as we don’t split our clients so finely that we’re in danger of overfitting, and as long as humans can still understand the algorithm’s recommendations, splitting is the way to go.

In other cases lumping can provide action-oriented clarity for human decision-makers. For example, we might lump clients into larger groups when reporting on business growth for a crisp understanding of holistic business health, even if our models forecast that growth at the level of finer client splits.

Read on and check out their useful chart for figuring out whether lumping or splitting is the better idea for you.


Precision And Recall

Brian Lee Yung Rowe makes the important point that model accuracy is not always the ultimate measure:

Now, AI companies are obliged to tell you how great their model is. They may say something like “our model is 95% accurate”. Zowee! But what does this mean exactly? In terms of binary classification it means that the model chose the correct class 95% of the time. This seems pretty good, so what’s the problem?

Suppose I create an AI that guesses the gender of a technical employee at Facebook. As of 2017, 19% of STEM roles are held by women. Behind the scenes, my model is really simple: it just chooses male every time (bonus question: is this AI?). Because of the data, my model will be 81% accurate. Now 95% doesn’t seem all that impressive. This dataset is known to be unbalanced, because the classes are not proportional. A better dataset would have about 50% women and 50% men. So asking if a dataset is balanced helps to identify some tricks that make models appear smarter than they are.

With wildly unbalanced data (like diagnosing rare diseases), measures like positive predictive value are far more important than overall accuracy.
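The always-guess-male example is quick to reproduce. The sketch below assumes a hypothetical population of 100 employees split 19/81 to match the quoted figure, and treats "female" as the positive class; both choices are mine, for illustration only.

```r
# Hypothetical population of 100 technical employees: 19 women, 81 men.
actual <- c(rep("female", 19), rep("male", 81))

# The "model" from the quote: predict male every time.
predicted <- rep("male", 100)

accuracy <- mean(predicted == actual)                  # 0.81

# Confusion counts with "female" as the positive class.
tp <- sum(predicted == "female" & actual == "female")  # 0
fp <- sum(predicted == "female" & actual == "male")    # 0
fn <- sum(predicted == "male"   & actual == "female")  # 19

recall    <- tp / (tp + fn)   # 0: the model never finds a woman
precision <- tp / (tp + fp)   # NaN: no positive predictions, so 0 / 0
```

Accuracy looks respectable at 81%, while recall is zero and precision (positive predictive value) is not even defined, which is exactly why those measures matter more on unbalanced data.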


A Basic Explanation Of Associative Rule Learning

Akshansh Jain has some notes on associative rules:

Support tells us how frequent an item, or an itemset, is in all of the data. It basically tells us how popular an itemset is in the given dataset. For example, in the above-given dataset, if we look at Learning Spark, we can calculate its support by taking the number of transactions in which it has occurred and dividing it by the total number of transactions.

Support{Learning Spark} = 4/5
Support{Programming in Scala} = 2/5
Support{Learning Spark, Programming in Scala} = 1/5

Support tells us how important or interesting an itemset is, based on its number of occurrences. This is an important measure because real data can contain millions or billions of records, and working on every itemset is pointless: if, across millions of purchases, a single user buys Programming in Scala and a cooking book, that itemset is of no interest to us.
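The support calculation takes only a few lines of base R. The excerpt does not include the original dataset, so the five transactions below are hypothetical, chosen only so that they reproduce the 4/5, 2/5 and 1/5 figures quoted above; the support() helper is likewise my own name for the example.

```r
# Hypothetical transactions, constructed to match the quoted support values.
transactions <- list(
  c("Learning Spark", "Programming in Scala"),
  c("Learning Spark"),
  c("Learning Spark", "Some Other Book"),
  c("Learning Spark"),
  c("Programming in Scala")
)

# Support = share of transactions that contain every item in the itemset.
support <- function(itemset, transactions) {
  contains <- vapply(transactions,
                     function(basket) all(itemset %in% basket),
                     logical(1))
  mean(contains)
}

support("Learning Spark", transactions)                             # 0.8
support("Programming in Scala", transactions)                       # 0.4
support(c("Learning Spark", "Programming in Scala"), transactions)  # 0.2
```

In practice a minimum-support threshold is used to discard rare itemsets early, which is what keeps the search manageable on large transaction logs.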

Read the whole thing.
