Press "Enter" to skip to content

Category: Machine Learning

Scalable Anomaly Detection with Kafka and Cassandra

Paul Brebner wraps up a series on anomaly detection at scale:

The complete machine for the biggest result (48 Cassandra nodes) has 574 cores in total.  This is a lot of cores! Managing the provisioning and monitoring of this sized system by hand would be an enormous effort. With the combination of the Instaclustr managed Cassandra and Kafka clusters (automated provisioning and monitoring), and the Kubernetes (AWS EKS) managed cluster for the application deployment it was straightforward to spin up clusters on demand, run the application for a few hours, and delete the resources when finished for significant cost savings. Monitoring over 100 Pods running the application using the Prometheus Kubernetes operator worked smoothly and gave enhanced visibility into the application and the necessary access to the benchmark metrics for tuning and reporting of results.

The system (irrespective of size) was delivering an approximately constant 400 anomaly checks per second per core.

This is a good summary of what was an interesting series.

Comments closed

Sentiment Analysis with Spark on Qubole

Jonathan Day, et al, have a tutorial on using Qubole to build a sentiment analysis model:

This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.

In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.

Click through for the demo.

Comments closed

Permissions Requirements for ML Services

Niels Berglund looks at the permissions required to create external libraries with SQL Server Machine Learning Services:

This post is the fourth in a series about installing R packages in SQL Server Machine Learning Services (SQL Server ML Services). To see all posts in the series go to Install R Packages in SQL Server ML Services Series.

Why this series came about is a colleague of mine Dane pinged me and asked if I had any advice as he had issues installing an R package into one of their SQL Server instances. I tried to help him and then thought it would make a good topic for a blog post. Of course, at that time I didn’t think it would be more posts than one, but here we are.

These permissions are a bit more complicated than they might first appear to be.

Comments closed

K-Nearest Neighbors in Python

Hardik Jaroli shows how to use the k-Nearest Neighbors algorithm using scikit-learn:

K Nearest Neighbors is a classification algorithm that operates on a very simple principle. It is best shown through example! Imagine we had some imaginary data on Dogs and Horses, with heights and weights.

Training Algorithm:
1. Store all the Data

Prediction Algorithm:
1.Calculate the distance from x to all points in your data
2. Sort the points in your data by increasing distance from x
3. Predict the majority label of the “k” closest points

Comments closed

Learning with Limited Data

Shioulin Sam and Nisha Muktewar have new research on machine learning when getting labeled data is time-consuming or difficult:

We are excited to release Learning with Limited Labeled Data, the latest report and prototype from Cloudera Fast Forward Labs.

Being able to learn with limited labeled data relaxes the stringent labeled data requirement for supervised machine learning. Our report focuses on active learning, a technique that relies on collaboration between machines and humans to label smartly.

Active learning makes it possible to build applications using a small set of labeled data, and enables enterprises to leverage their large pools of unlabeled data. In this blog post, we explore how active learning works. (For a higher level introduction, please see our previous blogpost.

The research itself is behind a paywall but you can see their write-up to get an idea of the topic.

Comments closed

Python Natural Language Processing Tools

Sandeep Aspari takes us through some of the tooling available in Python around Natural Language Processing:

TextBlob
TextBlob is a python library tool and extension of NLTK. It provides a simple API approach to its methods and executes a large number of NLTK functions, and it also includes the pattern library functionality. You are just at the beginning, this might be an excellent tool to learning, and we can use it in applications production those don’t require heavy performant. TextBlob libraries are similar to python strings, so we can quickly transform and play similarly we performed in python. Finally, TextBlob is used in everywhere, and it is best suitable for smaller projects.

There are several tools from which you can choose. Sandeep also gives us some Node- and Java-based tools as well.

Comments closed

Power BI AutoML

Teo Lachev takes a look at AutoML in Power BI:

Let’s see how AutoML works based on what’s in the private preview (the usual disclaimer is that things will probably change). To start with, AutoML requires a dataflow (a note to Microsoft here is that AutoML will become more pervasive if it’s available in Power BI Desktop and it doesn’t require a premium capacity). In the private preview, AutoML requires the following steps. Presumably. the first (and most difficult step), preparing the dataset and cleansing the data is already done and available as a dataflow entity:

It looks like Microsoft’s taking what they learned from Azure ML and trying to port it over to Power BI.

Comments closed

Using Convolutional Neural Networks To Recognize Features In Images

Michael Grogan shows how you can use Keras to perform image recognition with a convolutional neural network:

VGG16 is a built-in neural network in Keras that is pre-trained for image recognition.

Technically, it is possible to gather training and test data independently to build the classifier. However, this would necessitate at least 1,000 images, with 10,000 or greater being preferable.

In this regard, it is much easier to use a pre-trained neural network that has already been designed for image classification purposes.

This is probably the best generally available technique for image classification.

Comments closed

Native Math Libraries And Spark ML

Zuling Kang shares with us how we can use native math libraries in netlib-java to speed up certain machine learning algorithms in Apache Spark:

Spark’s MLlib uses the Breeze linear algebra package, which depends on netlib-java for optimized numerical processing.  netlib-java is a wrapper for low-level BLASLAPACK, and ARPACK libraries. However, due to licensing issues with runtime proprietary binaries, neither the Cloudera distribution of Spark nor the community version of Apache Spark includes the netlib-java native proxies by default. So without manual configuration, netlib-java only uses the F2J library, a Java-based math library that is translated from Fortran77 reference source code.

To check whether you are using native math libraries in Spark ML or the Java-based F2J, use the Spark shell to load and print the implementation library of netlib-java. The following commands return information on the BLAS library and include that it is using F2J in the line, “com.github.fommil.netlib.F2jBLAS,” which is highlighted below:

In the examples here, you can get about a 2x difference using the native math libraries versus without, so although that’s not an order of magnitude difference, it’s still nothing to sneeze at.

Comments closed

No-Code ML On Cloudera Data Science Workbench

Tim Spann has a post covering ML on the Cloudera Data Science Workbench:

Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP (https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_hdp.html),  but it will work for all CDSW regardless of install type.
In my simple example, I built a Python model that uses TextBlob to run sentiment analysis against a passed-in sentence. It returns Sentiment Polarity and Subjectivity, which we can immediately act upon in our flow.
CDSW is extremely easy to work with and I was up and running in a few minutes. For my model, I created a python 3 script and a shell script for install details. Both of these artifacts are available here: https://github.com/tspannhw/nifi-cdsw.

The “no code” portion was less interesting to me than the scalable ML portion, as “no code” either drops into tedium or ends up being replaced by code.

Comments closed