Category: Machine Learning

Power BI AutoML

Published 2019-03-06 by Kevin Feasel

Teo Lachev takes a look at AutoML in Power BI:

Let’s see how AutoML works based on what’s in the private preview (the usual disclaimer is that things will probably change). To start with, AutoML requires a dataflow (a note to Microsoft here is that AutoML will become more pervasive if it’s available in Power BI Desktop and it doesn’t require a premium capacity). In the private preview, AutoML requires the following steps. Presumably. the first (and most difficult step), preparing the dataset and cleansing the data is already done and available as a dataflow entity:

It looks like Microsoft’s taking what they learned from Azure ML and trying to port it over to Power BI.

Comments closed

Using Convolutional Neural Networks To Recognize Features In Images

Published 2019-02-19 by Kevin Feasel

Michael Grogan shows how you can use Keras to perform image recognition with a convolutional neural network:

VGG16 is a built-in neural network in Keras that is pre-trained for image recognition.
Technically, it is possible to gather training and test data independently to build the classifier. However, this would necessitate at least 1,000 images, with 10,000 or greater being preferable.
In this regard, it is much easier to use a pre-trained neural network that has already been designed for image classification purposes.

This is probably the best generally available technique for image classification.

Comments closed

Native Math Libraries And Spark ML

Published 2019-02-14 by Kevin Feasel

Zuling Kang shares with us how we can use native math libraries in netlib-java to speed up certain machine learning algorithms in Apache Spark:

Spark’s MLlib uses the Breeze linear algebra package, which depends on netlib-java for optimized numerical processing. netlib-java is a wrapper for low-level BLAS, LAPACK, and ARPACK libraries. However, due to licensing issues with runtime proprietary binaries, neither the Cloudera distribution of Spark nor the community version of Apache Spark includes the netlib-java native proxies by default. So without manual configuration, netlib-java only uses the F2J library, a Java-based math library that is translated from Fortran77 reference source code.
To check whether you are using native math libraries in Spark ML or the Java-based F2J, use the Spark shell to load and print the implementation library of netlib-java. The following commands return information on the BLAS library and include that it is using F2J in the line, “com.github.fommil.netlib.F2jBLAS,” which is highlighted below:

In the examples here, you can get about a 2x difference using the native math libraries versus without, so although that’s not an order of magnitude difference, it’s still nothing to sneeze at.

Comments closed

No-Code ML On Cloudera Data Science Workbench

Published 2019-02-12 by Kevin Feasel

Tim Spann has a post covering ML on the Cloudera Data Science Workbench:

Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP (https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_hdp.html), but it will work for all CDSW regardless of install type.
In my simple example, I built a Python model that uses TextBlob to run sentiment analysis against a passed-in sentence. It returns Sentiment Polarity and Subjectivity, which we can immediately act upon in our flow.
CDSW is extremely easy to work with and I was up and running in a few minutes. For my model, I created a python 3 script and a shell script for install details. Both of these artifacts are available here: https://github.com/tspannhw/nifi-cdsw.

The “no code” portion was less interesting to me than the scalable ML portion, as “no code” either drops into tedium or ends up being replaced by code.

Comments closed

codecentric.ai Bootcamp

Published 2019-02-11 by Kevin Feasel

Shirin Glander announces a free German-language bootcamp:

This bootcamp is a free online course for everyone who wants to learn hands-on machine learning and AI techniques, from basic algorithms to deep learning, computer vision and NLP. However, the course language is German only, but for every chapter I did, you will find an English R-version here on my blog (see below for links).
Right now, the course is in beta phase, so we are happy about everyone who tests our content and leaves feedback. Also, not the entire curriculum is finished yet, we will update and extend the course during the next months. If there are specific topics you’d like to have us cover, just let us know!

If you understand German and want to learn about data science, check this out and leave feedback.

Comments closed

Gartner Advanced Analytics Magic Quadrant Updates

Published 2019-02-05 by Kevin Feasel

William Vorhies summarizes the changes to the Gartner Advanced Analytics magic quadrant:

The Gartner Magic Quadrant for Data Science and Machine Learning Platforms is just out and once again there are big changes in the leaderboard. Say what you will about our profession but as a platform developer you certainly can’t rest on your laurels. Some traditional leaders have fallen (SAS, KNIME, H2Oai, IBM) and some challengers have risen (Alteryx, TIBCO, RapidMiner).

Databricks is making a big push and there’s more movement than usual in this year’s chart. Check it out.

Comments closed

Preparing Text Data For Natural Language Processing

Published 2019-01-30 by Kevin Feasel

Shirin Glander takes us through the process of preparing natural language data for machine learning using Keras:

As with any neural network, we need to convert our data into a numeric format; in Keras and TensorFlow we work with tensors. The IMDB example data from the keras package has been preprocessed to a list of integers, where every integer corresponds to a word arranged by descending word frequency.
So, how do we make it from raw text to such a list of integers? Luckily, Keras offers a few convenience functions that make our lives much easier.

This is a very nice tutorial if you’re new to the process.

Comments closed

Analytical Pipelines In R With H2O And AWS

Published 2019-01-29 by Kevin Feasel

Hanjo Oden wraps up a series on training models on AWS using H2O in R:

To generate these, you can log into your AWS dashboard, go to the IAM (Identity and Access Management) dashboard and select the Users tab. On the Userstab, add a user and also the administration rights that you want the user to have.Remember to restart R once you have filled in the access key information in the .Renviron file for it to take effect.
At this point, those familiar with cloudyr suite is probably asking – “This is exactly the same as library(aws.ec2), so why use boto3?“. Well, to be honest, I was using aws.ec2 for a while, but I find spot-instances, which the current version of aws.ec2 does not support. In addition I found that boto3 has some other functionalitue – which I prefer. For a full list of boto3 functions to interact with an EC2 instance, have a look at the reference manual.

It’s pretty good stuff; check it out.

Comments closed

Genetic Algorithms In R

Published 2019-01-24 by Kevin Feasel

Pablo Casas touches on one of my favorite lost causes:

In machine learning, one of the uses of genetic algorithms is to pick up the right number of variables in order to create a predictive model.
To pick up the right subset of variables is a problem of combinatory logic and optimization.
The advantage of this technique over others is that it allows the best solution to emerge from the best of the prior solutions. An evolutionary algorithm which improves the selection over time.
The idea of GA is to combine the different solutions generation after generation to extract the best genes (variables) from each one. That way it creates new and more fit individuals.
We can find other uses of GA such as hyper-tunning parameters, finding the maximum (or minimum) of a function, or searching for the correct neural network architecture (neuroevolution), among others.

I’ve seen a few people use genetic algorithms in the past decade, but usually for hyperparameter tuning rather than as a primary algorithm. It was always the “algorithm of last resort” even before neural networks took over the industry, but if you want to spend way too much time on the topic, I have a series. If you have too much time on your hands and meet me in person, ask about my thesis.

Comments closed

Adding ML Services On Windows Server Core

Published 2019-01-23 by Kevin Feasel

Kevin Chant shows us how to add SQL Server ML Services to an already-existing SQL Server installation on Windows Server Core:

It’s important to try and use an install set that is the same level of Service pack as your current install. Otherwise, you could end up installing multiple patches to get the SQL Launchpad service to work. Which is something discussed in a previous post here.
I know some companies have a central installer for SQL Server and then have all the updates in another location. Hence, if you are in such an environment be prepared to run multiple updates from that location after the install.

This is definitely one of the features which is easier to install from the beginning than to install after the fact.

Comments closed