Spark’s MLlib uses the Breeze linear algebra package, which depends on netlib-java for optimized numerical processing. netlib-java is a wrapper for low-level BLAS, LAPACK, and ARPACK libraries. However, due to licensing issues with runtime proprietary binaries, neither the Cloudera distribution of Spark nor the community version of Apache Spark includes the netlib-java native proxies by default. So without manual configuration, netlib-java only uses the F2J library, a Java-based math library that is translated from Fortran77 reference source code.
To check whether Spark ML is using native math libraries or the Java-based F2J, use the Spark shell to load netlib-java and print its implementation library. The following commands return information on the BLAS library; the line “com.github.fommil.netlib.F2jBLAS,” highlighted below, shows that it is using F2J:
In the examples here, you can get about a 2x difference using the native math libraries versus without, so although that’s not an order of magnitude difference, it’s still nothing to sneeze at.
Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP (https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_hdp.html), but it will work for all CDSW regardless of install type.
In my simple example, I built a Python model that uses TextBlob to run sentiment analysis against a passed-in sentence. It returns Sentiment Polarity and Subjectivity, which we can immediately act upon in our flow.
CDSW is extremely easy to work with and I was up and running in a few minutes. For my model, I created a Python 3 script and a shell script for install details. Both of these artifacts are available here: https://github.com/tspannhw/nifi-cdsw.
The “no code” portion was less interesting to me than the scalable ML portion, as “no code” either drops into tedium or ends up being replaced by code.
This bootcamp is a free online course for everyone who wants to learn hands-on machine learning and AI techniques, from basic algorithms to deep learning, computer vision and NLP. The course language is German only, but for every chapter I did, you will find an English R-version here on my blog (see below for links).
Right now, the course is in beta phase, so we are happy about everyone who tests our content and leaves feedback. The curriculum is not entirely finished yet; we will update and extend the course during the next months. If there are specific topics you’d like to have us cover, just let us know!
If you understand German and want to learn about data science, check this out and leave feedback.
The Gartner Magic Quadrant for Data Science and Machine Learning Platforms is just out and once again there are big changes in the leaderboard. Say what you will about our profession, but as a platform developer you certainly can’t rest on your laurels. Some traditional leaders have fallen (SAS, KNIME, H2O.ai, IBM) and some challengers have risen (Alteryx, TIBCO, RapidMiner).
Databricks is making a big push and there’s more movement than usual in this year’s chart. Check it out.
As with any neural network, we need to convert our data into a numeric format; in Keras and TensorFlow we work with tensors. The IMDB example data from the keras package has been preprocessed to a list of integers, where every integer corresponds to a word arranged by descending word frequency.
So, how do we make it from raw text to such a list of integers? Luckily, Keras offers a few convenience functions that make our lives much easier.
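To make the idea concrete, here is a minimal pure-Python sketch of the mapping described above: words are ranked by descending frequency and each text becomes a list of those ranks. It is a stand-in for illustration, not Keras’s own implementation; the helper names `fit_word_index` and `texts_to_sequences` and the sample reviews are assumptions made up for this example.

```python
from collections import Counter
import re


def fit_word_index(texts):
    """Map each word to an integer rank, where 1 is the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # Sort by descending count; break ties alphabetically so results are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer; unknown words are dropped."""
    sequences = []
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences


# Hypothetical sample data, standing in for raw review text.
reviews = ["The movie was great", "The movie was terrible", "Great acting"]
word_index = fit_word_index(reviews)
sequences = texts_to_sequences(reviews, word_index)
print(word_index)
print(sequences)
```

Frequent words get small integers, which is what makes it easy to keep only the top N words of a vocabulary later on.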
This is a very nice tutorial if you’re new to the process.
To generate these, you can log into your AWS dashboard, go to the IAM (Identity and Access Management) dashboard and select the Users tab. On the Users tab, add a user and also the administration rights that you want the user to have. Remember to restart R once you have filled in the access key information in the .Renviron file for it to take effect.
At this point, those familiar with the cloudyr suite are probably asking, “This is exactly the same as library(aws.ec2), so why use boto3?” Well, to be honest, I was using aws.ec2 for a while, but I find spot instances useful, which the current version of aws.ec2 does not support. In addition, I found that boto3 has some other functionality that I prefer. For a full list of boto3 functions to interact with an EC2 instance, have a look at the reference manual.
It’s pretty good stuff; check it out.
In machine learning, one of the uses of genetic algorithms is to pick the right number of variables in order to create a predictive model.
Picking the right subset of variables is a problem of combinatorics and optimization.
The advantage of this technique over others is that it allows the best solution to emerge from the best of the prior solutions. It is an evolutionary algorithm that improves the selection over time.
The idea of GA is to combine the different solutions generation after generation to extract the best genes (variables) from each one. That way it creates new and more fit individuals.
We can find other uses of GA such as hyperparameter tuning, finding the maximum (or minimum) of a function, or searching for the correct neural network architecture (neuroevolution), among others.
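The variable-selection idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the linked author’s implementation: each individual is a 0/1 mask over ten hypothetical features, the `INFORMATIVE` set and the `fitness` score are invented for illustration (a real version would score a mask by training and validating a model), and the crossover, mutation, and elitism settings are arbitrary choices.

```python
import random

N_FEATURES = 10
INFORMATIVE = {0, 3, 7}  # hypothetical: the features that actually help


def fitness(mask):
    """Toy score: +1 per informative feature kept, -0.2 per feature used."""
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.2 * sum(mask)


def crossover(a, b, rng):
    """Single-point crossover: the child takes a's head and b's tail."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(mask, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in mask]


def evolve(generations=30, pop_size=20, seed=42):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(N_FEATURES)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # the fitter half may breed
        children = [population[0]]             # elitism: best survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=fitness), history


best_mask, history = evolve()
print("selected features:", [i for i, bit in enumerate(best_mask) if bit])
print("best fitness per generation:", history)
```

Because the best individual is carried over unchanged each generation, the best fitness can never get worse, which is the “improves the selection over time” property in miniature.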
I’ve seen a few people use genetic algorithms in the past decade, but usually for hyperparameter tuning rather than as a primary algorithm. It was always the “algorithm of last resort” even before neural networks took over the industry, but if you want to spend way too much time on the topic, I have a series. If you have too much time on your hands and meet me in person, ask about my thesis.
It’s important to try to use an install set that is at the same service pack level as your current install. Otherwise, you could end up installing multiple patches to get the SQL Launchpad service to work, which is something discussed in a previous post here.
I know some companies have a central installer for SQL Server and then have all the updates in another location. Hence, if you are in such an environment be prepared to run multiple updates from that location after the install.
This is definitely one of the features which is easier to install from the beginning than to install after the fact.
Convolutional Neural Nets are usually abbreviated either CNNs or ConvNets. They are a specific type of neural network that has very particular differences compared to MLPs. Basically, you can think of CNNs as working similarly to the receptive fields of photoreceptors in the human eye. Receptive fields in our eyes are small connected areas on the retina where groups of many photoreceptors stimulate far fewer ganglion cells. Thus, each ganglion cell can be stimulated by a large number of receptors, so that a complex input is condensed into a compressed output before it is further processed in the brain.
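The receptive-field analogy can be shown with a small pure-Python sketch, assuming a square input and a square kernel. Each output value plays the role of one ganglion cell and sums a small patch of “receptors”. The function name `convolve2d`, the 4x4 input, and the all-ones kernel are illustrative assumptions; a real CNN learns its kernel weights and applies many kernels at once.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over a square `image`; each output sums one small patch."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = 0
            # Weighted sum over the k x k patch whose top-left corner
            # is at (r * stride, c * stride).
            for i in range(k):
                for j in range(k):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 input: 16 "receptors".
image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernel = [[1, 1], [1, 1]]  # each output simply adds up its 2x2 patch

condensed = convolve2d(image, kernel, stride=2)
print(condensed)  # [[4, 5], [4, 3]]
```

Sixteen inputs are condensed into four outputs, each of which only ever “sees” its own 2x2 patch.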
If you’re interested in understanding why a CNN will classify the way it does, chapter 5 of Deep Learning with R is a great reference.
What is Machine Learning (ML), and how does it differ from Statistics (and hence, implicitly, from Econometrics)?
Those are big questions, but I think that they’re ones that econometricians should be thinking about. And if I were starting out in Econometrics today, I’d take a long, hard look at what’s going on in ML.
Click through for some quick thoughts and several resources on the topic.