What is Machine Learning (ML), and how does it differ from Statistics (and hence, implicitly, from Econometrics)?
Those are big questions, but I think that they’re ones that econometricians should be thinking about. And if I were starting out in Econometrics today, I’d take a long, hard look at what’s going on in ML.
Click through for some quick thoughts and several resources on the topic.
H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, however, the latter follows the conventions of Apache Spark’s MLlib library and allows you to build machine learning pipelines that include MLlib transformers. We will focus on the latter API in this post.
H2OAutoML supports classification and regression. The ML models built and tuned by H2OAutoML include Random Forests, Gradient Boosting Machines, Deep Neural Nets, Generalized Linear Models, and Stacked Ensembles.
The post only has a few lines of code but there are a lot of working parts under the surface.
Last month, I delivered the one-day workshop Practical AI for the Working Software Engineer at the Artificial Intelligence Live conference in Orlando. As the title suggests, the workshop was aimed at developers, bu I didn’t assume any particular programming language background. In addition to the lecture slides, the workshop was delivered as a series of Jupyter notebooks. I ran them using Azure Notebooks (which meant the participants had nothing to install and very little to set up), but you can run them in any Jupyter environment you like, as long as it has access to R and Python. You can download the notebooks and slides from this Github repository (and feedback is welcome there, too).
Read on for details about those notebooks and to get your own copies.
When scoring Python models as Apache Spark UDFs, users can now filter UDF outputs by selecting from an expanded set of result types. For example, specifying a result type of
pyspark.sql.types.DoubleTypefilters the UDF output and returns the first column that contains double precision scalar values. Specifying a result type of
pyspark.sql.types.ArrayType(DoubleType)returns all columns that contain double precision scalar values. The example code below demonstrates result type selection using the
result_typeparameter. And the short example notebook illustrates Spark Model logged and then loaded as a Spark UDF.
Read on for a pretty long list of updates.
This is code that accompanies a book chapter on customer churn that I have written for the German dpunkt Verlag. The book is in German and will probably appear in February: https://www.dpunkt.de/buecher/13208/9783864906107-data-science.html.
The code you find below can be used to recreate all figures and analyses from this book chapter. Because the content is exclusively for the book, my descriptions around the code had to be minimal. But I’m sure, you can get the gist, even without the book. 😉
Click through for the code. This is using the venerable AT&T customer churn data set.
An image data source addresses many of these problems by providing the standard representation you can code against and abstracts from the details of a particular image representation.
Apache Spark 2.3 provided the ImageSchema.readImages API (see Microsoft’s post Image Data Support in Apache Spark), which was originally developed in the MMLSpark library. In Apache Spark 2.4, it’s much easier to use because it is now a built-in data source. Using the image data source, you can load images from directories and get a DataFrame with a single image column.
This blog post describes what an image data source is and demonstrates its use in Deep Learning Pipelines on the Databricks Unified Analytics Platform.
If you’re interested in working with convolutional neural networks or otherwise need to analyze image data, check it out.
What Are Convolutional Neural Networks?
Convolutional Neural Networks, like neural networks, are made up of neurons with learnable weights and biases. Each neuron receives several inputs, takes a weighted sum over them, pass it through an activation function and responds with an output.
The whole network has a loss function and all the tips and tricks that we developed for neural networks still apply on Convolutional Neural Networks.
Pretty straightforward, right?
Neural networks, as its name suggests, is a machine learning technique which is modeled after the brain structure. It comprises of a network of learning units called neurons.
These neurons learn how to convert input signals (e.g. picture of a cat) into corresponding output signals (e.g. the label “cat”), forming the basis of automated recognition.
Let’s take the example of automatic image recognition. The process of determining whether a picture contains a cat involves an activation function. If the picture resembles prior cat images the neurons have seen before, the label “cat” would be activated.
Hence, the more labeled images the neurons are exposed to, the better it learns how to recognize other unlabelled images. We call this the process of training neurons.
I (finally) finished chapter 5 of Deep Learning in R, which is all about CNNs. It’s interesting just how open CNNs are for post hoc understanding, totally at odds with the classic neural network reputation for being a black box full of dark magic.
My partner in crime Serge Luca aka Doctor Flow is the author of a nice and complex expenses approval system in Microsoft Flow .
One year ago, he asked me to add analytics to his Flow. This year he has the interesting idea to add a machine-learning based approval in his flow and suggest me to work on it. The idea is the following: Since we have a lot of approvals in our system, can a machine learn and found some decision pattern to apply automatically to each expenses request ?
I decided to use the Microsoft Azure Machine Learning Studio. In this tool you can build experiments and use some of the most common and useful machine learning algorithms. It was amazing to see how easy it is to create and consume machine learning .
This contrasts with Ginger Grant’s nightmare scenario pretty well: instead of trying to get the ML process to do all of the work, create a process which takes care of the really easy stuff and leave harder tasks to specialists with a deeper understanding of the rules. That way they don’t have to spend their time on trivialities.
Buying a laptop from Lenovo reminded me of an episode of Seinfeld when Elaine was trying to buy soup. For some unknown reason, when I placed an order on their website and gave them my money, Lenovo gave me a Cancellation Notice, the email equivalent of “No Soup for you!” After placing an order, about 15 minutes later, I received a cancellation notice. I called customer service. They looked at the order and advised me the systemincorrectly cancelled the order. I was told to place the order again as they had resolved the problem. I created a new order, and just like the last time, I received the No Laptop for You cancellation email. I called back. This time I was told that the system thinks I am a fraud. Now I have no laptop and I have been insulted.
In all the talk of ML running the future, one thing that gets forgotten is that models, being simplifications of reality, necessarily make mistakes. Failing to have some sort of manual override means, in this case, throwing away money for no good reason.
Reinforcement Learning (RL) is arguably the hottest research area in AI today because it appears RL can be adapted to any problem that has a well-defined reward function. That encompasses game play, robotics, self-driving cars, and frankly pretty much else in machine learning.
Within RL, the hottest research area is Deep RL which means using a deep neural net as the ‘agent’ in the training. Deep RL is seen as the form of RL with the most potential to generalize over the largest number of cases and perhaps the closest we’ve yet come to AGI (artificial general intelligence).
Importantly, Deep RL is also the technique used to win at Alpha Go which brought it huge attention.
The problem is, according to Alex Irpan, a researcher on the Google Brain Robotics team that about 70% of the time they just don’t work.
Alex has written a very comprehensive article critiquing the current state of Deep RL, the field with which he engages on a day-to-day basis. He lays out a whole series of problems and we’ve elected to focus on the three that most clearly illustrate the current state of the problem with notes from his work.
Vorhies is not unduly negative and is optimistic in the medium to long term, but he is right in noting that there is a lot of work yet to do in this field.