Where Machine Learning And Econometrics Collide

Dave Giles shares some thoughts on how machine learning and econometrics relate:

What is Machine Learning (ML), and how does it differ from Statistics (and hence, implicitly, from Econometrics)?

Those are big questions, but I think that they’re ones that econometricians should be thinking about. And if I were starting out in Econometrics today, I’d take a long, hard look at what’s going on in ML.

Click through for some quick thoughts and several resources on the topic.

Auto ML With SQL Server 2019 Big Data Clusters

Mario Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem:

H2O provides popular open source software for data science and machine learning on big data, including Apache Spark™ integration. It provides two open source Python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations; however, the latter follows the conventions of Apache Spark’s MLlib library and allows you to build machine learning pipelines that include MLlib transformers. We will focus on the latter API in this post.

H2OAutoML supports classification and regression. The ML models built and tuned by H2OAutoML include Random Forests, Gradient Boosting Machines, Deep Neural Nets, Generalized Linear Models, and Stacked Ensembles.

The post only has a few lines of code, but there are a lot of moving parts under the surface.

Practical AI Workshop Notebooks

David Smith has published a set of notebooks from the Practical AI for the Working Software Engineer workshop:

Last month, I delivered the one-day workshop Practical AI for the Working Software Engineer at the Artificial Intelligence Live conference in Orlando. As the title suggests, the workshop was aimed at developers, but I didn’t assume any particular programming language background. In addition to the lecture slides, the workshop was delivered as a series of Jupyter notebooks. I ran them using Azure Notebooks (which meant the participants had nothing to install and very little to set up), but you can run them in any Jupyter environment you like, as long as it has access to R and Python. You can download the notebooks and slides from this GitHub repository (and feedback is welcome there, too).

Read on for details about those notebooks and to get your own copies.

MLflow 0.8.1 Released

Aaron Davidson, et al., announce a new version of Databricks MLflow:

When scoring Python models as Apache Spark UDFs, users can now filter UDF outputs by selecting from an expanded set of result types. For example, specifying a result type of pyspark.sql.types.DoubleType filters the UDF output and returns the first column that contains double precision scalar values. Specifying a result type of pyspark.sql.types.ArrayType(DoubleType) returns all columns that contain double precision scalar values. The example code in the post demonstrates result type selection using the result_type parameter, and the short example notebook illustrates a Spark model being logged and then loaded as a Spark UDF.
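To make the filtering rule in that excerpt concrete, here is a minimal pure-Python sketch. It is not MLflow's implementation and does not call MLflow or Spark; the function names, the column-dictionary representation, and the "double" and "array<double>" strings are assumptions chosen only to mirror the described behavior (first double column versus all double columns).

```python
from typing import Dict, List

Column = List[object]


def is_double_column(values: Column) -> bool:
    """Return True when the column is non-empty and every value is a float."""
    return len(values) > 0 and all(type(v) is float for v in values)


def filter_by_result_type(columns: Dict[str, Column], result_type: str):
    """Mimic the described result-type filtering on a model's output columns.

    `columns` maps column name to values (all columns the same length).
    `result_type` is a hypothetical stand-in for the Spark types named above.
    """
    double_names = [name for name, vals in columns.items() if is_double_column(vals)]
    if not double_names:
        raise ValueError("model output has no double-precision columns")

    if result_type == "double":
        # Keep only the first column holding double-precision scalars.
        return list(columns[double_names[0]])

    if result_type == "array<double>":
        # Keep every double column, returned as one array per row.
        n_rows = len(columns[double_names[0]])
        return [[columns[name][i] for name in double_names] for i in range(n_rows)]

    raise ValueError(f"unsupported result type: {result_type}")


model_output = {
    "label": ["churn", "stay"],   # not a double column, always dropped
    "score": [0.9, 0.1],
    "margin": [1.5, -2.0],
}
first_double = filter_by_result_type(model_output, "double")
all_doubles = filter_by_result_type(model_output, "array<double>")
```

In this sketch the scalar result type discards everything after the first numeric column, which is why the array form is the safer choice when a model emits several scores per row.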

Read on for a pretty long list of updates.

Analyzing Customer Churn With Keras And H2O

Shirin Glander has released code pertaining to a forthcoming book chapter:

This is code that accompanies a book chapter on customer churn that I have written for the German dpunkt Verlag. The book is in German and will probably appear in February: https://www.dpunkt.de/buecher/13208/9783864906107-data-science.html.
The code you find below can be used to recreate all figures and analyses from this book chapter. Because the content is exclusively for the book, my descriptions around the code had to be minimal. But I’m sure you can get the gist, even without the book. 😉

Click through for the code.  This is using the venerable AT&T customer churn data set.

Working With Images In Spark 2.4

Tomas Nykodym and Weichen Xu give us an update on working with images in the most recent version of Apache Spark:

An image data source addresses many of these problems by providing the standard representation you can code against and abstracts from the details of a particular image representation.
Apache Spark 2.3 provided the ImageSchema.readImages API (see Microsoft’s post Image Data Support in Apache Spark), which was originally developed in the MMLSpark library. In Apache Spark 2.4, it’s much easier to use because it is now a built-in data source. Using the image data source, you can load images from directories and get a DataFrame with a single image column.
This blog post describes what an image data source is and demonstrates its use in Deep Learning Pipelines on the Databricks Unified Analytics Platform.

If you’re interested in working with convolutional neural networks or otherwise need to analyze image data, check it out.

Building A Convolutional Neural Network With TensorFlow

Anirudh Rao walks us through Convolutional Neural Networks in TensorFlow:

What Are Convolutional Neural Networks?

Convolutional Neural Networks, like neural networks, are made up of neurons with learnable weights and biases. Each neuron receives several inputs, takes a weighted sum over them, passes it through an activation function, and responds with an output.

The whole network has a loss function and all the tips and tricks that we developed for neural networks still apply to Convolutional Neural Networks.

Pretty straightforward, right?

A neural network, as its name suggests, is a machine learning technique modeled after the structure of the brain. It comprises a network of learning units called neurons.

These neurons learn how to convert input signals (e.g. picture of a cat) into corresponding output signals (e.g. the label “cat”), forming the basis of automated recognition.

Let’s take the example of automatic image recognition. The process of determining whether a picture contains a cat involves an activation function. If the picture resembles prior cat images the neurons have seen before, the label “cat” would be activated.

Hence, the more labeled images the neurons are exposed to, the better they learn how to recognize other unlabelled images. We call this the process of training neurons.
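The single-neuron description in the excerpt (weighted sum of inputs, plus a bias, passed through an activation function) can be sketched in a few lines of plain Python. This is an illustration only, not code from the linked post: the sigmoid activation and the function names are assumptions, and a real CNN would use a framework such as TensorFlow and share weights across image patches.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)


# Two input signals with hypothetical learned weights.
activation = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```

Training adjusts the weights and bias so that outputs like this one move toward the correct label across many labeled examples.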

I (finally) finished chapter 5 of Deep Learning in R, which is all about CNNs.  It’s interesting just how open CNNs are for post hoc understanding, totally at odds with the classic neural network reputation for being a black box full of dark magic.

Using Azure ML To Approve Expenses Automatically

Isabelle Van Campenhoudt walks us through a scenario of using Azure ML to find expense reports which should automatically be approved, reducing the workload for approvers:

My partner in crime Serge Luca, aka Doctor Flow, is the author of a nice and complex expenses approval system in Microsoft Flow.
One year ago, he asked me to add analytics to his Flow. This year he had the interesting idea to add a machine-learning-based approval to his flow and suggested that I work on it. The idea is the following: since we have a lot of approvals in our system, can a machine learn and find some decision pattern to apply automatically to each expense request?
I decided to use the Microsoft Azure Machine Learning Studio. In this tool you can build experiments and use some of the most common and useful machine learning algorithms. It was amazing to see how easy it is to create and consume machine learning.

This contrasts with Ginger Grant’s nightmare scenario pretty well:  instead of trying to get the ML process to do all of the work, create a process which takes care of the really easy stuff and leave harder tasks to specialists with a deeper understanding of the rules.  That way they don’t have to spend their time on trivialities.
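A minimal sketch of that division of labor, assuming a model that emits an approval probability for each request: only high-confidence approvals are automated, and everything else goes to a person. The class, field names, and the 0.95 threshold are hypothetical and are not taken from the linked post or from Azure ML.

```python
from dataclasses import dataclass


@dataclass
class Expense:
    request_id: str
    approval_probability: float  # hypothetical model score in [0, 1]


def route(expense: Expense, auto_approve_threshold: float = 0.95) -> str:
    """Automate only the easy approvals; send everything else to a human."""
    if not 0.0 <= expense.approval_probability <= 1.0:
        raise ValueError("approval_probability must be between 0 and 1")
    if expense.approval_probability >= auto_approve_threshold:
        return "auto_approve"
    # The model never rejects on its own: uncertain cases get a manual review.
    return "manual_review"


decisions = [route(e) for e in (Expense("taxi-17", 0.99), Expense("hotel-42", 0.40))]
```

Because the sketch has no automatic rejection path, a model mistake costs an approver a few minutes rather than turning away a legitimate request.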

No Laptop For You: A Case Of Machine Learning Failure

Ginger Grant walks us through a scenario where Lenovo refused to sell her a laptop four times:

Buying a laptop from Lenovo reminded me of an episode of Seinfeld when Elaine was trying to buy soup.  For some unknown reason, when I placed an order on their website and gave them my money, Lenovo gave me a Cancellation Notice, the email equivalent of “No Soup for you!”  After placing an order, about 15 minutes later, I received a cancellation notice.  I called customer service.  They looked at the order and advised me the system incorrectly cancelled the order.  I was told to place the order again as they had resolved the problem.  I created a new order, and just like the last time, I received the No Laptop for You cancellation email.  I called back. This time I was told that the system thinks I am a fraud. Now I have no laptop and I have been insulted.

In all the talk of ML running the future, one thing that gets forgotten is that models, being simplifications of reality, necessarily make mistakes.  Failing to have some sort of manual override means, in this case, throwing away money for no good reason.

A Pessimistic View Of The State Of Deep Learning

William Vorhies provides us with a negative (and necessary) look at the current state of Deep Learning solutions:

Reinforcement Learning (RL) is arguably the hottest research area in AI today because it appears RL can be adapted to any problem that has a well-defined reward function.  That encompasses game play, robotics, self-driving cars, and frankly pretty much everything else in machine learning.

Within RL, the hottest research area is Deep RL which means using a deep neural net as the ‘agent’ in the training.  Deep RL is seen as the form of RL with the most potential to generalize over the largest number of cases and perhaps the closest we’ve yet come to AGI (artificial general intelligence).

Importantly, Deep RL is also the technique AlphaGo used to win at Go, which brought it huge attention.

The problem, according to Alex Irpan, a researcher on the Google Brain Robotics team, is that about 70% of the time it just doesn’t work.

Alex has written a very comprehensive article critiquing the current state of Deep RL, the field with which he engages on a day-to-day basis.  He lays out a whole series of problems and we’ve elected to focus on the three that most clearly illustrate the current state of the problem with notes from his work.

Vorhies is not unduly negative and is optimistic in the medium to long term, but he is right in noting that there is a lot of work yet to do in this field.

