TensorFlow On The Pi

Pete Warden shows how to install TensorFlow on a Raspberry Pi:

It’s never been easy to get TensorFlow installed on a Pi though. I had created a makefile script that let you build the C++ part from scratch, but it took several hours to complete and didn’t support Python. Sam Abrahams, an external contributor, did an amazing job maintaining a Python pip wheel for major releases, but building it required you to add swap space on a USB device for your Pi, and took even longer to compile than the makefile approach. Snips managed to get TensorFlow cross-compiling for Rust, but it wasn’t clear how to apply this to other languages.

Plenty of people on the team are Pi enthusiasts, and happily Eugene Brevdo dived in to investigate how we could improve the situation. We knew we wanted to have something that could be run as part of TensorFlow’s Jenkins continuous integration system, which meant building a completely automatic solution that would run with no user intervention. Since having a Pi plugged into a machine to run something like the makefile build would be hard to maintain, we did try using a hosted server from Mythic Beasts. Eugene got the makefile build going after a few hiccups, but the Python version required more RAM than was available, and we couldn’t plug in a USB drive remotely!

Read the whole thing, if only for the science experiment aspect.

Understanding Decision Trees

Ramandeep Kaur explains how decision trees work:

Simply put, a decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.

It is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems and works for both categorical and continuous input and output variables. It is one of the most widely used and practical methods for Inductive Inference. (Inductive inference is the process of reaching a general conclusion from specific examples.)

Decision trees learn from given examples and make predictions for unseen examples.

Click through for an example of implementing the ID3 algorithm and generating a decision tree from a data set.
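
As a rough illustration of the idea behind ID3 (not the code from the linked post), the sketch below computes entropy and information gain and picks the categorical feature with the highest gain as the next split. The tiny data set and the function names are made up for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction from splitting the rows on one categorical feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_feature(rows, labels, features):
    """ID3 greedily splits on the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))

# Hypothetical toy data: 'outlook' fully determines the label, 'windy' does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

print(best_feature(rows, labels, ["outlook", "windy"]))
```

A full ID3 implementation would apply `best_feature` recursively to each group until the labels in a group are pure or no features remain.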

Deep Learning Isn’t The End-All Be-All Solution

Pablo Cordero explains that deep learning solutions are not the best choice in all cases:

The second preconception I hear the most is the hype. Many yet-to-be practitioners expect deep nets to give them a mythical performance boost just because it worked in other fields. Others are inspired by impressive work in modeling and manipulating images, music, and language – three data types close to any human heart – and rush headfirst into the field by trying to train the latest GAN architecture. The hype is real in many ways. Deep learning has become an undeniable force in machine learning and an important tool in the arsenal of any data modeler. Its popularity has brought forth essential frameworks such as TensorFlow and PyTorch that are incredibly useful even outside deep learning. Its underdog to superstar origin story has inspired researchers to revisit other previously obscure methods like evolutionary strategies and reinforcement learning. But it’s not a panacea by any means. Aside from “no free lunch” considerations, deep learning models can be very nuanced and require careful and sometimes very expensive hyperparameter searches, tuning, and testing (much more on this later in the post). Besides, there are many cases where using deep learning just doesn’t make sense from a practical perspective and simpler models work much better.

It’s a very interesting article, pointing out that deep learning solutions work better than expected on smaller data sizes, but there are areas where it’s preferable to choose something else.

Scaling Out Random Forest

Denis C. Bauer, et al., explain VariantSpark RF, a random forest algorithm designed for huge numbers of variables:

VariantSpark RF starts by randomly assigning subsets of the data to Spark Executors for decision tree building (Fig 1). It then calculates the best split over all nodes and trees simultaneously. This implementation avoids communication bottlenecks between Spark Driver and Executors as information exchange is minimal, allowing it to build large numbers of trees efficiently. This surveys the solution space appropriately to cater for millions of features and thousands of samples.

Furthermore, VariantSpark RF has memory efficient representation of genomics data, optimized communication patterns and computation batching. It also provides efficient implementation of Out-Of-Bag (OOB) error, which substantially simplifies parameter tuning over the computationally more costly alternative of cross-validation.

We implemented VariantSpark RF in Scala as it is the most performant interface language to Apache Spark. Also, new updates to Spark and the interacting APIs will be deployed in Scala first, which has been important when working on top of a fast evolving framework.

Give it a read.  Thankfully, I exhibit few of the traits of the degenerative disease known as Hipsterism.
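
To make the out-of-bag idea from the excerpt concrete, here is a minimal sketch of how an OOB error estimate works in general. It is not VariantSpark's implementation: to stay short, each ensemble member is a 1-nearest-neighbour predictor on one-dimensional data rather than a decision tree, and the data, function name, and parameters are invented for the example.

```python
import random

def oob_error(xs, ys, n_models=50, seed=0):
    """Estimate error using only predictions from models that never saw the sample."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[] for _ in range(n)]  # out-of-bag predictions collected per sample
    for _ in range(n_models):
        # Bootstrap sample: draw n indices with replacement.
        in_bag = {rng.randrange(n) for _ in range(n)}
        for i in range(n):
            if i in in_bag:
                continue  # sample i was used for training, so this model may not vote
            # Stand-in base learner (assumption): copy the label of the nearest in-bag point.
            nearest = min(in_bag, key=lambda j: abs(xs[j] - xs[i]))
            votes[i].append(ys[nearest])
    # Majority vote per sample, skipping samples that were never out of bag.
    scored = [(max(set(v), key=v.count), y) for v, y in zip(votes, ys) if v]
    return sum(pred != y for pred, y in scored) / len(scored)

# Hypothetical toy data: two well-separated clusters.
xs = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
err = oob_error(xs, ys)
print(err)
```

Because every sample is left out of roughly a third of the bootstrap samples, the ensemble gets a held-out estimate from a single training run, which is why OOB error is cheaper than cross-validation.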

Multiple Data Sets In External Scripts

Tomaz Kastrun shows a workaround to the “one data set” limit in sp_execute_external_script:

Some of the arguments of the procedure sp_execute_external_script are enumerated. This is true of the input dataset, and as the name of the argument @input_data_1 suggests, one can easily think (and this is a valid doubt) that there can also be an @input_data_2 argument, and so on. Unfortunately, this is not true. The external procedure can hold only one T-SQL dataset, inserted through this parameter.

There are many reasons for that; one would be the cost of sending several datasets to the external process and back. So, inadvertently, this forces the user to rethink and pre-prepare the dataset (meaning, do all the data munging beforehand) prior to sending it into the external procedure.

But there are workarounds on how to pass additional query/queries to sp_execute_external_script. I am not advocating this, and I strongly disagree with such usage, but here it is.

It does feel like a hinky solution, but sometimes you just need to get two data sets in.

Neural Networks From Scratch

Ilia Karmanov explains neural nets and shows how to build one in R:

Hence, my motivation for this post is two-fold:

  1. Understanding (by writing from scratch) the leaky abstractions behind neural-networks dramatically shifted my focus to elements whose importance I initially overlooked. If my model is not learning I have a better idea of what to address rather than blindly wasting time switching optimisers (or even frameworks).
  2. A deep-neural-network (DNN), once taken apart into lego blocks, is no longer a black-box that is inaccessible to other disciplines outside of AI. It’s a combination of many topics that are very familiar to most people with a basic knowledge of statistics. I believe they need to cover very little (just the glue that holds the blocks together) to get an insight into a whole new realm.

Starting from a linear regression we will work through the maths and the code all the way to a deep-neural-network (DNN) in the accompanying R-notebooks. Hopefully this will show that very little is actually new information.

This is pretty detailed.  Karmanov mentions Andrej Karpathy, whose Hacker’s guide to Neural Networks is also a must-read on the topic.
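
The starting point Karmanov describes, a linear regression fitted by gradient descent, can be sketched in a few lines. This is a plain-Python illustration under my own assumptions (mean squared error loss, a fixed learning rate, made-up data), not code from his R notebooks.

```python
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Forward pass: prediction error for every sample.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
print(round(w, 3), round(b, 3))
```

A neural network repeats the same loop (forward pass, gradients, update) with more layers and a non-linear activation between them.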

Installing Python Support In SQL Server

Ginger Grant has a teaser for her upcoming 24 Hours of PASS talk:

The process for using Python in SQL Server is very similar to the previous process of installing R. Microsoft renamed R Services to Machine Learning Services, and now allows both R and Python to be installed, as shown in the screen. Microsoft’s version of Python uses Anaconda, which is an open source analytics platform created by Continuum. This is where Python differs from other open source languages, as Continuum provides a version of Python that contains data science components which are not included in the standard distribution of Python. Continuum also sells an enterprise version of Anaconda, with of course more features than come with the free version. It is important to remember the Python environment, as you will need to select the same distribution when running Python code outside of SQL Server.

Read on to see how to install Python support in SQL Server 2017 and for a few links to tools.

Neural Nets With R And Power BI

Leila Etaati continues her series on using neural nets in Power BI:

We are going to predict the concrete strength using a neural network. A neural network can be used to predict a value or class, or it can be used for predicting multiple items. In this example, we are going to predict a value, that is, concrete strength.

I have loaded the data in Power BI first, and in “Query Editor” I am going to write some R code. First we need to do some data transformations. As you can see in the picture below (numbers 2, 3 and 4), the data is not on the same scale, so we need to do some data normalization before applying any machine learning. I am going to write code for that (I already explained normalization in the KNN post). So to write some R code, I just click on the R transformation component (number 5).

There’s a lot going on in this demo; check it out.
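
The excerpt does not say which normalization is used; a common choice before training a neural network is min-max scaling, which maps each column onto the range 0 to 1. A minimal sketch, assuming min-max scaling and using invented sample values:

```python
def min_max_normalize(values):
    """Rescale a column of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical column on a much larger scale than the other inputs.
cement = [10, 20, 30]
print(min_max_normalize(cement))
```

Each input column is scaled separately, so features measured in different units end up on a comparable scale.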

Neural Nets On Spark

Nisha Muktewar and Seth Hendrickson show how to use Deeplearning4j to build deep learning models on Hadoop and Spark:

Modern convolutional networks can have several hundred million parameters. One of the top-performing neural networks in the Large Scale Visual Recognition Challenge (also known as “ImageNet”), has 140 million parameters to train! These networks not only take a lot of compute and storage resources (even with a cluster of GPUs, they can take weeks to train), but also require a lot of data. With only 30000 images, it is not practical to train such a complex model on Caltech-256 as there are not enough examples to adequately learn so many parameters. Instead, it is better to employ a method called transfer learning, which involves taking a pre-trained model and repurposing it for other use cases. Transfer learning can also greatly reduce the computational burden and remove the need for large swaths of specialized compute resources like GPUs.

It is possible to repurpose these models because convolutional neural networks tend to learn very general features when trained on image datasets, and this type of feature learning is often useful on other image datasets. For example, a network trained on ImageNet is likely to have learned how to recognize shapes, facial features, patterns, text, and so on, which will no doubt be useful for the Caltech-256 dataset.

This is a longer post, but on an extremely interesting topic.

Running H2O In R On Azure HDInsight

Daisy Deng shows how to configure HDInsight to be able to run the H2O package in R rather than Python or Scala:

We provide a few script actions for installing rsparkling on Azure HDInsight. When creating the HDInsight cluster, you can run the following script action for the head node:


And run the following action for the worker node:


Please consult Customize Linux-based HDInsight clusters using Script Action for more details.

Click through for the full process.

