Press "Enter" to skip to content

Category: Machine Learning

Scaling Out Random Forest

Denis C. Bauer, et al, explain VariantSpark RF, a random forest algorithm designed for huge numbers of variables:

VariantSpark RF starts by randomly assigning subsets of the data to Spark Executors for decision tree building (Fig 1). It then calculates the best split over all nodes and trees simultaneously. This implementation avoids communication bottlenecks between Spark Driver and Executors as information exchange is minimal, allowing it to build large numbers of trees efficiently. This surveys the solution space appropriately to cater for millions of features and thousands of samples.

Furthermore, VariantSpark RF has memory efficient representation of genomics data, optimized communication patterns and computation batching. It also provides efficient implementation of Out-Of-Bag (OOB) error, which substantially simplifies parameter tuning over the computationally more costly alternative of cross-validation.

We implemented VariantSpark RF in scala as it is the most performant interface languages to Apache Spark. Also, new updates to Spark and the interacting APIs will be deployed in scala first, which has been important when working on top of a fast evolving framework.

Give it a read.  Thankfully, I exhibit few of the traits of the degenerative disease known as Hipsterism.

Comments closed

Multiple Data Sets In External Scripts

Tomaz Kastrun shows a workaround to the “one data set” limit in sp_execute_external_script:

Some of the  arguments of the procedure sp_execute_external_script are enumerated. This is valid for the inputting dataset and as the name of argument @input_data_1 suggests, one can easily (and this is valid doubt) think, there can also be @input_data_2 argument, and so on. Unfortunately, this is not true.  External procedure can hold only one T-SQL dataset, inserted through this parameter.

There are many reasons for that, one would be the cost of sending several datasets to external process and back, so inadvertently, this forces user to rethink and pre-prepare the dataset (meaning, do all the data munging beforehand), prior to sending it into external procedure.

But there are workarounds on how to pass additional query/queries to sp_execute_external_script. I am not advocating this, and I strongly disagree with such usage, but here it is.

It does feel like a hinky solution, but sometimes you just need to get two data sets in.

Comments closed

Neural Networks From Scratch

Ilia Karmanov explains neural nets and shows how to build one in R:

Hence, my motivation for this post is two-fold:

  1. Understanding (by writing from scratch) the leaky abstractions behind neural-networks dramatically shifted my focus to elements whose importance I initially overlooked. If my model is not learning I have a better idea of what to address rather than blindly wasting time switching optimisers (or even frameworks).
  2. A deep-neural-network (DNN), once taken apart into lego blocks, is no longer a black-box that is inaccessible to other disciplines outside of AI. It’s a combination of many topics that are very familiar to most people with a basic knowledge of statistics. I believe they need to cover very little (just the glue that holds the blocks together) to get an insight into a whole new realm.

Starting from a linear regression we will work through the maths and the code all the way to a deep-neural-network (DNN) in the accompanying R-notebooks. Hopefully to show that very little is actually new information.

This is pretty detailed.  Karmanov mentions Andrej Karpathy, whose Hacker’s guide to Neural Networks is also a must-read on the topic.

Comments closed

Installing Python Support In SQL Server

Ginger Grant has a teaser for her upcoming 24 Hours of PASS talk:

The process for using Python in SQL Server is very similar to the previous process of installing R.  Microsoft renamed R Services to Machine Learning Services, and now allows both R and Python to be installed, as shown in the screen.  Microsoft’s version of Python uses Anaconda, which is an open source analytics platform created by Continuum. This is where Python differs from other open source languages, as Continuum is providing the version of Python as it contains data science components which are not included in the standard distribution of Python. Continuum also sells an enterprise version of Anaconda, with of course more features than come with the free version. It is important to remember the python environment as you will need select the same distribution when running Python code outside of SQL Server.

Read on to see how to install Python support in SQL Server 2017 and for a few links to tools.

Comments closed

Neural Nets With R And Power BI

Leila Etaati continues her series on using neural nets in Power BI:

we are going to predict the concrete strength using neural network. neural network can be used for predict a value or class, or it can be used for predicting multiple items. In this example, we are going to predict a value, that is concrete strength.

I have loaded the data in power bi first, and in “Query Editor” I am going to write some R codes. First we need to do some data transformations. As you can see in the below picture number 2,3 and 4,data is not in a same scale, we need to do some data normalization before applying any machine learning. I am going to write a code for that (Already explained the normalization in post KNN). So to write some R codes, I just click on the R transformation component (number 5).

There’s a lot going on in this demo; check it out.

Comments closed

Neural Nets On Spark

Nisha Muktewar and Seth Hendrickson show how to use Deeplearning4j to build deep learning models on Hadoop and Spark:

Modern convolutional networks can have several hundred million parameters. One of the top-performing neural networks in the Large Scale Visual Recognition Challenge (also known as “ImageNet”), has 140 million parameters to train! These networks not only take a lot of compute and storage resources (even with a cluster of GPUs, they can take weeks to train), but also require a lot of data. With only 30000 images, it is not practical to train such a complex model on Caltech-256 as there are not enough examples to adequately learn so many parameters. Instead, it is better to employ a method called transfer learning, which involves taking a pre-trained model and repurposing it for other use cases. Transfer learning can also greatly reduce the computational burden and remove the need for large swaths of specialized compute resources like GPUs.

It is possible to repurpose these models because convolutional neural networks tend to learn very general features when trained on image datasets, and this type of feature learning is often useful on other image datasets. For example, a network trained on ImageNet is likely to have learned how to recognize shapes, facial features, patterns, text, and so on, which will no doubt be useful for the Caltech-256 dataset.

This is a longer post, but on an extremely interesting topic.

Comments closed

Running H2O In R On Azure HDInsight

Daisy Deng shows how to configure HDInsight to be able to run the H2O package in R rather than Python or Scala:

We provide a few script actions for installing rsparkling on Azure HDInsight. When creating the HDInsight cluster, you can run the following script action for header node:

https://bostoncaqs.blob.core.windows.net/scriptaction/scriptaction-head.sh

And run the following action for the worker node:

https://bostoncaqs.blob.core.windows.net/scriptaction/scriptaction-worker.sh

Please consult Customize Linux-based HDInsight clusters using Script Action for more details.

Click through for the full process.

Comments closed

Basics Of Neural Nets

Leila Etaati has a new series on neural nets in R:

in Neural Network, we have some hidden Nodes that do the main job ! they found the best value for the output, they are using some function that we call that functions as “Activation function” for instance in below picture, Node C is a hidden node that take the values from node A and B. as you can see the weight (the better path) related to Node B as shown in tick line that means Node B may lead to get better results so Node C get input values from Node B not Node A.

If you have time, also check out the linked YouTube videos.

Comments closed

Envisioning Neural Nets As Org Charts

Maiia Bakhova describes the layout of a neural net as similar to a chain of command within an organization:

We can observe a lot of in common with a corporation chain of command. As we see middle managers are hidden layers which do the balk of the job.  We have the similar information flow and processing which is analogous to forward propagation and backward propagation.
What is left now is to explain that  dealing with sigmoid function at each node is too costly so it mostly reserved for CEO level.

That’s a metaphor I hadn’t heard before.

Comments closed

Spark And H2O

Avkash Chauhan shows how to use sparklyr and rsparkling to tie Spark together with the H2O library in R:

In order to work with Spark H2O using rsparkling and sparklyr in R, you must first ensure that you have both sparklyr and rsparkling installed.

Once you’ve done that, you can check out the working script, the code for testing the Spark context, and the code for launching H2O Flow. All of this information can be found below.

It’s a short post, but it does show how to kick off a job.

Comments closed