Press "Enter" to skip to content

Category: Machine Learning

Uses Of kd-trees

Sandipan Dey explains what a kd-tree is and how it works:

The prime advantage of a 2d-tree over a BST is that it supports efficient implementation of range search and nearest-neighbor search. Each node corresponds to an axis-aligned rectangle, which encloses all of the points in its subtree. The root corresponds to the entire plane [(−∞, −∞), (+∞, +∞ )]; the left and right children of the root correspond to the two rectangles split by the x-coordinate of the point at the root; and so forth.

  • Range search: To find all points contained in a given query rectangle, start at the root and recursively search for points in both subtrees using the following pruning rule: if the query rectangle does not intersect the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, search a subtree only if it might contain a point contained in the query rectangle.

  • Nearest-neighbor search: To find a closest point to a given query point, start at the root and recursively search in both subtrees using the following pruning rule: if the closest point discovered so far is closer than the distance between the query point and the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, search a node only if it might contain a point that is closer than the best one found so far. The effectiveness of the pruning rule depends on quickly finding a nearby point. To do this, organize the recursive method so that when there are two possible subtrees to go down, you choose first the subtree that is on the same side of the splitting line as the query point; the closest point found while exploring the first subtree may enable pruning of the second subtree.

  • k-nearest neighbors search: This method returns the k points that are closest to the query point (in any order); return all n points in the data structure if n ≤ k. It must do this in an efficient manner, i.e. using the technique from kd-tree nearest neighbor search, not from brute force.

Sandipan implements a fairly classic problem in this space:  the behavior of a group of flocking birds.

Comments closed

Overfitting On Decision Trees

Ramandeep Kaur explains overfitting as well as how to prevent overfitting on decision trees:

Causes of Overfitting

There are two major situations that could cause overfitting in DTrees:

  1. Overfitting Due to Presence of Noise – Mislabeled instances may contradict the class labels of other similar records.
  2. Overfitting Due to Lack of Representative Instances – Lack of representative instances in the training data can prevent refinement of the learning algorithm.

                      A good model must not only fit the training data well
                      but also accurately classify records it has never seen.

How to avoid overfitting?

There are 2 major approaches to avoid overfitting in DTrees.

  1. approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.

  2. approaches that allow the tree to overfit the data, and then post-prune the tree.

Click through for more details on these two approaches.

Comments closed

Sentiment Analysis In Power BI

Chris Webb has a new Power BI custom data connector:

I’m pleased to announce that I’ve published my first Power BI custom data connector on GitHub here:

https://github.com/cwebbbi/PowerBITextAnalytics

Basically, it acts as a wrapper for the Microsoft Cognitive Services Text Analytics API and  makes it extremely easy to do language detection, sentiment analysis and to extract key phrases from text when you are loading data into Power BI.

Read the whole thing, as Chris has a great demo of it.

Comments closed

Classifying Time Series Data With TensorFlow

Burak Himmetoglu applies a time series data set to two different types of neural networks using TensorFlow:

In this blog post, I will discuss the use of deep leaning methods to classify time-series data, without the need to manually engineer features. The example I will consider is the classic Human Activity Recognition (HAR) dataset from the UCI repository. The dataset contains the raw time-series data, as well as a pre-processed one with 561 engineered features. I will compare the performance of typical machine learning algorithms which use engineered features with two deep learning methods (convolutional and recurrent neural networks) and show that deep learning can surpass the performance of the former.

I have used Tensorflow for the implementation and training of the models discussed in this post.  In the discussion below, code snippets are provided to explain the implementation. For the complete code, please see my Github repository.

Click through for the samples, or check out the repo, linked above.

Comments closed

Using NLP To Find Similar Facebook Posts

The folks at Knoyd put together a word embedding example by scraping a Python Facebook group:

We are going to represent the content of a Facebook post using word embeddings and comparing the transformed posts using word mover’s distance. The combination of both have shown lower k-nearest neighbor-document classification error rates compared to other state of the art techniques.

The advantage of word embeddings is that the words which have similar meanings but don’t have any letters in common will still have similar vectors (be close) in the embedded space (e.g. lion and tiger).

There’s a good high-level discussion of techniques in this post.

Comments closed

TensorFlow On The Pi

Pete Warden shows how to install TensorFlow on a Raspberry Pi:

It’s never been easy to get TensorFlow installed on a Pi though. I had created a makefile script that let you build the C++ part from scratch, but it took several hours to complete and didn’t support Python. Sam Abrahams, an external contributor, did an amazing job maintaining a Python pip wheel for major releases, but building it required you to add swap space on a USB device for your Pi, and took even longer to compile than the makefile approach. Snips managed to get TensorFlow cross-compiling for Rust, but it wasn’t clear how to apply this to other languages.

Plenty of people on the team are Pi enthusiasts, and happily Eugene Brevdo dived in to investigate how we could improve the situation. We knew we wanted to have something that could be run as part of TensorFlow’s Jenkins continuous integration system, which meant building a completely automatic solution that would run with no user intervention. Since having a Pi plugged into a machine to run something like the makefile build would be hard to maintain, we did try using a hosted server from Mythic Beasts. Eugene got the makefile built going after a few hiccups, but the Python version required more RAM than was available, and we couldn’t plug in a USB drive remotely!

Read the whole thing, even if for the science experiment aspect.

Comments closed

Understanding Decision Trees

Ramandeep Kaur explains how decision trees work:

Simply put, a decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.

It is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems and works for both categorical and continuous input and output variables. It is one of the most widely used and practical methods for Inductive Inference. (Inductive inference is the process of reaching a general conclusion from specific examples.)

Decision trees learn and train itself from given examples and predict for unseen examples.

Click through for an example of implementing the ID3 algorithm and generating a decision tree from a data set.

Comments closed

Deep Learning Isn’t The End-All Be-All Solution

Pablo Cordero explains that deep learning solutions are not the best choice in all cases:

The second preconception I hear the most is the hype. Many yet-to-be practitioners expect deep nets to give them a mythical performance boost just because it worked in other fields. Others are inspired by impressive work in modeling and manipulating images, music, and language – three data types close to any human heart – and rush headfirst into the field by trying to train the latest GAN architecture. The hype is real in many ways. Deep learning has become an undeniable force in machine learning and an important tool in the arsenal of any data modeler. Its popularity has brought forth essential frameworks such as tensorflow and pytorch that are incredibly useful even outside deep learning. Its underdog to superstar origin story has inspired researchers to revisit other previously obscure methods like evolutionary strategies and reinforcement learning. But it’s not a panacea by any means. Aside from lunch considerations, deep learning models can be very nuanced and require careful and sometimes very expensive hyperparameter searches, tuning, and testing (much more on this later in the post). Besides, there are many cases where using deep learning just doesn’t make sense from a practical perspective and simpler models work much better.

It’s a very interesting article, pointing out that deep learning solutions work better than expected on smaller data sizes, but there are areas where it’s preferable to choose something else.

Comments closed

Scaling Out Random Forest

Denis C. Bauer, et al, explain VariantSpark RF, a random forest algorithm designed for huge numbers of variables:

VariantSpark RF starts by randomly assigning subsets of the data to Spark Executors for decision tree building (Fig 1). It then calculates the best split over all nodes and trees simultaneously. This implementation avoids communication bottlenecks between Spark Driver and Executors as information exchange is minimal, allowing it to build large numbers of trees efficiently. This surveys the solution space appropriately to cater for millions of features and thousands of samples.

Furthermore, VariantSpark RF has memory efficient representation of genomics data, optimized communication patterns and computation batching. It also provides efficient implementation of Out-Of-Bag (OOB) error, which substantially simplifies parameter tuning over the computationally more costly alternative of cross-validation.

We implemented VariantSpark RF in scala as it is the most performant interface languages to Apache Spark. Also, new updates to Spark and the interacting APIs will be deployed in scala first, which has been important when working on top of a fast evolving framework.

Give it a read.  Thankfully, I exhibit few of the traits of the degenerative disease known as Hipsterism.

Comments closed

Multiple Data Sets In External Scripts

Tomaz Kastrun shows a workaround to the “one data set” limit in sp_execute_external_script:

Some of the  arguments of the procedure sp_execute_external_script are enumerated. This is valid for the inputting dataset and as the name of argument @input_data_1 suggests, one can easily (and this is valid doubt) think, there can also be @input_data_2 argument, and so on. Unfortunately, this is not true.  External procedure can hold only one T-SQL dataset, inserted through this parameter.

There are many reasons for that, one would be the cost of sending several datasets to external process and back, so inadvertently, this forces user to rethink and pre-prepare the dataset (meaning, do all the data munging beforehand), prior to sending it into external procedure.

But there are workarounds on how to pass additional query/queries to sp_execute_external_script. I am not advocating this, and I strongly disagree with such usage, but here it is.

It does feel like a hinky solution, but sometimes you just need to get two data sets in.

Comments closed