The cliques considered here are defined by circles (in two dimensions) or spheres (in three dimensions.) In the most basic version, we have one clique for each cluster, and the clique is defined as the smallest circle containing a pre-specified proportion p of the points from the cluster in question. If the clusters are well separated, we can even use p = 1. We define the density of a clique as the number of points per unit area. In general, we want to build cliques with high density.
Ideally, we want each cluster in the training set to be covered by a small number of (possibly slightly overlapping) cliques, each one having a high density. Also, as a general rule, a training set point can only belong to one clique, and (ideally) to only one cluster. But the circles associated with two cliques are allowed to overlap.
It’s an interesting approach, and I can see how it’d be faster than K Nearest Neighbors, but I do wonder how accurate the results would be in comparison to KNN.
The process of training an image recognition system requires LOTS of images — millions and millions of them. The process involves feeding those images into a deep neural network, and during that process the network generates “features” from the image. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these “features”, you could use them in a traditional machine learning model to classify the images, or perform other recognition tasks.
But if you don’t have millions of images, it’s still possible to generate these features from a model that has already been trained on millions of images. ResNet is a very deep neural network model trained for the task of image recognition which has been used to win major computer-vision competitions. With the
rxFeaturizefunction in Microsoft R Client and Microsoft R Server, you can generate 4096 features from this model on any image you provide. The features themselves are meaningful only to a computer, but that vector of 4096 numbers between zero and one is (ideally) a distillation of the unique characteristics of that image as a human would recognize it. You can then use that features vector to create your own image-recognition system without the burden of training your own neural network on a large corpus of images.
Read the whole thing and follow David’s link to the Microsoft Cognitive blog for more details.
Simply put, a decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
It is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems and works for both categorical and continuous input and output variables. It is one of the most widely used and practical methods for Inductive Inference. (Inductive inference is the process of reaching a general conclusion from specific examples.)
Decision trees learn and train itself from given examples and predict for unseen examples.
Click through for an example of implementing the ID3 algorithm and generating a decision tree from a data set.
Tim Sweester and Aaron Bradley announce Diamond, a Python library which solves certain kinds of generalized linear models. In a two-part series, they explain more. Part 1 covers the mathematical principles behind it:
Many computational problems in data science and statistics can be cast as convex problems. There are many advantages to doing so:
- Convex problems have a unique global solution, i.e. there is one best answer
- There are well-known, efficient, and reliable algorithms for finding it
One ubiquitous example of a convex problem in data science is finding the coefficients of an-regularized logistic regression model using maximum likelihood. In this post, we’ll talk about some basic algorithms for convex optimization, and discuss our attempts to make them scale up to the size of our models. Unlike many applications, the “scale” challenge we faced was not the number of observations, but the number of features in our datasets. First, let’s review the model we want to fit.
In this example, GLMMs allow you to pool information across different brands, while still learning individual effects for each brand. It breaks the problem into sets of fixed and random effects. The fixed effects are similar to what you would find in a traditional logistic regression model, while the random effects allow the regression relationship to vary for each brand. One of the advantages of GLMMs is that they learn how different brands are from each other. Brands that are very similar to the overall average will have small random effect estimates. Because of the regularization of these models, brands with few observations will also have small random effect estimates, and be treated more like the overall average. In contrast, for brands that are very different from the average, with lots of data to support that, GLMMs will learn large random effect estimates.
Check it out. Part 2 also contains a link to the GitHub repo if you want to try it on your own.
Why do we assume in the first place? The answer is that making assumptions helps simplify problems and simplified problems can then be solved accurately. To divide your dataset into clusters, one must define the criteria of a cluster and those make the assumptions for the technique. K-Means clustering method considers two assumptions regarding the clusters – first that the clusters are spherical and second that the clusters are of similar size. Spherical assumption helps in separating the clusters when the algorithm works on the data and forms clusters. If this assumption is violated, the clusters formed may not be what one expects. On the other hand, assumption over the size of clusters helps in deciding the boundaries of the cluster. This assumption helps in calculating the number of data points each cluster should have. This assumption also gives an advantage. Clusters in K-means are defined by taking the mean of all the data points in the cluster. With this assumption, one can start with the centers of clusters anywhere. Keeping the starting points of the clusters anywhere will still make the algorithm converge with the same final clusters as keeping the centers as far apart as possible.
Read on as Chaitanya shows several examples; the polar coordinate transformation was quite interesting. H/T R-Bloggers
Once you’ve digested the illustration at the top of this article, yeah, you’ve kind already got it.
- Forecasting is when we anticipate the behavior of “Lots” of people (customers, typically) on “Long” timelines.
- Predictive Analytics anticipate the behavior of One person (again, typically a customer) on a “Short” timeline.
So… Macro versus Micro.
But let’s delve just a little bit deeper, in order to “cement” the concepts.
There’s a very useful distinction here and Rob does well to flesh out the details. I highly recommend this if you’re curious about micro- versus macro-level predictions.
The second preconception I hear the most is the hype. Many yet-to-be practitioners expect deep nets to give them a mythical performance boost just because it worked in other fields. Others are inspired by impressive work in modeling and manipulating images, music, and language – three data types close to any human heart – and rush headfirst into the field by trying to train the latest GAN architecture. The hype is real in many ways. Deep learning has become an undeniable force in machine learning and an important tool in the arsenal of any data modeler. Its popularity has brought forth essential frameworks such as tensorflow and pytorch that are incredibly useful even outside deep learning. Its underdog to superstar origin story has inspired researchers to revisit other previously obscure methods like evolutionary strategies and reinforcement learning. But it’s not a panacea by any means. Aside from lunch considerations, deep learning models can be very nuanced and require careful and sometimes very expensive hyperparameter searches, tuning, and testing (much more on this later in the post). Besides, there are many cases where using deep learning just doesn’t make sense from a practical perspective and simpler models work much better.
It’s a very interesting article, pointing out that deep learning solutions work better than expected on smaller data sizes, but there are areas where it’s preferable to choose something else.
The table above is a matrix of numbers. I am going to call it Z. The singular value decomposition is computed using the svd function. The following code computes the singular value decomposition of the matrix Z, and assigns it to a new object called SVD, which contains one vector, d, and two matrices, u and v. The vector, d, contains the singular values. The first matrix, u, contains the left singular vectors, and vcontains the right singular vectors. The left singular vectors represent the rows of the input table, and the right singular vectors represent their columns.
Tim includes R scripts to follow along, and for this topic I definitely recommend following along.
VariantSpark RF starts by randomly assigning subsets of the data to Spark Executors for decision tree building (Fig 1). It then calculates the best split over all nodes and trees simultaneously. This implementation avoids communication bottlenecks between Spark Driver and Executors as information exchange is minimal, allowing it to build large numbers of trees efficiently. This surveys the solution space appropriately to cater for millions of features and thousands of samples.
Furthermore, VariantSpark RF has memory efficient representation of genomics data, optimized communication patterns and computation batching. It also provides efficient implementation of Out-Of-Bag (OOB) error, which substantially simplifies parameter tuning over the computationally more costly alternative of cross-validation.
We implemented VariantSpark RF in scala as it is the most performant interface languages to Apache Spark. Also, new updates to Spark and the interacting APIs will be deployed in scala first, which has been important when working on top of a fast evolving framework.
Give it a read. Thankfully, I exhibit few of the traits of the degenerative disease known as Hipsterism.
Each of these people can be represented as points in a 3 Dimensional space. With a gross approximation, each people is in a 50*50*200 (cm) cube. If we use a resolution of 1cm and three color channels, then can be represented by 1,000,000 variables.
On the other hand, the shadow is only in 2 dimensions and in black and white, so each shadow only needs 50*200=10,000 variables.
The number of variables was divided by 100 ! And if your goal is to detect human vs cat, or even men vs women, the data from the shadow may be enough.
Read on for intuitive discussions of techniques like principal component analysis and linear discriminant analysis. H/T R-Bloggers