Non-Linear Classifiers with Support Vector Machines

Rahul Khanna continues a series on support vector machines:

In this blog post, we will look at a detailed explanation of how to use SVM for complex decision boundaries and build Non-Linear Classifiers using SVM. The primary method for doing this is by using Kernels.

In linear SVM, we find the margin-maximizing hyperplane with features Xi. Similarly, in logistic regression, we try to find the hyperplane that minimizes logistic loss with features Xi. Most often, when we use both of these techniques, the results are the same. But linear SVM, or for that matter logistic regression, would fail where there is a need for complex or non-linear decision boundaries. SVM achieves these types of boundaries using kernels. So let us understand how SVM creates non-linear boundaries using kernels.

Read on to see how it works.
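
To make the kernel idea concrete, here is a minimal sketch in plain Python. It is not taken from the linked post: the XOR-style points, the degree-2 polynomial kernel, and the helper names (phi, dot, poly_kernel) are all assumptions chosen for illustration. It checks that the kernel, computed in the original two dimensions, matches a dot product in a mapped feature space, and that the classes become linearly separable in that space.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z):
    """Degree-2 polynomial kernel (x . z)^2, computed without mapping."""
    return dot(x, z) ** 2

# XOR-style toy data: label +1 when both coordinates share a sign.
# No straight line in the original 2-D space separates these classes.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]

# The kernel value equals the dot product in the mapped space.
kernel_matches = all(
    math.isclose(poly_kernel(x, z), dot(phi(x), phi(z)), abs_tol=1e-9)
    for x in points
    for z in points
)

# In the mapped space, the sign of the middle coordinate (sqrt(2) * x1 * x2)
# separates the two classes with a flat boundary.
separable = all((phi(p)[1] > 0) == (y == 1) for p, y in zip(points, labels))

print(kernel_matches, separable)
```

An SVM with a kernel only ever needs the kernel values between pairs of points, which is why it can fit a non-linear boundary without building the mapped features explicitly.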

Vectors for Programmers

John Mount has a couple of videos available:

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas programmers find interesting about vectors, which concepts they consider safe starting points, and how to condense and present the material.

Click through for the links, one with Python examples and the other with R examples.

Defining TF-IDF

Bruno Stecanella explains the concept behind TF-IDF:

TF-IDF was invented for document search and information retrieval. A word’s score increases in proportion to the number of times the word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don’t mean much to that document in particular.

However, if the word Bug appears many times in a document, while not appearing many times in others, it probably means that it’s very relevant. For example, if what we’re doing is trying to find out which topics some NPS responses belong to, the word Bug would probably end up being tied to the topic Reliability, since most responses containing that word would be about that topic.

This makes the technique useful for natural language processing, especially in classification problems.
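
The weighting described above can be sketched in a few lines of plain Python. This is one common unsmoothed variant (term frequency times the log of inverse document frequency), not necessarily the exact formula the linked post or any particular library uses. The three toy documents and the function name tf_idf are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for this sketch.
docs = [
    "the app has a bug and the bug crashes the app",
    "the support team was quick and the answer was clear",
    "the price is fair and the plan is flexible",
]

def tf_idf(docs):
    """Return one {word: score} dict per document (unsmoothed TF-IDF)."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # Term frequency times inverse document frequency.
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
print(scores[0]["the"], scores[0]["bug"])
```

Here "the" occurs in every document, so its inverse document frequency is log(1) and its score is zero, while "bug" occurs twice in one document and nowhere else, so it scores higher than any word that appears only once there.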

Sentiment Analysis with Python

Bruno Stecanella shows us how to use MonkeyLearn to perform sentiment analysis in Python:

Sentiment analysis is a set of Natural Language Processing (NLP) techniques that takes a text (in more academic circles, a document) written in natural language and extracts the opinions present in the text.

In a more practical sense, our objective here is to take a text and produce a label (or labels) that summarizes the sentiment of this text, e.g. positive, neutral, and negative.

For example, if we were dealing with hotel reviews, we would want the sentence ‘The staff were lovely’ to be labeled as Positive, and the sentence ‘The shared bathroom was absolutely disgusting’ labeled as Negative.

Click through for a demo.

Repeated Cross-Validation in R

Ludvig Olsen walks us through a couple of nice R packages:

The benefits of using groupdata2 to create the folds are 1) that it allows us to balance the ratios of our output classes (or simply a categorical column, if we are working with linear regression instead of classification), and 2) that it allows us to keep all observations with a specific ID (e.g. participant/user ID) in the same fold to avoid leakage between the folds.

The benefit of cvms is that it trains all the models and outputs a tibble (data frame) with results, predictions, model coefficients, and other sweet stuff, which is easy to add to a report or do further analyses on. It even allows us to cross-validate multiple model formulas at once to quickly compare them and select the best model.

Ludvig also gives us some examples of how both packages can help you out. H/T R-Bloggers
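
The second benefit, keeping every observation with a given ID in one fold, can be illustrated with a short Python sketch. This is not how groupdata2 works internally and it does not balance class ratios. The function name group_folds and the participant IDs are hypothetical.

```python
import random

def group_folds(ids, k, seed=0):
    """Assign a fold to each row so that rows sharing an ID share a fold."""
    unique_ids = sorted(set(ids))
    # Shuffle the IDs (not the rows) so whole groups move together.
    random.Random(seed).shuffle(unique_ids)
    fold_of_id = {uid: i % k for i, uid in enumerate(unique_ids)}
    return [fold_of_id[uid] for uid in ids]

# Hypothetical participant IDs, one entry per observation.
participant_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
folds = group_folds(participant_ids, k=3)

for pid, fold in zip(participant_ids, folds):
    print(pid, fold)
```

Because folds are assigned per ID rather than per row, no participant appears in both the training and the test portion of any split, which is the leakage the quoted passage warns about.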

Exploratory Data Analysis on Categorical Variables

Giorgio Garziano continues digging into earthquake data:

To understand relationships or dependencies among categorical variables, we take advantage of various types of tables and graphical methods. Stratifying variables can also be included in order to highlight whether the relationship between two primary variables is the same or different across all levels of the stratifying variable under consideration.

Contingency tables are said to be one-way when they involve just one categorical variable. They are said to be two-way when they involve two categorical variables, and so on (N-way).

Read on for various techniques for data analysis against categorical variables.
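
As a rough illustration of one-way and two-way contingency tables, here is a small Python sketch using only the standard library. The linked post works in R. The records and the category names (depth, strength) below are invented for the example and do not come from the earthquake data set.

```python
from collections import Counter

# Invented (depth, strength) records, one tuple per observation.
records = [
    ("shallow", "minor"), ("shallow", "strong"), ("deep", "minor"),
    ("shallow", "minor"), ("deep", "minor"), ("deep", "strong"),
]

# One-way table: counts for a single categorical variable.
one_way = Counter(depth for depth, _ in records)

# Two-way table: counts for each combination of two variables.
two_way = Counter(records)

print(dict(one_way))
for (depth, strength), count in sorted(two_way.items()):
    print(depth, strength, count)
```

Adding a stratifying variable would amount to keying the counter on a three-element tuple, giving a three-way table.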

Techniques for Determining Cluster Counts

Stephanie Glen has an image which describes different techniques for figuring out how many clusters you have:

If you want to determine the optimal number of clusters in your analysis, you’re faced with an overwhelming number of (mostly subjective) choices. Note that there’s no “best” method, no “correct” k, and there isn’t even a consensus as to the definition of what a “cluster” is. With that said, this picture focuses on three popular methods that should fit almost every need: Silhouette, Elbow, and Gap Statistic.

Click through for the picture and references.
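
Of the three methods named above, the silhouette is the simplest to sketch. The plain-Python example below computes the mean silhouette coefficient for two candidate labelings of the same one-dimensional toy data. The data, the labelings, and the function name silhouette are assumptions for illustration; in practice a clustering algorithm would produce the labels for each candidate k.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points under a given labeling."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(abs(p - q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data with two visually obvious groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
score_k2 = silhouette(points, [0, 0, 0, 1, 1, 1])
score_k3 = silhouette(points, [0, 0, 2, 1, 1, 1])  # splits the first group

print(score_k2, score_k3)
```

The labeling with two clusters scores higher than the one that splits a tight group in two, which is the kind of comparison the silhouette method uses to pick k.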

Predicting Database Growth

James Livingston uses linear regression to plot database growth over time:

Utilizing the equation for a line, instead of solving for y we will solve for x, where:
– x corresponds to the day we will hit capacity based on current growth rate
– y corresponds to drive capacity in GB
– m is the slope of our regression line, provided by the model via lm.coef_
– b is the intercept of the regression line, also provided by the model via lm.intercept_

Click through for an example. This is one of the areas where DBAs can gain a lot by learning a bit of data science.
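
The rearrangement described above, solving y = m*x + b for x, can be sketched as follows. The quoted post reads m and b from a fitted model via lm.coef_ and lm.intercept_; this sketch instead fits the line by hand with ordinary least squares so that it runs with only the standard library. The daily sizes and the 500 GB capacity are made-up numbers.

```python
# Made-up observations: day number and database size in GB.
days = [0, 1, 2, 3, 4]
sizes_gb = [100, 110, 120, 130, 140]
capacity_gb = 500  # hypothetical drive capacity

# Ordinary least squares for a single predictor.
mean_x = sum(days) / len(days)
mean_y = sum(sizes_gb) / len(sizes_gb)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, sizes_gb))
sxx = sum((x - mean_x) ** 2 for x in days)
m = sxy / sxx            # slope: growth in GB per day
b = mean_y - m * mean_x  # intercept: fitted size on day 0

# Solve y = m*x + b for x, with y set to the drive capacity.
day_at_capacity = (capacity_gb - b) / m

print(m, b, day_at_capacity)
```

With this toy data the fitted growth is 10 GB per day from a 100 GB start, so the 500 GB drive fills on day 40. The estimate is only as good as the assumption that growth stays linear.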

Exploratory Analysis of Earthquake Data

Giorgio Garziano walks us through an earthquake data set:

Boxplots for each quantitative variable are shown. We take advantage of the quantitative variable names (quantitative_vars) determined before to apply a boxplot function based on the ggplot2 package. The Y axis labeling and title are determined by the variable to be plotted. Further, the legend is not displayed, and we adopt the coordinate flip option for improved readability.

Check it out to get an idea of how to do exploratory data analysis.

Sentiment Analysis with Spark on Qubole

Jonathan Day, et al., have a tutorial on using Qubole to build a sentiment analysis model:

This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.

In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.

Click through for the demo.
