We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas do programmers find interesting about vectors, what concepts do they consider safe starting points, and how to condense and present the material.
Click through for the links, one with Python examples and the other with R examples.
TF-IDF was invented for document search and information retrieval. It works by increasing proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don’t mean much to that document in particular.
However, if the word Bug appears many times in a document, while not appearing many times in others, it probably means that it’s very relevant. For example, if what we’re doing is trying to find out which topics some NPS responses belong to, the word Bug would probably end up being tied to the topic Reliability, since most responses containing that word would be about that topic.
This makes the technique useful for natural language processing, especially in classification problems.
Sentiment analysis is a set of Natural Language Processing (NLP) techniques that takes a text (in more academic circles, a document) written in natural language and extracts the opinions present in the text.
In a more practical sense, our objective here is to take a text and produce a label (or labels) that summarizes the sentiment of this text, e.g. positive, neutral, and negative.
For example, if we were dealing with hotel reviews, we would want the sentence ‘The staff were lovely‘ to be labeled as Positive, and the sentence ‘The shared bathroom was absolutely disgusting‘ labeled as Negative.
Click through for a demo.
The benefits of using groupdata2 to create the folds are 1) that it allows us to balance the ratios of our output classes (or simply a categorical column, if we are working with linear regression instead of classification), and 2) that it allows us to keep all observations with a specific ID (e.g. participant/user ID) in the same fold to avoid leakage between the folds.
The benefit of cvms is that it trains all the models and outputs a tibble (data frame) with results, predictions, model coefficients, and other sweet stuff, which is easy to add to a report or do further analyses on. It even allows us to cross-validate multiple model formulas at once to quickly compare them and select the best model.
Ludvig also gives us some examples of how both packages can help you out. H/T R-Bloggers
To understand relationship or dependencies among categorical variables, we take advantage of various types of tables and graphical methods. Also stratifying variables can be encompassed in order to highlight if the relationship between two primary variables is the same or different for all levels of the stratifying variable under consideration.
The contingency table are said to be of one-way flavor when involving just one categorical variable. They are said two-way when involving two categorical variables, and so on (N-way).
Read on for various techniques for data analysis against categorical variables.
If you want to determine the optimal number of clusters in your analysis, you’re faced with an overwhelming number of (mostly subjective) choices. Note that there’s no “best” method, no “correct” k, and there isn’t even a consensus as to the definition of what a “cluster” is. With that said, this picture focuses on three popular methods that should fit almost every need: Silhouette, Elbow, and Gap Statistic.
Click through for the picture and references.
Utilizing the equation for a line, instead of solving for y we will solve for x, where:
– x corresponds to the day we will hit capacity based on current growth rate
– y corresponds to drive capacity in GB
– m is the slope of our regression line, provided by the model via lm.coef_
– b is the intercept of the regression line, also provided by the model via lm.intercept_
Click through for an example. This is one of the areas where DBAs can gain a lot by learning a bit of data science.
Boxplots for each quantitative variables are shown. We take advantage of the quantitative variable names (quantitative_vars) determined before to apply a ggplot2 package based boxplot function. The Y axis labeling and title are determined by the variable to be plot. Further, legend is not displayed and we adopt the coordinate flip option for improved readability.
Check it out to get an idea of how to do exploratory data analysis.
This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.
In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.
Click through for the demo.
MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX. It is a machine learning framework built from the ground up to be massively scalable and operate within Spark. This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data. You can read more about Spark MLlib here.
In order to leverage Spark MLlib, we obviously need a way to execute Spark code. In our minds, there’s no better tool for this than Azure Databricks. In the previous post, we covered the creation of an Azure Databricks environment. We’re going to reuse that environment for this post as well. We’ll also use the same dataset that we’ve been using, which contains information about individual customers. This dataset was originally designed to predict Income based on a number of factors. However, we left the income out of this dataset a few posts back for reasons that were important then. So, we’re actually going to use this dataset to predict “Hours Per Week” instead.
Check it out. And Brad’s not joking when he says the resulting model is terrible. But that’s okay, because it was never about the model.