Boxplots for each quantitative variable are shown. We take advantage of the quantitative variable names (quantitative_vars) determined earlier to apply a boxplot function based on the ggplot2 package. The Y-axis label and title are determined by the variable being plotted. Further, the legend is not displayed, and we flip the coordinates for improved readability.
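The original post draws these plots with ggplot2 in R. As a rough illustration of the same loop-over-variables idea, here is a minimal Python sketch that picks out the quantitative columns and computes the five numbers behind each box. The dataset, the column names, and the box_stats helper are all hypothetical, and the min and max here are raw extremes rather than whisker ends.

```python
from statistics import quantiles

def box_stats(values):
    """Five-number summary for one variable (hypothetical helper)."""
    # method="inclusive" treats the data as the whole population
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

# Hypothetical dataset: two quantitative columns and one categorical column
data = {
    "age": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "income": [10.0, 20.0, 30.0, 40.0, 50.0],
    "city": ["A", "B", "A", "C", "B"],
}

# Keep only the columns whose values are all numeric
quantitative_vars = [
    name for name, column in data.items()
    if all(isinstance(v, (int, float)) for v in column)
]

# One summary per quantitative variable, keyed by variable name
summaries = {name: box_stats(data[name]) for name in quantitative_vars}

for name, stats in summaries.items():
    print(name, stats)
```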
Check it out to get an idea of how to do exploratory data analysis.
This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.
In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.
Click through for the demo.
MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX. It is a machine learning framework built from the ground up to be massively scalable and operate within Spark. This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data. You can read more about Spark MLlib here.
In order to leverage Spark MLlib, we obviously need a way to execute Spark code. In our minds, there’s no better tool for this than Azure Databricks. In the previous post, we covered the creation of an Azure Databricks environment. We’re going to reuse that environment for this post as well. We’ll also use the same dataset that we’ve been using, which contains information about individual customers. This dataset was originally designed to predict Income based on a number of factors. However, we left the income out of this dataset a few posts back for reasons that were important then. So, we’re actually going to use this dataset to predict “Hours Per Week” instead.
Check it out. And Brad’s not joking when he says the resulting model is terrible. But that’s okay, because it was never about the model.
This post will be a pretty short one. In my talk, I don’t have any demos, mostly because much of cohort analysis has secretly been time series analysis at the same time. Instead, I’ll lob out a few points and call it a day.
Time series analysis, at its core, is all about how your data changes over time. The grain for time series analysis is important: as we saw in the last post, we were able to get an excellent result at the yearly level when regressing number of active buses versus number of line items.
Spoilers: it’s not as short as I thought it would be.
Naive Bayes is a Supervised Machine Learning algorithm based on Bayes’ Theorem that is used to solve classification problems by following a probabilistic approach. It is based on the idea that the predictor variables in a Machine Learning model are independent of each other given the class, meaning that the outcome of a model depends on a set of independent variables that have nothing to do with each other.
Naive Bayes is one of the simplest algorithms available and yet it works pretty well most of the time. It’s almost never the best solution but it’s typically good enough to give you an idea of whether you can get a job done.
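To make the independence assumption concrete, here is a minimal sketch of a categorical Naive Bayes classifier in plain Python. It is not the code from the linked post. The function names, the add-one (Laplace) smoothing, and the tiny weather dataset are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(rows, labels):
    """Count how often each class and each (feature, value) pair occurs."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values

def predict_naive_bayes(model, row):
    """Return the class with the highest log posterior score."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = log(count / total)  # log prior
        for i, value in enumerate(row):
            # Independence assumption: each feature contributes its own
            # term, with add-one smoothing so unseen values are not zero.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score += log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical data: (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot"), ("overcast", "cool")]
labels = ["no", "no", "yes", "yes", "yes", "yes"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("sunny", "hot")))
print(predict_naive_bayes(model, ("rainy", "cool")))
```

Working in log space avoids multiplying many small probabilities together, which is a common design choice but not the only one.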
In the last post, we focused on high-level aggregates to gain a basic understanding of our data. We saw some suspicious results but couldn’t say much more than “This looks weird” due to our level of aggregation. In this post, I want to dig into data at a lower level of detail. My working conception is the cohort, a broad-based comparison of data sliced by some business-relevant or analysis-relevant component.
Those familiar with Kimball-style data warehousing already understand where I’m going with this. In the basic analysis, we essentially look at fact data with a little bit of disaggregation, such as looking at data by year. In this analysis, we introduce dimensions (sort of) and slice our data by dimensions.
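As a rough sketch of what slicing fact data by a dimension looks like, the Python below groups a handful of line items by one attribute and summarizes a measure for each group. The line items, the vendor names, and the slice_by helper are hypothetical and are not taken from the post's data set.

```python
from collections import defaultdict
from statistics import mean

def slice_by(rows, dimension, measure):
    """Group rows by one dimension and summarize one measure per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {
        key: {"count": len(values), "total": sum(values), "mean": mean(values)}
        for key, values in groups.items()
    }

# Hypothetical fact rows: one row per invoice line item
line_items = [
    {"year": 2017, "vendor": "Vendor A", "amount": 100.0},
    {"year": 2017, "vendor": "Vendor B", "amount": 250.0},
    {"year": 2018, "vendor": "Vendor A", "amount": 120.0},
    {"year": 2018, "vendor": "Vendor B", "amount": 900.0},
    {"year": 2018, "vendor": "Vendor B", "amount": 950.0},
]

by_year = slice_by(line_items, "year", "amount")      # the basic, yearly view
by_vendor = slice_by(line_items, "vendor", "amount")  # the cohort view

print(by_year)
print(by_vendor)
```

In this made-up data the yearly totals only show that 2018 was bigger, while the vendor slice shows where the increase came from.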
Click through for some fraud-finding fun.
Bayes’ Theorem is a way to calculate conditional probability. The formula is very simple to calculate, but it can be challenging to fit the right pieces into the puzzle. The first challenge comes from defining your event (A) and test (B); the second challenge is rephrasing your question so that you can work backwards: turning P(A|B) into P(B|A). The following image shows a basic example involving website traffic. For simpler examples, see: Bayes Theorem Problems.
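Here is a minimal sketch of the working-backwards step in Python: given P(B|A), P(A), and P(B|not A), it returns P(A|B). The website traffic numbers are hypothetical and are not the ones in the linked image.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' Theorem."""
    # Total probability of B, summed over A and not-A
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b

# Hypothetical numbers:
#   A = visitor makes a purchase, P(A) = 0.10
#   B = visitor viewed the pricing page
#   P(B|A) = 0.60, P(B|not A) = 0.20
posterior = bayes_posterior(prior_a=0.10, p_b_given_a=0.60, p_b_given_not_a=0.20)
print(round(posterior, 3))  # prints 0.25
```

With these made-up inputs, seeing a pricing page view raises the chance of a purchase from 10% to 25%.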
Click through for the image and related links.
Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.
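As a small sketch of that arithmetic, the Python below computes net margin per year and year-over-year growth for revenue and cost. The annual figures are hypothetical and exist only to show the calculation.

```python
def net_margin(revenue, cost):
    """Share of revenue left after costs."""
    return (revenue - cost) / revenue

def yoy_growth(previous, current):
    """Fractional change from one year to the next."""
    return (current - previous) / previous

# Hypothetical annual figures: year -> (revenue, cost)
annual = {
    2016: (1000.0, 800.0),
    2017: (1200.0, 820.0),
    2018: (1500.0, 850.0),
}

years = sorted(annual)
margins = {year: net_margin(*annual[year]) for year in years}

# Compare each year with the one before it
growth = {}
for prev_year, year in zip(years, years[1:]):
    prev_revenue, prev_cost = annual[prev_year]
    revenue, cost = annual[year]
    growth[year] = {
        "revenue": yoy_growth(prev_revenue, revenue),
        "cost": yoy_growth(prev_cost, cost),
    }

for year in years:
    print(year, round(margins[year], 3), growth.get(year))
```

In this made-up series revenue grows much faster than cost, so the margin widens each year, which is the healthy pattern described above.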
For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.
I get into quite a bit in this post, including a quick tour of multicollinearity, which is only my second-favorite of the three linear regression amigos (heteroskedasticity being my favorite and autocorrelation the hanger-on).
A common undertaking in applied research settings, such as some areas of psychology, is to convert raw scores into some type of standardized score, such as z-scores.
This post shows a way to accomplish that.
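For reference, a z-score is the raw score minus the mean, divided by the standard deviation. The sketch below shows that in plain Python. It is not one of the techniques from the linked post, the raw scores are hypothetical, and it assumes the sample standard deviation (n - 1 in the denominator).

```python
from statistics import mean, stdev

def z_scores(raw_scores):
    """Convert raw scores to z-scores using the sample standard deviation."""
    m = mean(raw_scores)
    s = stdev(raw_scores)  # sample standard deviation, n - 1 denominator
    return [(x - m) / s for x in raw_scores]

raw = [12, 15, 9, 20, 14]  # hypothetical raw test scores
z = z_scores(raw)
print([round(value, 2) for value in z])
```

The resulting scores have a mean of 0 and a standard deviation of 1, which makes them comparable across tests with different scales.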
Read on for three techniques.
K Nearest Neighbors is a classification algorithm that operates on a very simple principle. It is best shown through example! Imagine we had some data on Dogs and Horses, with heights and weights.
Training algorithm:
1. Store all the data
Prediction algorithm:
1. Calculate the distance from x to all points in your data
2. Sort the points in your data by increasing distance from x
3. Predict the majority label of the “k” closest points
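The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the post's own code. The heights (in meters), weights (in kilograms), and the knn_predict name are hypothetical, and it assumes Euclidean distance on unscaled features.

```python
from collections import Counter
from math import dist

def knn_predict(training_data, x, k=3):
    """Predict the label of point x from its k nearest stored points."""
    # Calculate the distance from x to every stored point
    distances = [(dist(features, x), label) for features, label in training_data]
    # Sort the points by increasing distance from x
    distances.sort(key=lambda pair: pair[0])
    # Predict the majority label among the k closest points
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# "Training" is just storing the data: ((height, weight), label)
training_data = [
    ((0.5, 20.0), "dog"),
    ((0.6, 30.0), "dog"),
    ((0.4, 10.0), "dog"),
    ((1.6, 500.0), "horse"),
    ((1.5, 450.0), "horse"),
    ((1.7, 550.0), "horse"),
]

print(knn_predict(training_data, (0.55, 25.0)))   # prints dog
print(knn_predict(training_data, (1.55, 480.0)))  # prints horse
```

Because weight spans a much larger range than height here, it dominates the distance. In practice you would usually scale the features first.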