This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.
In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.
Click through for the demo.
MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX. It is a machine learning framework built from the ground up to be massively scalable and operate within Spark. This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data. You can read more about Spark MLlib here.
In order to leverage Spark MLlib, we obviously need a way to execute Spark code. In our minds, there’s no better tool for this than Azure Databricks. In the previous post, we covered the creation of an Azure Databricks environment. We’re going to reuse that environment for this post as well. We’ll also use the same dataset that we’ve been using, which contains information about individual customers. This dataset was originally designed to predict Income based on a number of factors. However, we left the income out of this dataset a few posts back for reasons that were important then. So, we’re actually going to use this dataset to predict “Hours Per Week” instead.
Check it out. And Brad’s not joking when he says the resulting model is terrible. But that’s okay, because it was never about the model.
Since 2011, Stack Overflow has been surveying their users each year to answer questions about the technologies they use, their work experience, their compensation, and their satisfaction at work. Given Stack Overflow’s place in the broader programming world, they are able to draw quite the audience for their annual surveys.
This year, nearly 90,000 developers participated in the survey! There’s a lot in this survey, and I recommend reviewing it yourself, but I wanted to surface some of the key findings that I thought were particularly relevant to data professionals here.
Stack Overflow says they will be releasing the underlying data for this survey in the coming weeks, so I hope to return to this for a deeper analysis once that’s made available. For now, let’s get into the results!
Michael’s lede involves R versus Python in terms of salaries, but for me, the top line is that functional programmers make more money. Clojure, F#, Scala, Elixir, and Erlang make the top 10 on the global list, including positions 1, 2, 4, and 5. Within the US, Scala, Clojure, Erlang, Kotlin, F#, and Elixir make the top 10, including positions 1, 2, and 4. H/T R-Bloggers
K Nearest Neighbors is a classification algorithm that operates on a very simple principle. It is best shown through example! Imagine we had some imaginary data on Dogs and Horses, with heights and weights.
1. Store all the Data
1.Calculate the distance from x to all points in your data
2. Sort the points in your data by increasing distance from x
3. Predict the majority label of the “k” closest points
We will be working with the Titanic Data Set from Kaggle. We’ll be trying to predict a classification- survival or deceased.
Let’s begin by implementing Logistic Regression in Python for classification. We’ll use a “semi-cleaned” version of the titanic data set, if you use the data set hosted directly on Kaggle, you may need to do some additional cleaning.
Click through for the demo.
TextBlob is a python library tool and extension of NLTK. It provides a simple API approach to its methods and executes a large number of NLTK functions, and it also includes the pattern library functionality. You are just at the beginning, this might be an excellent tool to learning, and we can use it in applications production those don’t require heavy performant. TextBlob libraries are similar to python strings, so we can quickly transform and play similarly we performed in python. Finally, TextBlob is used in everywhere, and it is best suitable for smaller projects.
There are several tools from which you can choose. Sandeep also gives us some Node- and Java-based tools as well.
I’ve supplied these two python scripts in my
githubrepo at the following link. Firstwe need to open the install the relevant python libraries so you’ll need to issue the below pip command in whatever command tool you use, bash or cmd Prompt
pip install azure-eventhub
Check it out if you need pub-sub in Azure.
If I type the letter a into the R Script editor, my code completion options are acts, always, and, and as. Power BI’s editor is not offering any IntelliSense options from a Python or R dictionary. Instead, it’s pulling from the text already in the editor. Note the comment in Line 1 and the inclusion of words beginning with the letter a — always, and, acts, as.
By comparison, the DAX editor contains a detailed function list and helpful annotations for code completion. Can we get something similar for R and Python? Not exactly… But there’s a workaround that I’m almost embarrassed to suggest. If you are a user who codes directly into the script editor, the following hack could be helpful. If you use the option to Edit script in External IDE, keep doing that and ignore the following guidance.
As-is, this is worse than no IntelliSense because at least with no IntelliSense, it’ll never steal a mouse click or keystroke. I wouldn’t expect RStudio level quality out of the gate but unless I’m missing something, that’s pretty bad.
Each line in the HL7 message is called a segment and then each segment is split into individual fields by | (pipe) characters (typically). HL7 fields have well-defined names and meanings … for example in the example above PID-3 (the 3rd field in the PID segment where the identifier ‘PID’ is not counted) is 12001 and that represents the patient identifier.
For this particular project I’m working on we have HL7 messages stored in a SQL Server 2016 database table where each row in the table contains the raw HL7 2.x message in a particular column. I need to be able to intelligently filter over this HL7 data by looking at values in particular HL7 fields (as shown above). Since this HL7 data is stored in a varchar(MAX) column I could certainly attempt to play games using LIKE comparisons in SQL but that would not get me very far. SQL simply does not understand the complex structure of HL7 and I have no native SQL Server functions at my disposal that I could quickly use to parse this data and filter it.
Cristian has a Jupyter Notebook which takes us through the solution. With SQL Server 2017, there’s the possibility of solving this in a stored procedure using Machine Learning Services.
Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. This allows for concise and flexible scripts but can also be the downside of Airflow; since it’s Python code there are infinite ways to define your pipelines. The Zen of Python is a list of 19 Python design principles and in this blog post I point out some of these principles on four Airflow examples. This blog was written with Airflow 1.10.2.
My favorite of the Zen of Python principles is a combination of two: “simple is better than complex; complex is better than complicated.” That’s something I don’t always get right, but it is critical for a stable architecture.