Vectors for Programmers

John Mount has a couple of videos available:

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas do programmers find interesting about vectors, what concepts do they consider safe starting points, and how to condense and present the material.

Click through for the links, one with Python examples and the other with R examples.

Sentiment Analysis with Python

Bruno Stecanella shows us how to use MonkeyLearn to perform sentiment analysis in Python:

Sentiment analysis is a set of Natural Language Processing (NLP) techniques that takes a text (in more academic circles, a document) written in natural language and extracts the opinions present in the text.

In a more practical sense, our objective here is to take a text and produce a label (or labels) that summarizes the sentiment of this text, e.g. positiveneutral, and negative.

For example, if we were dealing with hotel reviews, we would want the sentence ‘The staff were lovely‘ to be labeled as Positive, and the sentence ‘The shared bathroom was absolutely disgusting‘ labeled as Negative.

Click through for a demo.

Predicting Database Growth

James Livingston uses linear regression to plot database growth over time:

Utilizing the equation for a line, instead of solving for y we will solve for x, where:
– x corresponds to the day we will hit capacity based on current growth rate
– y corresponds to drive capacity in GB
– m is the slope of our regression line, provided by the model via lm.coef_
– b is the intercept of the regression line, also provided by the model via lm.intercept_

Click through for an example. This is one of the areas where DBAs can gain a lot by learning a bit of data science.

Sentiment Analysis with Spark on Qubole

Jonathan Day, et al, have a tutorial on using Qubole to build a sentiment analysis model:

This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.

In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.

Click through for the demo.

Running Spark MLlib to Feed Power BI

Brad Llewellyn shows how you can take Spark MLlib results and feed them into Power BI:

MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX.  It is a machine learning framework built from the ground up to be massively scalable and operate within Spark.  This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data.  You can read more about Spark MLlib here.

In order to leverage Spark MLlib, we obviously need a way to execute Spark code.  In our minds, there’s no better tool for this than Azure Databricks.  In the previous post, we covered the creation of an Azure Databricks environment.  We’re going to reuse that environment for this post as well.  We’ll also use the same dataset that we’ve been using, which contains information about individual customers.  This dataset was originally designed to predict Income based on a number of factors.  However, we left the income out of this dataset a few posts back for reasons that were important then.  So, we’re actually going to use this dataset to predict “Hours Per Week” instead.

Check it out. And Brad’s not joking when he says the resulting model is terrible. But that’s okay, because it was never about the model.

Reviewing the Stack Overflow Developer Survey

Michael Toth looks at the recently-released 2019 Stack Overflow Developer Survey:

Since 2011, Stack Overflow has been surveying their users each year to answer questions about the technologies they use, their work experience, their compensation, and their satisfaction at work. Given Stack Overflow’s place in the broader programming world, they are able to draw quite the audience for their annual surveys.

This year, nearly 90,000 developers participated in the survey! There’s a lot in this survey, and I recommend reviewing it yourself, but I wanted to surface some of the key findings that I thought were particularly relevant to data professionals here.

Stack Overflow says they will be releasing the underlying data for this survey in the coming weeks, so I hope to return to this for a deeper analysis once that’s made available. For now, let’s get into the results!

Michael’s lede involves R versus Python in terms of salaries, but for me, the top line is that functional programmers make more money. Clojure, F#, Scala, Elixir, and Erlang make the top 10 on the global list, including positions 1, 2, 4, and 5. Within the US, Scala, Clojure, Erlang, Kotlin, F#, and Elixir make the top 10, including positions 1, 2, and 4. H/T R-Bloggers

K-Nearest Neighbors in Python

Hardik Jaroli shows how to use the k-Nearest Neighbors algorithm using scikit-learn:

K Nearest Neighbors is a classification algorithm that operates on a very simple principle. It is best shown through example! Imagine we had some imaginary data on Dogs and Horses, with heights and weights.

Training Algorithm:
1. Store all the Data

Prediction Algorithm:
1.Calculate the distance from x to all points in your data
2. Sort the points in your data by increasing distance from x
3. Predict the majority label of the “k” closest points

Solving Logistic Regression Problems with Python

Hardik Jaroli shows how we can solve logistic regression problems using Python, using the Titanic data set as an example:

We will be working with the Titanic Data Set from Kaggle. We’ll be trying to predict a classification- survival or deceased.

Let’s begin by implementing Logistic Regression in Python for classification. We’ll use a “semi-cleaned” version of the titanic data set, if you use the data set hosted directly on Kaggle, you may need to do some additional cleaning.

Click through for the demo.

Python Natural Language Processing Tools

Sandeep Aspari takes us through some of the tooling available in Python around Natural Language Processing:

TextBlob is a python library tool and extension of NLTK. It provides a simple API approach to its methods and executes a large number of NLTK functions, and it also includes the pattern library functionality. You are just at the beginning, this might be an excellent tool to learning, and we can use it in applications production those don’t require heavy performant. TextBlob libraries are similar to python strings, so we can quickly transform and play similarly we performed in python. Finally, TextBlob is used in everywhere, and it is best suitable for smaller projects.

There are several tools from which you can choose. Sandeep also gives us some Node- and Java-based tools as well.

Accessing Azure Event Hubs with Python

Neil Gelder shows us how you can write Python code to work with Azure Event Hubs:

I’ve supplied these two python scripts in my github repo at the following link.

First we need to open the install the relevant python libraries so you’ll need to issue the below pip command in whatever command tool you use, bash or cmd Prompt

pip install azure-eventhub

Check it out if you need pub-sub in Azure.


May 2019
« Apr