Press "Enter" to skip to content

Category: Data Science

K-Nearest Neighbors in Python

Hardik Jaroli shows how to use the k-Nearest Neighbors algorithm using scikit-learn:

K Nearest Neighbors is a classification algorithm that operates on a very simple principle. It is best shown through example! Imagine we had some imaginary data on Dogs and Horses, with heights and weights.

Training Algorithm:
1. Store all the Data

Prediction Algorithm:
1.Calculate the distance from x to all points in your data
2. Sort the points in your data by increasing distance from x
3. Predict the majority label of the “k” closest points

Comments closed

Learning with Limited Data

Shioulin Sam and Nisha Muktewar have new research on machine learning when getting labeled data is time-consuming or difficult:

We are excited to release Learning with Limited Labeled Data, the latest report and prototype from Cloudera Fast Forward Labs.

Being able to learn with limited labeled data relaxes the stringent labeled data requirement for supervised machine learning. Our report focuses on active learning, a technique that relies on collaboration between machines and humans to label smartly.

Active learning makes it possible to build applications using a small set of labeled data, and enables enterprises to leverage their large pools of unlabeled data. In this blog post, we explore how active learning works. (For a higher level introduction, please see our previous blogpost.

The research itself is behind a paywall but you can see their write-up to get an idea of the topic.

Comments closed

Getting Started with Azure Databricks

Brad Llewellyn has a tutorial for Azure Databricks:

Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post.  The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform.  In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don’t want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.

Azure Databricks is not a Microsoft product.  It is owned and managed by the company Databricks and available in Azure and AWS.  However, Databricks is a “first party offering” in Azure.  This means that Microsoft offers the same level of support, functionality and integration as it would with any of its own products.  You can read more about Azure Databricks herehereand here.

Click through for a demonstration of the product.

Comments closed

Solving Logistic Regression Problems with Python

Hardik Jaroli shows how we can solve logistic regression problems using Python, using the Titanic data set as an example:

We will be working with the Titanic Data Set from Kaggle. We’ll be trying to predict a classification- survival or deceased.

Let’s begin by implementing Logistic Regression in Python for classification. We’ll use a “semi-cleaned” version of the titanic data set, if you use the data set hosted directly on Kaggle, you may need to do some additional cleaning.

Click through for the demo.

Comments closed

Finding an Unfair Coin with R

Sebastian Sauer works out a coin flip problem:

A stochastic problem, with application to financial theory. Some say it goes back to Warren Buffett. I relied to my colleague Norman Markgraf, who pointed it out to me.

Assume there are two coins. One is fair, one is loaded. The loaded coin has a bias of 60-40. Now, the question is: How many coin flips do you need to be “sure enough” (say, 95%) that you found the loaded coin?

Let’s simulate la chose.

It took a few more flips than I had expected but the number is not outlandish.

Comments closed

Python Natural Language Processing Tools

Sandeep Aspari takes us through some of the tooling available in Python around Natural Language Processing:

TextBlob
TextBlob is a python library tool and extension of NLTK. It provides a simple API approach to its methods and executes a large number of NLTK functions, and it also includes the pattern library functionality. You are just at the beginning, this might be an excellent tool to learning, and we can use it in applications production those don’t require heavy performant. TextBlob libraries are similar to python strings, so we can quickly transform and play similarly we performed in python. Finally, TextBlob is used in everywhere, and it is best suitable for smaller projects.

There are several tools from which you can choose. Sandeep also gives us some Node- and Java-based tools as well.

Comments closed

Residual Analysis with R

Abhijit Telang shares a few techniques for doing post-regression residual analysis using R:

Naturally, I would expect my model to be unbiased, at least in intention, and hence any leftovers on either side of the regression line that did not make it on the line are expected to be random, i.e. without any particular pattern.

That is, I expect my residual error distributions to follow a bland, normal distribution.

In R, you can do this elegantly with just two lines of code. 
1. Plot a histogram of residuals 
2. Add a Quantile-Quantile plot with a line that passes through, namely, the first and third quantiles.

There are several more techniques in here to analyze residuals, so check it out.

Comments closed

The Costs of Specialization within Data Science

Eric Colson argues in favor of data science generalists rather than specialists:

But the goal of data science is not to execute. Rather, the goal is to learn and develop profound new business capabilities. Algorithmic products and services like recommendations systemsclient engagement banditsstyle preference classificationsize matchingfashion design systemslogistics optimizersseasonal trend detection, and more can’t be designed up-front. They need to be learned. There are no blueprints to follow; these are novel capabilities with inherent uncertainty. Coefficients, models, model types, hyper parameters, all the elements you’ll need must be learned through experimentation, trial and error, and iteration. With pins, the learning and design are done up-front, before you produce them. With data science, you learn as you go, not before you go.

In the pin factory, when learning comes first we do not expect, nor do we want, the workers to improvise on any aspect the product, except to produce it more efficiently. Organizing by function makes sense since task specialization leads to process efficiencies and production consistency (no variations in the end product).

I think this article captures the downside risk of specialization, but not the downside risks of generalization: some people simply aren’t very good at some things, leading to huge amounts of technical debt down the road or failing a project due to the lack of requisite knowledge or skills. To give a personal example, I have a generalist team, but I still control the data flows (at the very least doing thorough code reviews of any database changes), my application specialist controls app architecture, my statistician reviews algorithms, etc. I don’t claim that this is the best strategy, but a group of pure generalists will have their own set of problems too.

Comments closed

Accidentally Building a Population Graph

Neil Saunders shares an example of a newspaper headline which ultimately just shows us population sizes:

Some poking around in the NSW Transport Open Data portal reveals how many people enter every Sydney train station on a “typical” day in 2016, 2017 and 2018. We could manipulate those numbers in various ways to estimate total, unique passengers for FY 2017-18 but I’m going to argue that the value as-is serves as a proxy variable for “station busyness”.

When working with spatial data cases, it’s important to differentiate an effect you see because it’s actually unique or interesting versus an effect you see because that’s where all of the people are.

Comments closed