# R Versus Python

2017-08-31

We use the app in question to compare search interest for R data Science versus Python Data Science, see above chart.  It looks like until December 2016, R dominated, but fell below Python by early 2017. The above chart displays an interest index, 100 being maximum and 0 being minimum. Click here to access this interactive chart on Google, and check the results for countries other than US, or even for specific regions such as California or New York.

Note that Python always dominated R by a long shot, because it is a general-purpose language, while R is a specialized language. But here, we compare R and Python in the niche context of data science. The map below shows interest for Python (general purpose) per region, using the same Google index in question.

It’s an interesting look at the relative shift between R and Python as a primary language for statistical analysis.

# Classifying Time Series Data With TensorFlow

2017-08-25

In this blog post, I will discuss the use of deep leaning methods to classify time-series data, without the need to manually engineer features. The example I will consider is the classic Human Activity Recognition (HAR) dataset from the UCI repository. The dataset contains the raw time-series data, as well as a pre-processed one with 561 engineered features. I will compare the performance of typical machine learning algorithms which use engineered features with two deep learning methods (convolutional and recurrent neural networks) and show that deep learning can surpass the performance of the former.

I have used Tensorflow for the implementation and training of the models discussed in this post.  In the discussion below, code snippets are provided to explain the implementation. For the complete code, please see my Github repository.

Click through for the samples, or check out the repo, linked above.

# Using NLP To Find Similar Facebook Posts

2017-08-22

We are going to represent the content of a Facebook post using word embeddings and comparing the transformed posts using word mover’s distance. The combination of both have shown lower k-nearest neighbor-document classification error rates compared to other state of the art techniques.

The advantage of word embeddings is that the words which have similar meanings but don’t have any letters in common will still have similar vectors (be close) in the embedded space (e.g. lion and tiger).

There’s a good high-level discussion of techniques in this post.

# TensorFlow On The Pi

2017-08-21

It’s never been easy to get TensorFlow installed on a Pi though. I had created a makefile script that let you build the C++ part from scratch, but it took several hours to complete and didn’t support Python. Sam Abrahams, an external contributor, did an amazing job maintaining a Python pip wheel for major releases, but building it required you to add swap space on a USB device for your Pi, and took even longer to compile than the makefile approach. Snips managed to get TensorFlow cross-compiling for Rust, but it wasn’t clear how to apply this to other languages.

Plenty of people on the team are Pi enthusiasts, and happily Eugene Brevdo dived in to investigate how we could improve the situation. We knew we wanted to have something that could be run as part of TensorFlow’s Jenkins continuous integration system, which meant building a completely automatic solution that would run with no user intervention. Since having a Pi plugged into a machine to run something like the makefile build would be hard to maintain, we did try using a hosted server from Mythic Beasts. Eugene got the makefile built going after a few hiccups, but the Python version required more RAM than was available, and we couldn’t plug in a USB drive remotely!

Read the whole thing, even if for the science experiment aspect.

# Recursion In Python

2017-08-15

Recursion is a topic in mathematics and computer science. In computer programming languages, the term recursion refers to a function that calls itself. Another way of putting it would be a function definition that includes the function itself in its definition. One of the first warnings I received when my computer science professor talked about recursion was that you can accidentally create an infinite loop that will make your application hang. This can happen because when you use recursion, your function may end up invoking itself infinitely. So, as with any other potential infinite loop, you need to make sure you have a way to break out of the loop. The idea in most recursive functions is to break up the procedure being done into smaller pieces that we can still process with the same function.

Read on for a couple quick recursion scenarios.

# Diamond: Solving Generalized Linear Models Using Python

2017-08-11

Tim Sweester and Aaron Bradley announce Diamond, a Python library which solves certain kinds of generalized linear models.  In a two-part series, they explain more.  Part 1 covers the mathematical principles behind it:

Many computational problems in data science and statistics can be cast as convex problems. There are many advantages to doing so:

• Convex problems have a unique global solution, i.e. there is one best answer
• There are well-known, efficient, and reliable algorithms for finding it

One ubiquitous example of a convex problem in data science is finding the coefficients of an L2L2-regularized logistic regression model using maximum likelihood. In this post, we’ll talk about some basic algorithms for convex optimization, and discuss our attempts to make them scale up to the size of our models. Unlike many applications, the “scale” challenge we faced was not the number of observations, but the number of features in our datasets. First, let’s review the model we want to fit.

In this example, GLMMs allow you to pool information across different brands, while still learning individual effects for each brand. It breaks the problem into sets of fixed and random effects. The fixed effects are similar to what you would find in a traditional logistic regression model, while the random effects allow the regression relationship to vary for each brand. One of the advantages of GLMMs is that they learn how different brands are from each other. Brands that are very similar to the overall average will have small random effect estimates. Because of the regularization of these models, brands with few observations will also have small random effect estimates, and be treated more like the overall average. In contrast, for brands that are very different from the average, with lots of data to support that, GLMMs will learn large random effect estimates.

Check it out.  Part 2 also contains a link to the GitHub repo if you want to try it on your own.

# Installing Python Support In SQL Server

2017-07-13

The process for using Python in SQL Server is very similar to the previous process of installing R.  Microsoft renamed R Services to Machine Learning Services, and now allows both R and Python to be installed, as shown in the screen.  Microsoft’s version of Python uses Anaconda, which is an open source analytics platform created by Continuum. This is where Python differs from other open source languages, as Continuum is providing the version of Python as it contains data science components which are not included in the standard distribution of Python. Continuum also sells an enterprise version of Anaconda, with of course more features than come with the free version. It is important to remember the python environment as you will need select the same distribution when running Python code outside of SQL Server.

Read on to see how to install Python support in SQL Server 2017 and for a few links to tools.

# Building Graph Tables

2017-06-29

To put the graph database to the test, I took bunch of emails from a particular MVP SQL Server distribution list (content will not be shown and all the names will be anonymized). On my gmail account, I have downloaded some 90MiB of emails in mbox file format. With some python scripting,  only FROM and SUBJECTS were extracted:

writer.writerow(['from','subject'])
for index, message in enumerate(mailbox.mbox(infile)): content = get_content(message) row = [ message['from'].strip('>').split('<')[-1], decode_header(message['subject'])[0][0],"|" ] writer.writerow(row)

This post walks you through loading data, mostly.  But at the end, you can see how easy it is to find who replied to whose e-mails.

# Using Python In SQL Server 2017

2017-06-22

Python supports a limited number of data types in comparison to SQL Server. As a result, whenever you use data from SQL Server in Python scripts, the data might be implicitly converted to a type compatible with Python. However, often an exact conversion cannot be performed automatically, and an error is returned. This table lists the implicit conversions that are provided. Other data types are not supported.

This article will get you started, and from there, the wide world of Anaconda awaits you.

# Random Forests In scikit-learn

2017-06-20

As I mentioned in a blog post a couple of weeks ago, I’ve been playing around with the Kaggle House Prices competition and the most recent thing I tried was training a random forest regressor.

Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.

I wasn’t really sure how to work out if that theory was true or not, but by chance, I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!

There’s a nagging voice in my head saying “Principal Component Analysis” as I read this post.

September 2017
MTWTFSS
« Aug
123
45678910
11121314151617
18192021222324
252627282930