I’m starting to experiment with Python scripts in SQL Server 2017 using Machine Learning Services (In-Database). The problem is, I don’t know Python. If I run into a Python error, the output I get from SSMS is not looking too helpful. My instincts tell me I’ll be much better off developing and debugging Python code from a development tool. What I settled on was to use Visual Studio along with the Python interpreter that comes with SQL Server 2017 Machine Learning Services. I ran into a few issues that I’ll review here.
The first thing I did was Install Python support in Visual Studio on Windows. This article from Microsoft was simple enough. It worked for me with Visual Studio Community 2015. I quickly created a “PythonApplication1” project and tried Hello World. But I got an error telling me Visual Studio couldn’t find any interpreters.
Click through to read more. With Visual Studio 2017, it’s a bit easier to get started: select the Data Science pack on installation and you’ll get both Python and R support out of the box.
Indeed the points do bounce all over the unit interval, though they more often bounce near one of the ends.
Does that distribution look familiar? You might recognize it from Bayesian statistics. It’s a beta distribution. It’s symmetric, so the two beta distribution parameters are equal. There’s a vertical asymptote on each end, so the parameters are less than 1. In fact, it’s a beta(1/2, 1/2) distribution. It comes up, for example, as the Jeffreys prior for Bernoulli trials.
The graph below adds the beta(1/2, 1/2) density to the histogram to show how well it fits.
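The properties described above can be checked numerically. This is a minimal sketch, not code from the linked post: it uses the closed form of the beta(1/2, 1/2) density, 1 / (π √(x(1 − x))), which follows from B(1/2, 1/2) = π. The function name is made up for illustration.

```python
import math


def beta_half_half_pdf(x: float) -> float:
    """Density of the beta(1/2, 1/2) distribution on the open interval (0, 1)."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    # The normalizing constant is B(1/2, 1/2) = Gamma(1/2)^2 / Gamma(1) = pi.
    return 1.0 / (math.pi * math.sqrt(x * (1.0 - x)))


# The density is symmetric about 1/2 and grows without bound near 0 and 1.
for x in (0.001, 0.1, 0.5, 0.9, 0.999):
    print(f"x = {x:<6} density = {beta_half_half_pdf(x):.4f}")
```

The values at 0.1 and 0.9 agree, which is the symmetry, and the values near either end are much larger than the value at 0.5, which is the asymptote behavior.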
It’s an interesting bit of math and statistics, and John provides some Python demo code at the end.
Installing TensorFlow with GPU support requires an NVIDIA GPU. AMD video cards are not supported by TensorFlow. NVIDIA uses a low-level GPU computing system called CUDA, which is proprietary NVIDIA software.
One can go the OpenCL way with AMD, but as of now it won’t work with TensorFlow.
Also, not all NVIDIA devices are supported. Here is a list from the NVIDIA documentation listing the supported GPUs.
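Once the install is done, a quick way to confirm that TensorFlow can see a supported card is to ask it for its GPU devices. This is a rough sketch and not from the linked post: it assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available, and the helper name `visible_gpus` is made up for illustration.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see (empty if none)."""
    try:
        # Assumption: TensorFlow 2.1+ is installed; older releases expose
        # device listing through different calls.
        import tensorflow as tf
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus:
    print("TensorFlow can see these GPUs:", gpus)
else:
    print("No usable GPU found (or TensorFlow is not installed).")
```

An empty list on a machine with an NVIDIA card usually points at the driver, CUDA, or cuDNN setup rather than at TensorFlow itself.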
By the end of it, Vivek also shows us a simple trained model.
We use the app in question to compare search interest for R data science versus Python data science (see the chart above). It looks like R dominated until December 2016, but fell below Python by early 2017. The chart displays an interest index, with 100 being the maximum and 0 the minimum. Click here to access this interactive chart on Google, and check the results for countries other than the US, or even for specific regions such as California or New York.
Note that Python always dominated R by a long shot, because it is a general-purpose language, while R is a specialized language. But here, we compare R and Python in the niche context of data science. The map below shows interest for Python (general purpose) per region, using the same Google index in question.
It’s an interesting look at the relative shift between R and Python as a primary language for statistical analysis.
In this blog post, I will discuss the use of deep learning methods to classify time-series data, without the need to manually engineer features. The example I will consider is the classic Human Activity Recognition (HAR) dataset from the UCI repository. The dataset contains the raw time-series data, as well as a pre-processed one with 561 engineered features. I will compare the performance of typical machine learning algorithms which use engineered features with two deep learning methods (convolutional and recurrent neural networks) and show that deep learning can surpass the performance of the former.
I have used Tensorflow for the implementation and training of the models discussed in this post. In the discussion below, code snippets are provided to explain the implementation. For the complete code, please see my Github repository.
Click through for the samples, or check out the repo, linked above.
We are going to represent the content of a Facebook post using word embeddings and compare the transformed posts using word mover’s distance. The combination of the two has shown lower k-nearest-neighbor document classification error rates compared to other state-of-the-art techniques.
The advantage of word embeddings is that the words which have similar meanings but don’t have any letters in common will still have similar vectors (be close) in the embedded space (e.g. lion and tiger).
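To make the "close in the embedded space" idea concrete, here is a toy sketch. The three-dimensional vectors are invented for illustration (real embeddings are learned from text and have far more dimensions), and it uses plain cosine similarity between single words, not the word mover's distance the post applies to whole documents.

```python
import math

# Hypothetical embeddings, made up for this example only.
embeddings = {
    "lion": [0.90, 0.80, 0.10],
    "tiger": [0.85, 0.75, 0.20],
    "spreadsheet": [0.10, 0.05, 0.95],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# "lion" and "tiger" share no letters, yet their vectors point the same way.
print("lion vs tiger:      ", cosine_similarity(embeddings["lion"], embeddings["tiger"]))
print("lion vs spreadsheet:", cosine_similarity(embeddings["lion"], embeddings["spreadsheet"]))
```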
There’s a good high-level discussion of techniques in this post.
It’s never been easy to get TensorFlow installed on a Pi though. I had created a makefile script that let you build the C++ part from scratch, but it took several hours to complete and didn’t support Python. Sam Abrahams, an external contributor, did an amazing job maintaining a Python pip wheel for major releases, but building it required you to add swap space on a USB device for your Pi, and took even longer to compile than the makefile approach. Snips managed to get TensorFlow cross-compiling for Rust, but it wasn’t clear how to apply this to other languages.
Plenty of people on the team are Pi enthusiasts, and happily Eugene Brevdo dived in to investigate how we could improve the situation. We knew we wanted to have something that could be run as part of TensorFlow’s Jenkins continuous integration system, which meant building a completely automatic solution that would run with no user intervention. Since having a Pi plugged into a machine to run something like the makefile build would be hard to maintain, we did try using a hosted server from Mythic Beasts. Eugene got the makefile build going after a few hiccups, but the Python version required more RAM than was available, and we couldn’t plug in a USB drive remotely!
Read the whole thing, even if for the science experiment aspect.
Recursion is a topic in mathematics and computer science. In computer programming languages, the term recursion refers to a function that calls itself. Another way of putting it would be a function definition that includes the function itself in its definition. One of the first warnings I received when my computer science professor talked about recursion was that you can accidentally create an infinite loop that will make your application hang. This can happen because when you use recursion, your function may end up invoking itself infinitely. So, as with any other potential infinite loop, you need to make sure you have a way to break out of the loop. The idea in most recursive functions is to break up the procedure being done into smaller pieces that we can still process with the same function.
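As a small illustration of the points above (this example is not taken from the linked post), here is a recursive factorial. The base case is what breaks out of the "loop", and each call hands a smaller piece of the problem back to the same function.

```python
def factorial(n: int) -> int:
    """Return n! for a non-negative integer n, computed recursively."""
    if n < 0:
        # Without this guard, a negative argument would never reach the base case.
        raise ValueError("n must be non-negative")
    if n <= 1:
        # Base case: stops the recursion.
        return 1
    # Recursive case: the same function handles the smaller problem n - 1.
    return n * factorial(n - 1)


print(factorial(5))  # 5 * 4 * 3 * 2 * 1 = 120
```

Python raises a `RecursionError` once the call stack gets too deep rather than hanging forever, but a missing base case is still a bug.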
Read on for a couple quick recursion scenarios.
Tim Sweetser and Aaron Bradley announce Diamond, a Python library which solves certain kinds of generalized linear models. In a two-part series, they explain more. Part 1 covers the mathematical principles behind it:
Many computational problems in data science and statistics can be cast as convex problems. There are many advantages to doing so:
- Convex problems have a unique global solution, i.e. there is one best answer
- There are well-known, efficient, and reliable algorithms for finding it
One ubiquitous example of a convex problem in data science is finding the coefficients of an L2-regularized logistic regression model using maximum likelihood. In this post, we’ll talk about some basic algorithms for convex optimization, and discuss our attempts to make them scale up to the size of our models. Unlike many applications, the “scale” challenge we faced was not the number of observations, but the number of features in our datasets. First, let’s review the model we want to fit.
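For a feel of what fitting such a model involves, here is a minimal sketch of regularized logistic regression. It is not Diamond's algorithm: it assumes an L2 (ridge) penalty, uses plain gradient descent on a four-row toy dataset, and every name and hyperparameter in it is hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l2_logistic(X, y, lam=0.1, lr=0.1, steps=2000):
    """Minimize mean log-loss + (lam / 2) * ||w||^2 by gradient descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(steps):
        # Gradient of the penalty term (assumption: every weight is penalized).
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(p):
                grad[j] += err * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy data: the first column is a constant intercept term.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 0, 1, 1]
w = fit_l2_logistic(X, y)
print("fitted weights:", w)
```

Because the objective is convex, gradient descent heads toward the single best answer regardless of the starting point; the penalty also keeps the weights finite even though this toy data is perfectly separable.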
In this example, GLMMs allow you to pool information across different brands, while still learning individual effects for each brand. It breaks the problem into sets of fixed and random effects. The fixed effects are similar to what you would find in a traditional logistic regression model, while the random effects allow the regression relationship to vary for each brand. One of the advantages of GLMMs is that they learn how different brands are from each other. Brands that are very similar to the overall average will have small random effect estimates. Because of the regularization of these models, brands with few observations will also have small random effect estimates, and be treated more like the overall average. In contrast, for brands that are very different from the average, with lots of data to support that, GLMMs will learn large random effect estimates.
Check it out. Part 2 also contains a link to the GitHub repo if you want to try it on your own.
The process for using Python in SQL Server is very similar to the previous process of installing R. Microsoft renamed R Services to Machine Learning Services, and now allows both R and Python to be installed, as shown in the screen. Microsoft’s version of Python uses Anaconda, which is an open source analytics platform created by Continuum. This is where Python differs from other open source languages: Continuum provides a version of Python that contains data science components which are not included in the standard distribution of Python. Continuum also sells an enterprise version of Anaconda, with of course more features than come with the free version. It is important to remember the Python environment, as you will need to select the same distribution when running Python code outside of SQL Server.
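One way to check which distribution a given tool is actually running is to ask the interpreter itself. The sketch below uses only the standard library; the helper name is made up, and the Anaconda check is a rough heuristic based on the version banner, not an official test.

```python
import platform
import sys


def interpreter_summary() -> dict:
    """Describe the Python interpreter that is running this code."""
    banner = sys.version.lower()
    return {
        # Full path to the running interpreter, useful for comparing tools.
        "executable": sys.executable,
        "version": platform.python_version(),
        # Heuristic only: Anaconda builds often mention these words in the banner.
        "looks_like_anaconda": any(
            word in banner for word in ("anaconda", "continuum", "conda")
        ),
    }


for key, value in interpreter_summary().items():
    print(f"{key}: {value}")
```

Running the same snippet inside SQL Server and in an external editor and comparing the `executable` paths shows whether both are using the same installation.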
Read on to see how to install Python support in SQL Server 2017 and for a few links to tools.