One of the nifty things about using R is that you can use it for many different purposes and even other languages!
If you want to use Python in your knitr docs or the newish RStudio R notebook functionality, you might encounter some fiddliness getting all the moving parts running on Windows. This is a quick knitr Python Windows setup checklist to make sure you don’t miss any important steps.
Between knitr, Zeppelin, and Jupyter, you should be able to find a cross-compatible notebook which works for you.
Within Machine Learning many tasks are – or can be reformulated as – classification tasks.
In classification tasks we are trying to produce a model which captures the relationship between the input data and the class each input belongs to. This model is built from the feature values of the input data. For example, suppose the dataset contains data points belonging to the classes Apples, Pears and Oranges; based on the features of the data points (weight, color, size, etc.) we are trying to predict the class.
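To make the fruit example concrete, here is a minimal sketch of one very simple classifier, a nearest-centroid model, written with only the standard library. It is not taken from the linked post; the function names and the weight and diameter values are made up for illustration.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples, labels):
    """Average the feature vectors of each class to get one centroid per class."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the class whose centroid is closest (Euclidean) to the features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical (weight in grams, diameter in cm) measurements.
samples = [(150, 7.0), (160, 7.5), (180, 6.0), (190, 6.5), (130, 7.8), (140, 8.0)]
labels = ["Apple", "Apple", "Pear", "Pear", "Orange", "Orange"]

centroids = fit_centroids(samples, labels)
print(predict(centroids, (185, 6.2)))
```

In practice the features would need scaling, since weight dominates the distance here, and a library such as scikit-learn would do the heavy lifting.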
Ahmet has his entire post saved as a Jupyter notebook.
Programming is one of the five main competence areas at the base of the skill set for a Data Scientist, even if it is not the most relevant in terms of expertise (see What is the right mix of competences for Data Scientists?). Considering the results of the survey, which has involved more than 200 Data Scientists worldwide to date, there isn’t a prevailing choice among the programming languages used in data science activities. However, the choice appears to be limited mainly to a small set of alternatives: almost 96% of respondents say they use at least one of R, SQL or Python.
These results don’t surprise me much. R has slightly more traction than Python, but the percentage of people using both is likely to increase. SQL, meanwhile, is vital for getting data, and as we’re seeing in the Hadoop space, as data platform products get more mature, they tend to gravitate toward a SQL or SQL-like language. Cf. Hive, Spark SQL, Phoenix, etc.
However, there were other errors, which I suspect are related to Python 2.7 vs. Python 3.5. Rather than solve those, I went on to the columnstore demo. In this, you create a table with 5 million rows and then run a query against it from Python. I did that, then created the columnstore index, then ran it again. The results are below.
And within an hour or so of starting, Steve has hit the 2.x vs 3.x mess in Python.
This week, we are announcing even more support for Python. As of today, Python is a first-class language supported by our management SDKs. This enables you to develop applications or automate the Data Lake services. Check out our Getting Started articles, which now include many Python samples.
Saveen has a Jupyter notebook which demonstrates Python in Azure Data Lake Store.
For this section we previously installed the Python module pyodbc, which is needed to connect via ODBC to any SQL Server on the network, given the proper authentication method.
The following sample code can be found at this link: https://www.microsoft.com/en-us/sql-server/developer-get-started/python-ubuntu
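For a rough idea of what a pyodbc connection looks like, here is a minimal sketch. It is not the sample from the link above: the helper names are hypothetical, and the driver name, server, and credentials are assumptions you would replace with your own.

```python
def build_connection_string(server, database, user, password,
                            driver="ODBC Driver 17 for SQL Server"):
    """Assemble an ODBC connection string for SQL Server authentication.

    The default driver name is an assumption; use whichever ODBC driver
    is actually installed on your machine.
    """
    parts = {
        "DRIVER": "{" + driver + "}",
        "SERVER": server,
        "DATABASE": database,
        "UID": user,
        "PWD": password,
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


def fetch_server_version(connection_string):
    """Connect, run a trivial query, and return the server version text."""
    import pyodbc  # third-party module; imported here so the builder works without it

    connection = pyodbc.connect(connection_string)
    try:
        cursor = connection.cursor()
        cursor.execute("SELECT @@VERSION")
        return cursor.fetchone()[0]
    finally:
        connection.close()


# Placeholder values only; nothing connects until fetch_server_version is called.
connection_string = build_connection_string("localhost", "master", "sa", "secret")
print(connection_string)
```

Calling fetch_server_version(connection_string) against a reachable instance should return the version banner, assuming the driver and credentials are valid.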
This is probably more useful in larger shops with multiple operations personnel covering different domains, but it’s nice to know that both languages play nice.
Every week, someone on Reddit posts a “word cloud” on each of the NFL teams’ subreddits. These word clouds show the most used words on that subreddit for the week (the larger the word, the more it was used). These word plots are always really fascinating to me, so I wanted to try to make some for myself. In this tutorial, we’ll be making the following word cloud from my board game stats Twitter feed, @BGGStats.
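The counting step behind any word cloud can be sketched in a few lines. This is only an illustration of the idea, not the tutorial's code: the stop-word list is a tiny stand-in, and the sample tweets are made up.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}


def word_frequencies(texts, top_n=5):
    """Count non-stop-words across all texts and return the most common ones."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up tweets standing in for a real feed.
tweets = [
    "Catan is the most played game of the week",
    "Most played game on the site: Catan",
    "Pandemic climbs the played list",
]

for word, count in word_frequencies(tweets):
    print(word, count)
```

A word cloud library would then scale each word's font size by its count.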
Looks like the implementation is fairly straightforward, so check it out.
The path for bringing a trained model from the local Python/Anaconda environment to Azure ML in the cloud is broadly as follows:
Export the trained model
Zip the exported files
Upload to the Azure ML environment
Embed in your Azure ML solution
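The first two steps can be sketched locally with the standard library. This assumes the model is exported with pickle, which the excerpt does not confirm; the helper names are hypothetical, and a plain dictionary stands in for a real trained model. The upload and embed steps happen inside Azure ML and are not shown.

```python
import pickle
import tempfile
import zipfile
from pathlib import Path


def export_and_zip(model, target_dir, name="model"):
    """Pickle the model to <name>.pkl, then wrap that file in <name>.zip."""
    target_dir = Path(target_dir)
    pickle_path = target_dir / f"{name}.pkl"
    with open(pickle_path, "wb") as handle:
        pickle.dump(model, handle)
    zip_path = target_dir / f"{name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # arcname keeps only the file name, not the full local path.
        archive.write(pickle_path, arcname=pickle_path.name)
    return zip_path


def load_from_zip(zip_path, name="model"):
    """Read <name>.pkl back out of the archive and unpickle it."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(f"{name}.pkl") as handle:
            return pickle.load(handle)


# A dictionary stands in for a trained model object.
stand_in_model = {"coefficients": [0.4, -1.2], "intercept": 0.1}

with tempfile.TemporaryDirectory() as workdir:
    archive_path = export_and_zip(stand_in_model, workdir)
    restored = load_from_zip(archive_path)

print(restored == stand_in_model)
```

Round-tripping the archive locally is a cheap way to check the export before uploading anything.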
Click through to see the details. Koos did a great job making it look easy.
Next, I wanted to make the alerts be a little more meaningful. The alert for a scoring play was already pretty good – it sends something like: BUF – Q4 – TD – J.Boykin 4 yd. pass from C.Jones (pass failed) Drive: 8 plays, 83 yards in 1:08 IND (19) at BUF (18). This is good, and in fact it is what I want the rest of the alerts to look like. However, I’d like the subject of the email to have the name of the team that scored (before it was just ‘Scoring Play’).
To do that, I needed to find out how to get the name of the scoring team. This was a little tricky because the documentation for the nflgame library, though pretty good, doesn’t give a good indication of how to find this.
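As a rough illustration of the goal, and not the author's approach or anything from the nflgame API, the team abbreviation could be pulled out of an alert body shaped like the one quoted above. The function name and the subject format are hypothetical.

```python
import re


def scoring_subject(alert_body):
    """Build an email subject from an alert like 'BUF - Q4 - TD - description'.

    Falls back to the generic subject when the body does not have
    the expected four dash-separated fields.
    """
    # Accept either a hyphen or an en dash as the field separator.
    fields = re.split(r"\s+[-–]\s+", alert_body, maxsplit=3)
    if len(fields) < 4:
        return "Scoring Play"
    team, quarter, play_type, _description = fields
    return f"{team} scored ({play_type}, {quarter})"


alert = "BUF – Q4 – TD – J.Boykin 4 yd. pass from C.Jones (pass failed)"
print(scoring_subject(alert))
```

Parsing the formatted text is brittle; reading the team from the library's own play data, as the post goes on to do, is the sturdier route.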
Read on for more details, including specifics on turnovers and penalties.
Python is often used in conjunction with the scikit-learn collection of libraries. The most important libraries used for ML in Python are grouped inside a distribution called Anaconda. This is the distribution that’s also used inside Azure ML. Besides Python and scikit-learn, Anaconda contains all kinds of Data Science-oriented packages. It’s a good idea to install Anaconda as a distribution and use Jupyter (formerly IPython) as a development environment: Anaconda gives you almost the same environment on your local machine as your code will run in once in Azure ML. Jupyter gives you a nice way to keep code (in Python) and documentation (in Markdown) together.
Anaconda can be downloaded from https://www.continuum.io/downloads.
If you’re going down this path, Anaconda is absolutely a great choice.