SQL Server R Services is now SQL Server Machine Learning Services and supports Python. First, Nagesh Pabbisetty and Sumit Kumar talk about Python support:
The addition of Python builds on the foundation laid for R Services in SQL Server 2016 and extends that mechanism to include Python support for in-database analytics and machine learning. We are renaming R Services to Machine Learning Services, and R and Python are two options under this feature.
The Python integration in SQL Server provides several advantages:
Elimination of data movement: You no longer need to move data from the database to your Python application or model. Instead, you can build Python applications in the database. This eliminates barriers of security, compliance, governance, integrity, and a host of similar issues related to moving vast amounts of data around. This new capability brings Python to the data and runs code inside secure SQL Server using the proven extensibility mechanism built in SQL Server 2016.
Easy deployment: Once you have the Python model ready, deploying it in production is now as easy as embedding it in a T-SQL script, and then any SQL client application can take advantage of Python-based models and intelligence by a simple stored procedure call.
Enterprise-grade performance and scale: You can use SQL Server’s advanced capabilities like in-memory tables and columnstore indexes with the high-performance scalable APIs in the RevoScalePy package. RevoScalePy is modeled after the RevoScaleR package in SQL Server R Services. Using these with the latest innovations in the open source Python world allows you to bring unparalleled selection, performance, and scale to your SQL Python applications.
Rich extensibility: You can install and run any of the latest open source Python packages in SQL Server to build deep learning and AI applications on huge amounts of data in SQL Server. Installing a Python package in SQL Server is as simple as installing a Python package on your local machine.
Wide availability at no additional costs: Python integration is available in all editions of SQL Server 2017, including the Express edition.
We took the first step with Microsoft R Server 9.0, and this follow-on release includes significant innovations such as:
New machine learning enhancements and inclusion of pre-trained cognitive models such as sentiment analysis & image featurizers
SQL Server Machine Learning Services with integrated Python in Preview
Enterprise grade operationalization with real-time scoring and dynamic scaling of VMs
Deep customer & ISV partnerships to deliver the right solutions to customers
A panoply of sources to help you get started with ease
So today it’s my pleasure to announce the first RDBMS with built-in AI—a production-quality Community Technology Preview (CTP 2.0) of SQL Server 2017. In this preview release, we are introducing in-database support for a rich library of machine learning functions, and now for the first time Python support (in addition to R). SQL Server can also leverage NVIDIA GPU-accelerated computing through the Python/R interface to power even the most intensive deep-learning jobs on images, text, and other unstructured data. Developers can implement NVIDIA GPU-accelerated analytics and very sophisticated AI directly in the database server as stored procedures and gain orders of magnitude higher throughput. In addition, developers can use all the rich features of the database management system for concurrency, high-availability, encryption, security, and compliance to build and deploy robust enterprise-grade AI applications.
There’s a lot to digest here.
H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work – all within Flow’s browser-based environment. In this blog, we will only focus on its visualization part.
The H2O Flow web service lives in the Spark driver and is routed through the HDInsight gateway, so it can only be accessed while the Spark application/Notebook is running.
You can click the available link in the Jupyter Notebook, or you can directly access this URL:
Setup is pretty easy.
I recently got back from Strata West 2017 (where I ran a very well received workshop on Spark). One thing that really stood out for me at the exhibition hall was datashader from Continuum Analytics. I had the privilege of having Peter Wang himself demonstrate datashader for me and answer a few of my questions.
I am so excited about datashader capabilities I literally will not wait for the functionality to be exposed in rbokeh. I am going to leave my usual rmarkdown world and dust off Jupyter Notebook just to use datashader plotting. This is worth trying, even for diehard R users.
For the moment, it looks like datashader is only available for Python, but it’s coming to R.
First, let’s talk about “zipimport”. Thanks to the adoption of PEP 273, Python has been able to import modules from ZIP files since Python 2.3. This ability is called “zipimport” and is a built-in feature of Python’s existing import statement. Read the zipimport documentation now.
To review the basics:
You create a module (a .py file, etc.)
ZIP up the module into a .zip file
Add the path to the .zip file to sys.path
Then import the module
Read on for the step-by-step process.
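The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the linked post's code: the module name zipdemo_greeting, its hello function, and the archive name bundle.zip are all made up for the example.

```python
import os
import sys
import tempfile
import zipfile

# Step 1: create a module. The name "zipdemo_greeting" and its contents
# are hypothetical, chosen only for this sketch.
workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "zipdemo_greeting.py")
with open(module_path, "w") as f:
    f.write("def hello(name):\n    return 'Hello, ' + name\n")

# Step 2: ZIP up the module, storing it at the root of the archive.
zip_path = os.path.join(workdir, "bundle.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.write(module_path, arcname="zipdemo_greeting.py")

# Delete the loose .py file so the import can only be satisfied by the archive.
os.remove(module_path)

# Step 3: add the path to the .zip file itself (not its directory) to sys.path.
sys.path.insert(0, zip_path)

# Step 4: import the module; the ordinary import statement reads it from the ZIP.
import zipdemo_greeting

print(zipdemo_greeting.hello("zipimport"))
print(zipdemo_greeting.__file__)  # A path that points inside bundle.zip
```

Note that the sys.path entry is the archive file, which is what lets the import machinery treat the ZIP as if it were a directory of modules.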
Project structures often organically grow to suit people’s needs, leading to different project structures within a team. You can consider yourself lucky if at some point in time you find, or someone in your team finds, an obscure blog post with a somewhat sane structure and enforces it in your team.
Many years ago I stumbled upon ProjectTemplate for R. Since then I’ve tried to get people to use a good project structure. More recently DrivenData (what’s in a name?) released their more generic Cookiecutter Data Science.
The main philosophies of those projects are:
A consistent and well-organized structure allows people to collaborate more easily.
Your analyses should be reproducible and your structure should enable that.
A project starts from raw data that should never be edited; consider raw data immutable and only edit derived sources.
This is a set of prescriptions and focuses on the phase before the project actually kicks off.
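As a rough sketch of how the "consistent structure" and "immutable raw data" ideas might be enforced in practice, the snippet below creates a small project skeleton and marks raw files read-only. The directory names, the demo_project folder, and the sample CSV are assumptions made for illustration; they are not the exact layout of ProjectTemplate or Cookiecutter Data Science.

```python
import tempfile
from pathlib import Path

# Hypothetical layout for illustration only; real templates define their own.
LAYOUT = ["data/raw", "data/processed", "notebooks", "src", "reports"]


def create_project(root):
    """Create the same directory skeleton for every new project."""
    root = Path(root)
    for relative in LAYOUT:
        (root / relative).mkdir(parents=True, exist_ok=True)
    return root


def protect_raw(root):
    """Mark every file under data/raw read-only to discourage in-place edits."""
    for path in (Path(root) / "data" / "raw").rglob("*"):
        if path.is_file():
            path.chmod(0o444)


project = create_project(Path(tempfile.mkdtemp()) / "demo_project")

# Invented sample file standing in for a raw data drop.
raw_file = project / "data" / "raw" / "transactions.csv"
raw_file.write_text("date,transactions\n2014-01-02,42\n")
protect_raw(project)

print(sorted(p.relative_to(project).as_posix() for p in project.rglob("*")))
```

Read-only permissions are only a nudge, not a guarantee; the point is that derived files go to data/processed while everything under data/raw stays as delivered.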
The data belongs to a current client, so I won’t be disclosing any details about it.
Our models make forecasts for different shops of this company. In particular I took two shops: one that has the easiest transactions to predict of all the shops, and another with a somewhat more complicated history.
The data consists of real transactions since 2014. Data is daily, with the target being the number of transactions executed during a day. There are missing dates in the data when the shop was closed, for example New Year’s Day and Christmas.
The holidays provided to the API are the same ones I use in our model. They range from school vacations and other long periods to single holidays like Christmas Eve. In total, the data contains 46 different holidays.
It looks like Prophet has some limitations but can already make some nice predictions.
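Because the excerpt mentions daily data with dates missing on shop closures, one useful sanity check before forecasting is to list the gaps and compare them with the holiday list. The sketch below uses only the standard library and does not call Prophet; the rows, counts, and holiday entry are invented, since the client data is not public.

```python
from datetime import date, timedelta


def missing_dates(observed):
    """Return the calendar days absent between the first and last observed date."""
    present = set(observed)
    start, end = min(present), max(present)
    all_days = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [day for day in all_days if day not in present]


# Invented daily rows: (date, number of transactions).
rows = [
    (date(2016, 12, 23), 120),
    (date(2016, 12, 24), 95),
    (date(2016, 12, 26), 80),
    (date(2016, 12, 27), 110),
]

# Invented holiday calendar standing in for the 46 holidays in the post.
holidays = {date(2016, 12, 25): "Christmas"}

gaps = missing_dates(day for day, _ in rows)
unexplained = [day for day in gaps if day not in holidays]

print("Missing dates:", gaps)
print("Missing dates not covered by a holiday:", unexplained)
```

Gaps that do not line up with a known holiday are worth investigating before they are handed to any forecasting model.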
One of the nifty things about using R is that you can use it for many different purposes and even other languages!
If you want to use Python in your knitr docs or the newish RStudio R notebook functionality, you might encounter some fiddliness getting all the moving parts running on Windows. This is a quick knitr Python Windows setup checklist to make sure you don’t miss any important steps.
Between knitr, Zeppelin, and Jupyter, you should be able to find a cross-compatible notebook which works for you.
Within Machine Learning many tasks are – or can be reformulated as – classification tasks.
In classification tasks we are trying to produce a model which captures the relationship between the input data and the class each input belongs to. This model is formed from the feature values of the input data. For example, the dataset contains datapoints belonging to the classes Apples, Pears and Oranges, and based on the features of the datapoints (weight, color, size, etc.) we are trying to predict the class.
Ahmet has his entire post saved as a Jupyter notebook.
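To make the fruit example concrete, here is a minimal nearest-centroid classifier in plain Python. It is not taken from Ahmet's notebook; nearest-centroid is simply one of the smallest techniques that maps features to a class, and the weights and diameters below are invented.

```python
from collections import defaultdict
from math import dist  # Requires Python 3.8 or later


def fit_centroids(samples):
    """Average the feature vectors of each class to get one centroid per class."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Invented training data: (weight in grams, diameter in cm) -> class.
training = [
    ((150, 7.0), "Apple"), ((170, 7.5), "Apple"),
    ((180, 6.0), "Pear"), ((200, 6.5), "Pear"),
    ((130, 8.0), "Orange"), ((140, 8.5), "Orange"),
]

centroids = fit_centroids(training)
print(predict(centroids, (160, 7.2)))
```

In this toy data the weight feature dominates the distance because its scale is much larger than the diameter's; a real model would normally rescale features first.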
Programming is one of the five main competence areas at the base of the skill set for a Data Scientist, even if it is not the most relevant in terms of expertise (see What is the right mix of competences for Data Scientists?). Considering the results of the survey, which has involved more than 200 Data Scientists worldwide to date, there isn’t a prevailing choice among the programming languages used in data science activities. However, the choice appears to be limited mainly to a small set of alternatives: almost 96% of respondents report using at least one of R, SQL or Python.
These results don’t surprise me much. R has slightly more traction than Python, but the percentage of people using both is likely to increase. SQL, meanwhile, is vital for getting data, and as we’re seeing in the Hadoop space, as data platform products get more mature, they tend to gravitate toward a SQL or SQL-like language. Cf. Hive, Spark SQL, Phoenix, etc.
However, there were other errors, which I suspect are related to Python 2.7 vs. Python 3.5. Rather than solve those, I went on to the columnstore demo. In this, you create a table with 5 million rows and then run a query against it from Python. I did that, then created the columnstore index, then ran it again. The results are below.
And within an hour or so of starting, Steve has hit the 2.x vs 3.x mess in Python.
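The post does not say which errors Steve ran into, so the following is only a sketch of a few common Python 2 vs. 3 differences that tend to break scripts written for the other version. It runs under Python 3, and all values are placeholders.

```python
# Written for Python 3; each comment notes what Python 2 would have done.

# 1. print is a function. Python 2 also accepted the statement form:
#    print "rows:", 5
print("rows:", 5)

# 2. "/" is true division. Python 2 floor-divided two ints, so 7 / 2 gave 3.
true_quotient = 7 / 2
floor_quotient = 7 // 2  # Floor division gives 3 in both versions.

# 3. Text and bytes are distinct types. In Python 2, str was a byte string.
payload = "columnstore".encode("utf-8")
text = payload.decode("utf-8")

# 4. dict.keys() returns a view rather than a list, so it cannot be indexed
#    directly; wrap it in list() first. (Placeholder keys and values.)
settings = {"a": 1, "b": 2}
first_key = list(settings.keys())[0]

print(true_quotient, floor_quotient, type(payload).__name__, text, first_key)
```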