H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, however, the latter follows the conventions of Apache Spark’s MLlib library and allows you to build machine learning pipelines that include MLlib transformers. We will focus on the latter API in this post.
H2OAutoML supports classification and regression. The ML models built and tuned by H2OAutoML include Random Forests, Gradient Boosting Machines, Deep Neural Nets, Generalized Linear Models, and Stacked Ensembles.
The post only has a few lines of code but there are a lot of working parts under the surface.
In my functions file, I created 3 new functions for controlling the lights: invoke-setLight, invoke-Pulse, and invoke-Breathe. To understand what the API was expecting, I followed the Lifx API documentation here. As far as API documentation goes, this one is pretty good. Most functions have an interactive portion at the bottom which allows you to test it out yourself and also see what inputs the API expects.
As a Bills fan, at least I wouldn’t have to worry about the lights wearing out from overuse.
Today is a great day for Apache Airflow as it graduates from incubating status to a Top-Level Apache project. This is the next step of maturity for Airflow. For those unfamiliar, Airflow is an orchestration tool to schedule and orchestrate your data workflows. From ETL to training of models, or any other arbitrary tasks. Unlike other orchestrators, everything is written in Python, which makes it easy to use for both engineers and scientists. Having everything in code means that it is easy to version and maintain.
Airflow has been getting some hype lately, especially in the AWS space.
We can see that there are no libraries installed and scoped specifically to this notebook. Now I’m going to install a later version of SciPy, restart the python interpreter, and then run that same helper function we ran previously to list any libraries installed and scoped specifically to this notebook session. When using the list() function PyPI libraries scoped to this notebook session are displayed as <library_name>-<version_number>-<repo>, and (empty) indicates that the corresponding part has no specification. This also works with wheel and egg install artifacts, but for the sake of this example we’ll just be installing the single package directly.
This does seem easier than dropping to a shell and installing with Pip, especially if you need different versions of libraries.
First things first, you’ll want to choose your version of SQL Server. Python is available on 2017 and greater. For this demo I’ll be using SQL Server 2019 Developer Edition (CTP 2.2).
With 2019 CTP2.2 they’ve increased the requirement of your OS too, in my example I had a spare VM with Windows Server 2012 laying around but I needed to update this to Server 2016. Check the relevant documentation for the version you’re using.
Click through for a step by step guide with plenty of screenshots.
I’m really amazed at how easy creating the container was. It took only 11 lines to spin up a Linux environment on my own machine. The majority of the commands (7 of the 11) are simply adding the files and dependencies. I’m also pretty shocked that I didn’t have to do anything to my Python script to get this to work. I had assumed I would need to do something but, I didn’t. Very cool! Also, by using the following command while my Python script is running, I see that this is only taking up 1.3 GB!
Click through for scripts and important lessons learned along the way.
Last month, I delivered the one-day workshop Practical AI for the Working Software Engineer at the Artificial Intelligence Live conference in Orlando. As the title suggests, the workshop was aimed at developers, bu I didn’t assume any particular programming language background. In addition to the lecture slides, the workshop was delivered as a series of Jupyter notebooks. I ran them using Azure Notebooks (which meant the participants had nothing to install and very little to set up), but you can run them in any Jupyter environment you like, as long as it has access to R and Python. You can download the notebooks and slides from this Github repository (and feedback is welcome there, too).
Read on for details about those notebooks and to get your own copies.
I prefer it to R mostly because I don’t have to create the csv-file(names) in advance before I import data to it. This is particularly important for scenarios where I want to append data to an existing file. The key for this task is NOT to use the append-option that Python offers, because M-scripts will be executed multiple times and this would create a total mess in my file. Instead I create a new file with the context to append and use the Import-from-folder method instead to stitch all csvs back together. Therefore I have to dynamically create new filenames for each import. So when the M-Python-scripts are executed repetitively here, the newly created file will just be overwritten – which doesn’t do any harm.
Click through for the code as well as a few caveats.
The parallel processing holds two varieties of execution: Synchronous and Asynchronous.
In synchronous execution, once a process starts execution, it puts a lock over the main program until its get accomplished.
While the asynchronous execution doesn’t require locking, it performs a task quickly but the outcome can be in the rearranged order.
Click through for a few examples using Pool.
This post came about due to a question on the Microsoft Machine Learning Server forum. The question was if there are any plans by Microsoft to support more the one input dataset (
sp_execute_external_script. My immediate reaction was that if you want more than one dataset, you can always connect from the script back into the database, and retrieve data.
However, the poster was well aware of that, but due to certain reasons he did not want to do it that way – he wanted to push in the data, fair enough. When I read this, I seemed to remember something from a while ago, where, instead of retrieving data from inside the script, they pushed in the data, serialized it as an output parameter and then used the binary representation as in input parameter (yeah – this sounds confusing, but bear with me). I did some research (read Googling), and found this StackOverflow question, and answer. So for future questions, and for me to remember, I decided to write a blog post about it.
This has been a point of frustration for me. We can name the one input data set, so I’d really like to see true support for input multiple data sets without the need for hacks.