For this analysis I’m using the SAS Open Source library called SWAT (Scripting Wrapper for Analytics Transfer) to code in Python and execute SAS CAS Action Sets. SWAT acts as a bridge between the python language to CAS Action Sets. CAS Action Sets are synonymous to libraries in Python or packages in R. The one main difference and benefit is that the algorithms within these action sets have been highly parallelized to run on a CAS (Cloud Analytic Services) server. The CAS server is a distributed in-memory engine where I can do all my heavy lifting or computations. The code and Jupyter Notebook are available on GitHub.
Click through for the analysis.
After a small bit of research I discovered the concept of monkey patching (modifying a program to extend its local execution) the DataFrame object to include a transform function. This function is missing from PySpark but does exist as part of the Scala language already.
The following code can be used to achieve this, and can be stored in a generic wrapper functions notebook to separate it out from your main code. This can then be called to import the functions whenever you need them.
Things which make Python more of a functional language are fine by me. Even though I’d rather use Scala.
H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, however, the latter follows the conventions of Apache Spark’s MLlib library and allows you to build machine learning pipelines that include MLlib transformers. We will focus on the latter API in this post.
H2OAutoML supports classification and regression. The ML models built and tuned by H2OAutoML include Random Forests, Gradient Boosting Machines, Deep Neural Nets, Generalized Linear Models, and Stacked Ensembles.
The post only has a few lines of code but there are a lot of working parts under the surface.
We can see that there are no libraries installed and scoped specifically to this notebook. Now I’m going to install a later version of SciPy, restart the python interpreter, and then run that same helper function we ran previously to list any libraries installed and scoped specifically to this notebook session. When using the list() function PyPI libraries scoped to this notebook session are displayed as <library_name>-<version_number>-<repo>, and (empty) indicates that the corresponding part has no specification. This also works with wheel and egg install artifacts, but for the sake of this example we’ll just be installing the single package directly.
This does seem easier than dropping to a shell and installing with Pip, especially if you need different versions of libraries.
We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different blog (please see the video)- to summarize, as we trained the car on a track, we collected about 30K images with corresponding steering angle data. The training data was stored in a HDP 3.0 cluster and the TensorFlow model was trained using 6 GPU cards and then the model was deployed back on the car. The deep learning use case highlights the combined power of HDP 3.0.
Click through for more additions and demos.
There’s no video yet available of Joel’s talk, but you can guess the theme of that opening slide, and walking through the slides conveys the message well, I think. Yuhui Xie, author and creator of the rmarkdown package, provides a detailed summary and response to Joel’s talk, where he lists Joel’s main critiques of Notebooks:
Hidden state and out-of-order execution
Notebooks are difficult for beginners
Notebooks encourage bad habits
Notebooks discourage modularity and testing
Jupyter’s autocomplete, linting, and way of looking up the help are awkward
Notebooks encourage bad processes
Notebooks hinder reproducible + extensible science
Notebooks make it hard to copy and paste into Slack/Github issues
Errors will always halt execution
Notebooks make it easy to teach poorly
Notebooks make it hard to teach well
Read the whole thing. I agree with some of these points, but disagree with a few on the list.
On the surface, notebooks pose a lot of challenges: they’re frequently changed, their cell outputs need not match the code, they’re difficult to test, and there’s no easy way to dynamically configure their execution. Furthermore, you need a notebook server to run them, which creates architectural dependencies to facilitate execution. These issues caused some initial push-back internally at the idea. But that has changed as we’ve brought in new tools to our notebook ecosystem.
The biggest game-changer for us is Papermill. Papermill is an nteract library built for configurable and reliable execution of notebooks with production ecosystems in mind. What Papermill does is rather simple. It take a notebook path and some parameter inputs, then executes the requested notebook with the rendered input. As each cell executes, it saves the resulting artifact to an isolated output notebook.
Papermill does look quite interesting.
Notebooks were first introduced at Netflix to support data science workflows. As their adoption grew among data scientists, we saw an opportunity to scale our tooling efforts. We realized we could leverage the versatility and architecture of Jupyter notebooks and extend it for general data access. In Q3 2017 we began this work in earnest, elevating notebooks from a niche tool to a first-class citizen of the data platform.
From our users’ perspective, notebooks offer a convenient interface for iteratively running code, exploring output, and visualizing data — all from a single cloud-based development environment. We also maintain a Python library that consolidates access to platform APIs. This means users have programmatic access to virtually the entire platform from within a notebook.Because of this combination of versatility, power, and ease of use, we’ve seen rapid organic adoption for all user types across the entire Data Platform.
Today, notebooks are the most popular tool for working with data at Netflix.
Good article. I love notebooks for two reasons: pedagogical purposes (it’s easier to show a demo in a notebook) and forcing you to work linearly.
How can you get the interactivity back and get our changes immediately in our Notebook? Add
%autoreloadat the top of your Notebook:%loadext autoreload # Load the extension%autoreload 2 # Autoreload all modules
%autoreloadis a Jupyter extension that reloads modules before executing your code. Functions and classes loaded in notebooks get their functionality updated every time you execute a cell. This means that when new code is saved in the editor, the changes are immediately loaded in your Notebook if you run a cell.
%autoreloadbridges the gap between Notebook and IDE. You gain all the benefits of an IDE, but you’re still as flexible as before! See the GIF at the top as an example.
That’s a useful trick. I’ve tended to use notebooks more for post-hoc work, where I’ve already structured my code and want to formalize it for others to use.
Let’s look at a concrete example with the Click-Through Rate Prediction dataset of ad impressions and clicks from the data science website Kaggle. The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click.
To build our advanced analytics workflow, let’s focus on the three main steps:
Data Exploration, for example, using SQL
Advanced Analytics / Machine Learning
The Databricks blog has a couple other examples, but this was the most interesting one for me.