There’s a new feature in Azure, and I stumbled on it when someone posted a link on Twitter. Apologies, I can’t remember who, but I did click on the Azure Notebooks link and was intrigued. I’ve gotten Jupyter notebooks running on my local laptop, but these are often just on one machine. Having a place to share a notebook in the cloud is cool.
Once I clicked on the link, I found that there are R and Python notebooks, as well as F#. These allow you to essentially build a page of code and share it. It’s kind of like a REPL, kind of like a story. It’s a neat way of working through a problem. I clicked the Get Started link to get going and was prompted for a User ID.
I’m a major fan of using notebooks for validating results as well as training people.
No one can deny how large the online support community for data science is. Today, it’s possible to teach yourself Python and other programming languages in a matter of weeks. And if you’re ever in doubt, there’s a StackOverflow thread or something similar waiting to give you the perfect piece of code to help you.
But when it came to pushing it to production, we found very little documentation online. Most data scientists seem to work on Python notebooks in a silo. They process large volumes of data and analyze it — but within the confines of Jupyter Notebooks. And most of the resources we’ve found while growing as data scientists revolve around Jupyter Notebooks.
Another option might be to use JupyterHub.
JupyterLab is an interactive development environment for working with notebooks, code and data. Most importantly, JupyterLab has full support for Jupyter notebooks. Additionally, JupyterLab enables you to use text editors, terminals, data file viewers, and other custom components side by side with notebooks in a tabbed work area.
JupyterLab provides a high level of integration between notebooks, documents, and activities:
- Drag-and-drop to reorder notebook cells and copy them between notebooks.
- Run code blocks interactively from text files (.py, .R, .md, .tex, etc.).
- Link a code console to a notebook kernel to explore code interactively without cluttering up the notebook with temporary scratch work.
- Edit popular file formats with live preview, such as Markdown, JSON, CSV, Vega, VegaLite, and more.
I like this, as I’m a big fan of notebooks but sometimes you just want to write some diagnostic queries and an IDE is way better for that. H/T Giovanni Lanzani
This can be downloaded from here. Unzip it and run the jupyter-scala.ps1 script on Windows with elevated permissions to install.
The kernel files will end up in <UserProfileDir>\AppData\Roaming\jupyter\kernels\scala-develop and the kernel will appear in Jupyter with the default name of ‘Scala (develop)’. You can of course change this in the respective kernel.json file.
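If you would rather script the rename than edit kernel.json by hand, here is a minimal sketch. It assumes the file follows the usual Jupyter kernel spec layout with a `display_name` key; the `rename_kernel` helper and the path in the usage note are illustrative, not part of the jupyter-scala installer.

```python
import json
from pathlib import Path


def rename_kernel(kernel_json: Path, new_name: str) -> dict:
    """Set display_name in a kernel.json file and return the updated spec.

    Hypothetical helper: it only rewrites the display_name key and leaves
    every other key in the kernel spec untouched.
    """
    spec = json.loads(kernel_json.read_text(encoding="utf-8"))
    spec["display_name"] = new_name
    kernel_json.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return spec
```

On Windows you might call it as `rename_kernel(Path(os.environ["APPDATA"]) / "jupyter" / "kernels" / "scala-develop" / "kernel.json", "Scala")`, assuming the default install location mentioned above. Restart Jupyter afterwards so it picks up the new name.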
Click through to see how to install a few other kernels with various levels of configuration.
The Jupyter notebook environment consists of a browser-based notebook UI and a back-end server, running on port 8888 by default (if this port is taken, it will start up on the next available port). This web server-based delivery of notebooks means that you can browse to a remote server and execute your code there. This is the case, for example, when using a ready-made cluster such as an HDInsight Spark cluster, where all the tooling has been pre-installed for you. You open the notebook in the cluster portal within Azure, and it logs you in to the Jupyter server running on a node within the cluster. Note that if you want to allow multi-user access to your local Jupyter environment, you’ll need to be running a product such as JupyterHub.
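To get a feel for the "next available port" behavior, here is a rough sketch that probes ports upward from 8888 until one can be bound. It only mimics the behavior described above and is not Jupyter's own port-selection code; the `first_free_port` name and the 50-attempt limit are my own assumptions.

```python
import socket


def first_free_port(start: int = 8888, attempts: int = 50, host: str = "127.0.0.1") -> int:
    """Return the first port at or above `start` that can be bound on `host`.

    Illustrative only: a port that is free now can be taken by another
    process before you actually use it.
    """
    last = min(start + attempts, 65536)  # stay inside the valid port range
    for port in range(start, last):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is in use (or not permitted), try the next one
            return port
    raise RuntimeError(f"no free port found in range {start}-{last - 1}")
```

This can be handy for guessing which port a second local Jupyter instance is likely to land on when 8888 is already taken.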
I love using Jupyter when presenting because it’s the easiest way to intermix code, documentation, and images in one package, so it’s nice for pedagogical purposes.
Binder lets you easily host interactive Jupyter notebooks and let anyone on the internet use them interactively immediately! It uses JupyterHub under the hood.
If you want to try it out, you can do that right now:
- Go to https://mybinder.org/v2/gh/jvns/pandas-cookbook/master (which will launch the github.com/jvns/pandas-cookbook repository)
- Wait for it to build and click ‘launch’
- click ‘cookbook’, click a notebook, and play around! There’s an “A quick tour of the IPython Notebook” notebook which shows off some of the basic features.
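The launch link above follows a simple pattern: the GitHub owner, the repository name, and a branch. As a small sketch, assuming that pattern holds for other public GitHub repositories, you could build such links like this (the `binder_url` helper is just for illustration):

```python
def binder_url(owner: str, repo: str, ref: str = "master") -> str:
    """Build a mybinder.org launch URL for a public GitHub repository.

    The URL shape is taken from the link above; the default ref of
    "master" matches that example and may not suit every repository.
    """
    parts = [owner.strip("/"), repo.strip("/"), ref.strip("/")]
    return "https://mybinder.org/v2/gh/" + "/".join(parts)


if __name__ == "__main__":
    # Prints the same link used in the steps above.
    print(binder_url("jvns", "pandas-cookbook"))
```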
It apparently uses Kubernetes + Docker under the hood which is interesting! It must be much much more expensive to run than the read-only services, but it’s such a useful and cool thing! I hope it continues to exist.
Definitely worth checking out. I’m going to have to see the steps for getting an R runtime so I can post some of my own notebook repos.
We start with a 16.04 image, run some upgrades, install Python, upgrade pip, install our requirements, and expose port 8888 (Jupyter’s default port).
Here is our requirements.txt file
Notice how Jupyter is in there. I also added a few other things that I very commonly use, including numpy, pandas, plotly, scikit-learn, and some Azure stuff.
The big benefit to doing this is that your installation of Jupyter can exist independently from your notebooks, so if you accidentally mess up Jupyter, you can kill the container and reload from the image in a couple of commands.
For the rest of this post, I assume that you have some basic familiarity with Python, Pandas and Jupyter.
On your machine, you will need all of the following installed:
Python 2 or 3 with Pip
Amit shows two separate methods for retrieving data, so check it out.
One of the nifty things about using R is that you can use it for many different purposes and even other languages!
If you want to use Python in your knitr docs or the newish RStudio R notebook functionality, you might encounter some fiddliness getting all the moving parts running on Windows. This is a quick knitr Python Windows setup checklist to make sure you don’t miss any important steps.
Between knitr, Zeppelin, and Jupyter, you should be able to find a cross-compatible notebook which works for you.
By default (with no --password and --port arguments), Jupyter will run on port 8888 with no password protection; JupyterHub will run on port 8000. The --port and --jupyterhub-port arguments can be used to override the default ports to avoid conflicts with other applications.
The --r option installs the IRKernel for R. It also installs SparkR and sparklyr for R, so make sure Spark is one of the selected EMR applications to be installed. You’ll need the Spark application if you use the --toree argument.
If you used --jupyterhub, use Linux users to sign in to JupyterHub. (Be sure to create passwords for the Linux users first.) hadoop, the default admin user for JupyterHub, can be used to set up other users. The --password option sets the password for Jupyter and for the hadoop user for JupyterHub.
Installation is fairly straightforward, and they include a series of samples you can get to try out Jupyter.