For the rest of this post, I assume that you have some basic familiarity with Python, Pandas and Jupyter.
On your machine, you will need all of the following installed:
Python 2 or 3 with Pip
Amit shows two separate methods for retrieving data, so check it out.
One of the nifty things about using R is that you can use it for many different purposes and even other languages!
If you want to use Python in your knitr docs or the newish RStudio R notebook functionality, you might encounter some fiddliness getting all the moving parts running on Windows. This is a quick knitr Python Windows setup checklist to make sure you don’t miss any important steps.
Between knitr, Zeppelin, and Jupyter, you should be able to find a cross-compatible notebook which works for you.
By default (with no --password and --port arguments), Jupyter will run on port 8888 with no password protection; JupyterHub will run on port 8000. The --port and --jupyterhub-port arguments can be used to override the default ports to avoid conflicts with other applications.
The --r option installs the IRKernel for R. It also installs SparkR and sparklyr for R, so make sure Spark is one of the selected EMR applications to be installed. You’ll need the Spark application if you use the --toree argument.
If you used --jupyterhub, use Linux users to sign in to JupyterHub. (Be sure to create passwords for the Linux users first.) hadoop, the default admin user for JupyterHub, can be used to set up other users. The –password option sets the password for Jupyter and for the hadoop user for JupyterHub.
Installation is fairly straightforward, and they include a series of samples you can get to try out Jupyter.
There are several settings you can adjust. Basically, there are two main files in the ZEPPELIN_DIR\conf :
In the first one you can configure some interpreter settings. In the second more aspects related to the Website, like for instance, the Zeppelin server port (I am using the 8080 but most probably yours is already used by another application)
This is a very clear walkthrough. Jupyter is still easier to install, but Paul’s blog post lowers that Zeppelin installation learning curve.
Here’s an example of how we use git and GitHub. One beautiful new feature of Github is that they now render Jupyter Notebooks automatically in repositories.
When we do our analysis, we do internal reviews of our code and our data science output. We do this with a traditional pull-request approach. When issuing pull-requests, however, looking at the differences between updated .ipynb files, the updates are not rendered in a helpful way. One solution people tend to recommend is to commit the conversion to .py instead. This is great for seeing the differences in the input code (while jettisoning the output), and is useful for seeing the changes. However, when reviewing data science work, it is also incredibly important to see the output itself.
So far, I’ve treated notebooks more as presentation media and used tools like R Studio for tinkering. This shifts my priors a bit.
Notebooks are very helpful in building a pipeline even with compiled artifacts. Being able to visualize data and interactively experiment with transformations makes it much easier to write code in small, testable chunks. More importantly, the development of most data pipelines begins with exploration, which is the perfect use case for notebooks. As an example, Yesware regularly uses Databricks Notebooks to prototype new features for their ETL pipeline.
On the flip side, teams also run into problems as they use notebooks to take on more complex data processing tasks:
- Logic within notebooks becomes harder to organize. Exploratory notebooks start off as simple sequences of Spark commands that run in order. However, it is common to make decisions based on the result of prior steps in a production pipeline – which is often at odds with how notebooks are written during the initial exploration.
- Notebooks are not modular enough. Teams need the ability to retry only a subset of a data pipeline so that a failure does not require re-running the entire pipeline.
These are the common reasons that teams often re-implement notebook code for production. The re-implementation process is time-consuming, tedious, and negates the interactive properties of notebooks.
Those two reasons are why I’ve argued that you should sit down in front of a REPL and figure out what you’re doing with a particular data set. Once you’ve got it figured out, perform the operations in a notebook for posterity and to replicate your actions later. I’m curious to see how this gets adopted in practice.
JupyterLab uses a web-based UI that’s akin to the tab-and-panel interface used in IDEs like Visual Studio or Eclipse. Notebooks, command-line consoles, code editors, language references, and many more items can be arranged in various combinations, powered by the PhosphorJSframework.
“The entire JupyterLab [project] is built as a collection of plugins that talk to kernels for code execution and that can communicate with one another,” the developers wrote. “We hope the community will develop many more plugins for new use cases that go far beyond the basic system.”
It looks like they’re making major changes to keep up with Zeppelin on the back end. The biggest advantage Jupyter had for me over Zeppelin was its installation simplicity, so I hope they keep it just as easy as installing Anaconda and then loading JupyterLab.
My goal is to do some of the things that I did in my Touching on Advanced Topics post. Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below. As a result, I was only able to do some—but not all—of the anticipated work. I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.
With that in mind, let’s start messing around.
SparkR is a bit of a mindset change from traditional R.
Remember chemistry class in high school or college? You might remember having to keep a lab notebook for your experiments. The purpose of this notebook was two-fold: first, so you could remember what you did and why you did each step; second, so others could repeat what you did. A well-done lab notebook has all you need to replicate an experiment, and independent replication is a huge part of what makes hard sciences “hard.”
Take that concept and apply it to statistical analysis of data, and you get the type of notebook I’m talking about here. You start with a data set, perform cleansing activities, potentially prune elements (e.g., getting rid of rows with missing values), calculate descriptive statistics, and apply models to the data set.
I didn’t realize just how useful notebooks were until I started using them regularly.
There are a couple of notes with these clusters:
These are not powerful clusters. Don’t expect to crunch huge data sets with them. Notice that the cluster has only 6 GB of RAM, so you can expect to get maybe a few GB of data max.
The cluster will automatically terminate after one hour without activity. The paid version does not have this limitation.
You interact with the cluster using notebooks rather than opening a command prompt. In practice, this makes interacting with the cluster a little more difficult, as a good command prompt can provide features such as auto-complete.
Databricks Community Edition has a nice interface, is very easy to get up and running and—most importantly—is free. Read the whole thing.