My goal is to do some of the things that I did in my Touching on Advanced Topics post. Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below. As a result, I was only able to do some—but not all—of the anticipated work. I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.
With that in mind, let’s start messing around.
SparkR is a bit of a mindset change from traditional R.
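To make that mindset change concrete, here's a minimal sketch of the difference. In base R you manipulate an in-memory data frame directly; in SparkR you promote it to a distributed SparkDataFrame and operate on it lazily. This assumes a running Spark 2.x session; the `faithful` data set is just R's built-in example data.

```r
library(SparkR)
sparkR.session()

# Base R: subset an in-memory data.frame directly.
local_rows <- faithful[faithful$waiting > 70, ]

# SparkR: promote the data.frame to a distributed SparkDataFrame,
# then express the same subset as a lazy, distributed filter.
df <- as.DataFrame(faithful)
head(filter(df, df$waiting > 70))
```

The payoff is that the same `filter` call works unchanged whether `df` wraps a tiny local data frame or a multi-node data set, but you give up the instant, row-by-row interactivity of base R.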
Remember chemistry class in high school or college? You might remember having to keep a lab notebook for your experiments. The purpose of this notebook was two-fold: first, so you could remember what you did and why you did each step; second, so others could repeat what you did. A well-done lab notebook has all you need to replicate an experiment, and independent replication is a huge part of what makes hard sciences “hard.”
Take that concept and apply it to statistical analysis of data, and you get the type of notebook I’m talking about here. You start with a data set, perform cleansing activities, potentially prune elements (e.g., getting rid of rows with missing values), calculate descriptive statistics, and apply models to the data set.
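That workflow (load, cleanse, prune, describe, model) maps directly onto a few SparkR calls. The sketch below assumes Spark 2.0+ and a hypothetical CSV file with numeric columns `x` and `y`; the path and column names are placeholders, not anything from the original post.

```r
library(SparkR)
sparkR.session()

# Load a data set (hypothetical path; adjust to your environment).
df <- read.df("/data/measurements.csv", source = "csv",
              header = "true", inferSchema = "true")

# Prune rows with missing values.
clean <- dropna(df)

# Descriptive statistics for the numeric columns.
showDF(describe(clean))

# Apply a model: a simple Gaussian GLM over the cleaned data.
model <- spark.glm(clean, y ~ x, family = "gaussian")
summary(model)
```

Each step lands in its own notebook cell, which is exactly what makes the lab-notebook analogy work: the cells record both the operations and their order.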
I didn’t realize just how useful notebooks were until I started using them regularly.
A couple of notes about these clusters:
These are not powerful clusters, so don’t expect to crunch huge data sets with them. The cluster has only 6 GB of RAM, which means you can expect to work with at most a few GB of data.
The cluster will automatically terminate after one hour without activity. The paid version does not have this limitation.
You interact with the cluster using notebooks rather than opening a command prompt. In practice, this makes interacting with the cluster a little more difficult, as a good command prompt can provide features such as auto-complete.
Databricks Community Edition has a nice interface, is very easy to get up and running and—most importantly—is free. Read the whole thing.
Over the last year, there have been several key improvements to Apache Zeppelin that have been contributed by a diverse group of developers. Some of the highlights are:
- Security features: authentication, access control, LDAP support
- Sharing and collaboration: notebook import/export
- Interpreters: a notable R interpreter, and others too numerous to list
The pluggable nature of the Apache Zeppelin interpreter architecture has made it easy to add support for new interpreters. There are now over 30 interpreters, supporting everything from Spark, Hive, MySQL, and R to things like Geode and HBase.
It’s an exciting time to be in the world of data analysis.