Spark Notebook Workflows

Dave Wang, Eric Liang, and Maddie Schults introduce Notebook Workflows:

Notebooks are very helpful in building a pipeline even with compiled artifacts. Being able to visualize data and interactively experiment with transformations makes it much easier to write code in small, testable chunks. More importantly, the development of most data pipelines begins with exploration, which is the perfect use case for notebooks. As an example, Yesware regularly uses Databricks Notebooks to prototype new features for their ETL pipeline.

On the flip side, teams also run into problems as they use notebooks to take on more complex data processing tasks:

  • Logic within notebooks becomes harder to organize. Exploratory notebooks start off as simple sequences of Spark commands that run in order. However, it is common to make decisions based on the result of prior steps in a production pipeline – which is often at odds with how notebooks are written during the initial exploration.
  • Notebooks are not modular enough. Teams need the ability to retry only a subset of a data pipeline so that a failure does not require re-running the entire pipeline.

These are the common reasons that teams often re-implement notebook code for production. The re-implementation process is time-consuming, tedious, and negates the interactive properties of notebooks.

Those two reasons are why I’ve argued that you should sit down in front of a REPL and figure out what you’re doing with a particular data set. Once you’ve got it figured out, record the operations in a notebook, both for posterity and so that you can replicate your actions later. I’m curious to see how Notebook Workflows get adopted in practice.
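
To make the two quoted problems concrete, here is a minimal sketch in plain Python of a pipeline that branches on the result of an earlier step and retries only the step that failed. It is not taken from the Databricks post and does not use any Databricks API: `run_with_retry`, `run_pipeline`, and the `extract`/`transform`/`load` callables are hypothetical names, and each callable simply stands in for whatever would run one notebook.

```python
from typing import Any, Callable, List


def run_with_retry(step: Callable[[], Any], max_attempts: int = 3) -> Any:
    """Run a single step, retrying only that step if it raises."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return step()
        except Exception as err:  # illustrative; real code would catch narrower errors
            last_error = err
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error


def run_pipeline(
    extract: Callable[[], List[int]],
    transform: Callable[[List[int]], List[int]],
    load: Callable[[List[int]], int],
    max_attempts: int = 3,
) -> str:
    """Chain three hypothetical steps, branching on the extract result."""
    rows = run_with_retry(extract, max_attempts)
    if not rows:
        # Decision based on a prior step: nothing to process, so stop early.
        return "skipped"
    cleaned = run_with_retry(lambda: transform(rows), max_attempts)
    loaded = run_with_retry(lambda: load(cleaned), max_attempts)
    return f"loaded {loaded} rows"


# Usage: a load step that fails once, then succeeds.
calls = {"extract": 0, "load": 0}


def extract() -> List[int]:
    calls["extract"] += 1
    return [1, -2, 3]


def transform(rows: List[int]) -> List[int]:
    return [r for r in rows if r > 0]


def flaky_load(rows: List[int]) -> int:
    calls["load"] += 1
    if calls["load"] == 1:
        raise IOError("transient failure")
    return len(rows)


result = run_pipeline(extract, transform, flaky_load)
```

Because the retry wraps each step separately, the failed load is re-run on its own and the extract step is not repeated.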
