Spark Notebook Workflows

Dave Wang, Eric Liang, and Maddie Schults introduce Notebook Workflows:

Notebooks are very helpful in building a pipeline even with compiled artifacts. Being able to visualize data and interactively experiment with transformations makes it much easier to write code in small, testable chunks. More importantly, the development of most data pipelines begins with exploration, which is the perfect use case for notebooks. As an example, Yesware regularly uses Databricks Notebooks to prototype new features for their ETL pipeline.

On the flip side, teams also run into problems as they use notebooks to take on more complex data processing tasks:

  • Logic within notebooks becomes harder to organize. Exploratory notebooks start off as simple sequences of Spark commands that run in order. However, it is common to make decisions based on the result of prior steps in a production pipeline – which is often at odds with how notebooks are written during the initial exploration.
  • Notebooks are not modular enough. Teams need the ability to retry only a subset of a data pipeline so that a failure does not require re-running the entire pipeline.

These are the common reasons that teams often re-implement notebook code for production. The re-implementation process is time-consuming, tedious, and negates the interactive properties of notebooks.

Those two reasons are why I’ve argued that you should sit down in front of a REPL and figure out what you’re doing with a particular data set. Once you’ve got it figured out, record the operations in a notebook for posterity and so that you can replicate them later. I’m curious to see how this gets adopted in practice.
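
To make the two pain points from the excerpt concrete (branching on the result of a prior step, and retrying only the step that failed), here is a minimal sketch in plain Python. It is not the Notebook Workflows API: the names `run_with_retry` and `run_pipeline`, the step names, and the `"new_data"` status string are all hypothetical, and each step is an ordinary callable standing in for whatever would actually invoke a notebook.

```python
from typing import Callable, Dict, List


def run_with_retry(step: Callable[[], str], max_attempts: int = 3) -> str:
    """Run a single step, retrying only that step if it raises."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return step()
        except Exception as err:  # broad on purpose: this is only a sketch
            last_error = err
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error


def run_pipeline(steps: Dict[str, Callable[[], str]]) -> List[str]:
    """Run 'ingest', then decide whether to run 'transform' based on its result.

    Both step names and the 'new_data' status value are hypothetical.
    """
    executed = []

    # A failure here retries the ingest step alone, not the whole pipeline.
    status = run_with_retry(steps["ingest"])
    executed.append("ingest")

    # Branch on the result of the prior step.
    if status == "new_data":
        run_with_retry(steps["transform"])
        executed.append("transform")

    return executed
```

A step that fails transiently would be retried in isolation, and the transform step runs only when ingest reports new data. In a real workflow each callable would wrap a notebook invocation rather than a local function.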
