Spark Notebook Workflows

Dave Wang, Eric Liang, and Maddie Schults introduce Notebook Workflows:

Notebooks are very helpful in building a pipeline even with compiled artifacts. Being able to visualize data and interactively experiment with transformations makes it much easier to write code in small, testable chunks. More importantly, the development of most data pipelines begins with exploration, which is the perfect use case for notebooks. As an example, Yesware regularly uses Databricks Notebooks to prototype new features for their ETL pipeline.

On the flip side, teams also run into problems as they use notebooks to take on more complex data processing tasks:

  • Logic within notebooks becomes harder to organize. Exploratory notebooks start off as simple sequences of Spark commands that run in order. However, it is common to make decisions based on the result of prior steps in a production pipeline – which is often at odds with how notebooks are written during the initial exploration.
  • Notebooks are not modular enough. Teams need the ability to retry only a subset of a data pipeline so that a failure does not require re-running the entire pipeline.

These are the common reasons that teams often re-implement notebook code for production. The re-implementation process is time-consuming, tedious, and negates the interactive properties of notebooks.

Those two reasons are why I’ve argued that you should sit down in front of a REPL and figure out what you’re doing with a particular data set.  Once you’ve got it figured out, perform the operations in a notebook for posterity and to replicate your actions later.  I’m curious to see how this gets adopted in practice.

Related Posts

Processing Fixed-Width Files with Spark

Subhasish Guha shows how you can read a fixed-with file with Apache Spark: A fixed width file is a very common flat file format when working with SAP, Mainframe, and Web Logs. Converting the data into a dataframe using metadata is always a challenge for Spark Developers. This particular article talks about all kinds of […]

Read More

Sentiment Analysis with Spark on Qubole

Jonathan Day, et al, have a tutorial on using Qubole to build a sentiment analysis model: This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text […]

Read More

Categories

September 2016
MTWTFSS
« Aug Oct »
 1234
567891011
12131415161718
19202122232425
2627282930