BDD In Spark

Aaron Colcord and Zachary Nanfelt explain how to use Cucumber to create behavior-driven development tests on Apache Spark:

Cucumber allows us to write a portion of our software in a simple, language-based approach that enables all team members to easily read the unit tests. Our focus is on detailing the results we want the system to return. Non-Technical members of the team can easily create, read, and validate the testing of the system.

Often Apache Spark is one component among many in processing data and this can encourage multiple testing frameworks. Cucumber can help us provides a consistent unit testing strategy when the project may extend past Apache Spark for data processing. Instead of mixing the different unit testing strategies between sub-projects, we create one readable agile acceptance framework. This is creating a form of ‘Automated Acceptance Testing’.

Best of all, we are able to create ‘living documentation’ produced during development. Rather than a separate Documentation process, the Unit Tests form a readable document that can be made readable to external parties. Each time the code is updated, the Documentation is updated. It is a true win-win.

It’s an interesting mix.  I’m not the biggest fan of BDD but I’m happy that this information is out there.

Related Posts

A Simple Example With Spark And Kafka

Gary Dusbabek has a nice example showing how to build a simple application with Spark and Kafka: This is a hands-on tutorial that can be followed along by anyone with programming experience. If your programming skills are rusty, or you are technically minded but new to programming, we have done our best to make this […]

Read More

Scaling Out Random Forest

Denis C. Bauer, et al, explain VariantSpark RF, a random forest algorithm designed for huge numbers of variables: VariantSpark RF starts by randomly assigning subsets of the data to Spark Executors for decision tree building (Fig 1). It then calculates the best split over all nodes and trees simultaneously. This implementation avoids communication bottlenecks between Spark […]

Read More


June 2017
« May Jul »