BDD In Spark

Aaron Colcord and Zachary Nanfelt explain how to use Cucumber to create behavior-driven development tests on Apache Spark:

Cucumber allows us to write a portion of our software in a simple, language-based approach that enables all team members to easily read the unit tests. Our focus is on detailing the results we want the system to return. Non-technical members of the team can easily create, read, and validate the testing of the system.

Often, Apache Spark is one component among many in processing data, and this can encourage multiple testing frameworks. Cucumber can help us provide a consistent unit testing strategy when the project may extend past Apache Spark for data processing. Instead of mixing the different unit testing strategies between sub-projects, we create one readable agile acceptance framework. This creates a form of ‘Automated Acceptance Testing’.

Best of all, we are able to create ‘living documentation’ produced during development. Rather than a separate documentation process, the unit tests form a readable document that can be shared with external parties. Each time the code is updated, the documentation is updated. It is a true win-win.

It’s an interesting mix. I’m not the biggest fan of BDD, but I’m happy that this information is out there.
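
To give a rough sense of the idea in the quoted passage, here is a minimal sketch of how plain-language Given/When/Then lines can be bound to test code. This is not Cucumber's API and not the authors' code: the `step` decorator, the `run_scenario` helper, and the order data are all hypothetical, and the "When" step uses plain Python where a real project would presumably run a Spark job.

```python
import re

# Hypothetical mini step registry (NOT Cucumber's API): each entry pairs a
# sentence pattern with the function that implements that sentence.
STEPS = []

def step(pattern):
    """Register a function to handle scenario sentences matching `pattern`."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register

@step(r"the orders (.+)")
def given_orders(ctx, raw):
    # Parse "alice:10, bob:5" into (customer, amount) pairs.
    ctx["orders"] = [
        (name.strip(), int(amount))
        for name, amount in (item.split(":") for item in raw.split(","))
    ]

@step(r"I total the orders by customer")
def when_total(ctx):
    # Assumption: in a Spark project this step would call the job under test.
    totals = {}
    for name, amount in ctx["orders"]:
        totals[name] = totals.get(name, 0) + amount
    ctx["totals"] = totals

@step(r"(\w+) should have a total of (\d+)")
def then_total(ctx, name, expected):
    assert ctx["totals"][name] == int(expected), ctx["totals"]

def run_scenario(text):
    """Run each scenario line against the first registered step that matches."""
    ctx = {}
    for line in text.strip().splitlines():
        # Drop the leading keyword (Given / When / Then / And).
        _keyword, _, sentence = line.strip().partition(" ")
        for pattern, func in STEPS:
            match = pattern.fullmatch(sentence)
            if match:
                func(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step matches: {sentence!r}")
    return ctx

# The readable part: a scenario a non-technical team member could review.
SCENARIO = """
Given the orders alice:10, bob:5, alice:7
When I total the orders by customer
Then alice should have a total of 17
And bob should have a total of 5
"""

result = run_scenario(SCENARIO)
```

The scenario text is the part the whole team reads; the decorated functions are the part developers maintain. Swapping the body of the "When" step for a different engine would leave the scenario untouched, which is the consistency argument the excerpt makes.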

