Warning When Using dplyr Mutate

Kevin Feasel


R, Spark

John Mount has a warning if you are using dplyr’s mutate function and connecting to Spark or a database:

If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you inspect your code to ensure you are not using any values created inside a dplyr::mutate() statement inside the same dplyr::mutate() statement. This has been my coding advice for some time, and it is a simple and safe re-factoring to break up such statements into safer sequences (simply by introducing more dplyr::mutate()s).

I have since encountered a non-signaling (or silent) result corruption version of the issue. We are now advising code inspection as we now have confirmation that not seeing a thrown error is not a reliable indication of correct execution and correct results.
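As a minimal sketch of the refactoring John describes, the snippet below uses a small local data frame; the column names `x`, `y`, and `z` are made up for illustration. On a local data frame both forms return the same result. The warning concerns the same pattern when the pipeline runs against a database or Spark table, where the first form is the one to inspect and split.

```r
library(dplyr)

# Hypothetical example data; any table with a numeric column would do.
d <- data.frame(x = c(1, 2, 3))

# Pattern the warning is about: y is created and then used
# inside the same mutate() call.
risky <- d %>%
  mutate(y = x + 1, z = y * 2)

# Suggested refactoring: introduce a separate mutate() for each
# step that depends on a value created in an earlier step.
safer <- d %>%
  mutate(y = x + 1) %>%
  mutate(z = y * 2)
```

The split version costs one extra line and leaves each `mutate()` depending only on columns that already exist when it starts.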

Thanks to John for reporting, and hopefully the dplyr team can fix it.

Related Posts

Databricks Runtime 5.2 Released

Nakul Jamadagni announces Databricks Runtime 5.2:

Delta Time Travel

Time Travel, released as an Experimental feature, adds the ability to query a snapshot of a table using a timestamp string or a version, using SQL syntax as well as DataFrameReader options for timestamp expressions.

Sample code

SELECT count(*) FROM events TIMESTAMP AS OF timestamp_expression
SELECT count(*) FROM events VERSION AS OF version

Time travel looks a bit like temporal tables in SQL Server.


Native Math Libraries And Spark ML

Zuling Kang shares with us how we can use native math libraries in netlib-java to speed up certain machine learning algorithms in Apache Spark:

Spark’s MLlib uses the Breeze linear algebra package, which depends on netlib-java for optimized numerical processing. netlib-java is a wrapper for low-level BLAS, LAPACK, and ARPACK libraries. However, due to licensing issues with runtime proprietary binaries, neither the Cloudera distribution of […]


