Speed Up Your Spark Queries

Kevin Feasel


R, Spark

John Mount has some good advice for R users running Spark queries:

For some time we have been teaching R users “when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis.”

The idea behind the advice is: working with fewer columns makes for quicker queries.

The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are “denormalized marts” that are used to drive many different projects. For any one project only a small subset of the columns may be relevant in a calculation.

Some wonder is this really an issue or is it something one can ignore in the hope the downstream query optimizer fixes the problem. In this note we will show the effect is real.

This is good advice for more than just dealing with R on Spark.

Related Posts

R 3.5.0 Released

Tal Galili announces that R 3.5.0 is now available: By default the (arbitrary) signs of the loadings from princomp() are chosen so the first element is non-negative. If –default-packages is not used, then Rscript now checks the environment variable R_SCRIPT_DEFAULT_PACKAGES. If this is set, then it takes precedence over R_DEFAULT_PACKAGES. If default packages are not specified on the command line or by one […]

Read More

Issues Starting ML Services

Jen Stirrup has a quick rundown of some reasons why Machine Learning Services might give you an error when you try to start it up: Msg 39023, Level 16, State 1, Procedure sp_execute_external_script, Line 1 [Batch Start Line 3] ‘sp_execute_external_script’ is disabled on this instance of SQL Server. Use sp_configure ‘external scripts enabled’ to enable […]

Read More


December 2017
« Nov Jan »