Speed Up Your Spark Queries

Kevin Feasel

2017-12-22

R, Spark

John Mount has some good advice for R users running Spark queries:

For some time we have been teaching R users “when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis.”

The idea behind the advice is: working with fewer columns makes for quicker queries.

The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are “denormalized marts” that are used to drive many different projects. For any one project only a small subset of the columns may be relevant in a calculation.

Some wonder is this really an issue or is it something one can ignore in the hope the downstream query optimizer fixes the problem. In this note we will show the effect is real.

This is good advice for more than just dealing with R on Spark.
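As a rough sketch of what "narrow early" looks like from R, here is a minimal sparklyr example; the connection settings, the table name wide_mart, and its columns are hypothetical, and the point is simply that the select() happens before any aggregation or collect():

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (swap in your own master/config).
sc <- spark_connect(master = "local")

# Reference a (hypothetical) wide table already registered in Spark.
wide_tbl <- tbl(sc, "wide_mart")

# Narrow to the handful of columns the analysis needs as early as possible,
# so later stages of the query move far less data.
result <- wide_tbl %>%
  select(customer_id, order_date, order_total) %>%
  group_by(customer_id) %>%
  summarise(total_spend = sum(order_total, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)

The same habit pays off against relational databases through dbplyr, since the generated SQL then only touches the columns you selected.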

