Spark RDDs and DataFrames

Ayush Hooda explains the difference between RDDs and DataFrames:

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.

One use of Spark SQL is to execute SQL queries. When running SQL from within another programming language the results will be returned as a Dataset/DataFrame.

Before exploring these APIs, let’s understand the need for these APIs.

I like the point that RDDs are better at expressing the how of a computation than the what.