Jules Damji explains when to use RDDs, when to use DataFrames, and when to use Datasets in Spark:
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, data is organized into named columns, like a table in a relational database. Designed to make large data sets processing even easier, DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction; it provides a domain specific language API to manipulate your distributed data; and makes Spark accessible to a wider audience, beyond specialized data engineers.
With Spark 2.0, the balance moves in favor of the more structured data types. What’s old is new; what’s unstructured is structured…
Comments closed