Understanding Spark APIs

Kevin Feasel

2016-07-19

Spark

Jules Damji explains when to use RDDs, when to use DataFrames, and when to use Datasets in Spark:

Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, data is organized into named columns, like a table in a relational database. Designed to make large data sets processing even easier, DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction; it provides a domain specific language API to manipulate your distributed data; and makes Spark accessible to a wider audience, beyond specialized data engineers.

With Spark 2.0, the balance moves in favor of the more structured data types.  What’s old is new; what’s unstructured is structured…

Related Posts

Databricks Runtime 5.4

Todd Greenstein announces Databricks Runtime 5.4: We’ve partnered with the Data Services team at Amazon to bring the Glue Catalog to Databricks.   Databricks Runtime can now use Glue as a drop-in replacement for the Hive metastore. This provides several immediate benefits:– Simplifies manageability by using the same glue catalog across multiple Databricks workspaces.– Simplifies integrated […]

Read More

When Not to Use Spark

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark: There can be use cases where Spark would be the inevitable choice. Spark considered being an excellent tool for use cases like ETL of a large amount of a dataset, analyzing a large set of data files, Machine learning, and […]

Read More

Categories

July 2016
MTWTFSS
« Jun Aug »
 123
45678910
11121314151617
18192021222324
25262728293031