Understanding Spark APIs

Kevin Feasel

2016-07-19

Spark

Jules Damji explains when to use RDDs, when to use DataFrames, and when to use Datasets in Spark:

Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, data is organized into named columns, like a table in a relational database. Designed to make large data sets processing even easier, DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction; it provides a domain specific language API to manipulate your distributed data; and makes Spark accessible to a wider audience, beyond specialized data engineers.

With Spark 2.0, the balance moves in favor of the more structured data types. What’s old is new; what’s unstructured is structured…
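
To make the contrast concrete, here is a minimal Scala sketch of the same aggregation done against an RDD and a Dataset. The Person case class and the sample rows are hypothetical, purely for illustration; the point is that the RDD lambda is opaque to Spark, while the Dataset column expression is visible to the Catalyst optimizer.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Hypothetical record type for illustration; not from the quoted article.
case class Person(name: String, age: Int)

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ApiComparison")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 34), Person("Bo", 29), Person("Cy", 41))

    // RDD: Spark sees opaque JVM objects, so the lambda below is a black box
    // to the engine and cannot be optimized.
    val rdd = spark.sparkContext.parallelize(people)
    val rddMeanAge = rdd.map(_.age).mean()

    // Dataset: the same data, but with a schema Spark understands. The avg()
    // column expression can be inspected and planned by the optimizer.
    val ds = people.toDS()
    val dsMeanAge = ds.agg(avg($"age")).first().getDouble(0)

    println(s"RDD mean age: $rddMeanAge; Dataset mean age: $dsMeanAge")
    spark.stop()
  }
}

Both paths compute the same average, but the Dataset version also gives compile-time type checking on Person while still letting Spark optimize the physical plan, which is the balance the article argues for.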

