The Power Of Resilient Distributed Datasets

Ramandeep Kaur explains just how powerful Resilient Distributed Datasets are:

A fault-tolerant collection of elements that can be operated on in parallel:  “Resilient Distributed Dataset” a.k.a. RDD

RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an immutable collection of objects that is computed on different nodes of the cluster. Each dataset in a Spark RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster.
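To make the partitioning idea concrete, here is a minimal sketch in plain Python. It is not the Spark API: the helpers `partition` and `map_partitions` are hypothetical names made up for illustration, and threads stand in for cluster nodes. It only shows an immutable collection split into partitions, with the same function applied to each partition in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split data into contiguous, immutable chunks (tuples).

    Hypothetical helper: a stand-in for how a dataset might be
    logically partitioned across servers.
    """
    data = tuple(data)
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def map_partitions(parts, fn):
    """Apply fn to every element, one worker thread per partition.

    Threads are an assumption standing in for cluster nodes. The input
    partitions are never modified; new tuples are returned instead.
    """
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return list(pool.map(lambda p: tuple(fn(x) for x in p), parts))


parts = partition(range(10), 3)
squared = map_partitions(parts, lambda x: x * x)
print(parts)
print(squared)
```

Because each partition is a tuple, the original data cannot be changed in place, which mirrors the immutability described above.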

RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state.

Coarse-grained transformations are those that are applied over an entire dataset. A fine-grained transaction, on the other hand, is one applied to a smaller set, perhaps a single row. With fine-grained transactions you have to save every update, which can be costlier, but they are more flexible than coarse-grained ones.
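The distinction can be sketched with a toy class in plain Python. This is an illustration under stated assumptions, not Spark's implementation: `ToyRDD` and its methods are hypothetical. Each transformation applies to the whole dataset and is recorded as a single step, so the result can be rebuilt from the source by replaying those steps rather than by saving individual row updates.

```python
class ToyRDD:
    """Hypothetical, simplified stand-in for an RDD (not the Spark API).

    It holds an immutable source plus the ordered list of whole-dataset
    transformations needed to derive the current data from that source.
    """

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # never modified after creation
        self._lineage = tuple(lineage)  # recorded coarse-grained steps

    def map(self, fn):
        # Returns a new ToyRDD; the original is left untouched.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Replay every recorded step over the entire dataset.
        items = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


base = ToyRDD(range(1, 7))
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(base.collect())
print(evens_squared.collect())
```

Only two steps are recorded here no matter how many elements the dataset holds, whereas a fine-grained design would need to track each individual update.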

Read on for more about the fundamental data structure in Spark.


August 2018