Anmol Sarna walks us through some of the basics of Resilient Distributed Datasets in Apache Spark:
-
Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to recompute missing or damaged partitions due to node failures.
-
Distributed with data residing on multiple nodes in a cluster.
-
Dataset is a collection of partitioned data.
Now we know what RDD stands for. Now let’s try to understand it.
It’s a nice intro to the topic. And even though there are other data models which sit on top of RDDs to make life easier for developers, it’s still important to understand the core model in Spark.