The Basics Of RDDs In Apache Spark

Anmol Sarna walks us through some of the basics of Resilient Distributed Datasets in Apache Spark:

  • Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to recompute missing or damaged partitions due to node failures.

  • Distributed with data residing on multiple nodes in a cluster.

  • Dataset is a collection of partitioned data.

Now we know what RDD stands for. Now let’s try to understand it.

It’s a nice intro to the topic.  And even though there are other data models which sit on top of RDDs to make life easier for developers, it’s still important to understand the core model in Spark.

Related Posts

Databricks Runtime 5.2 Released

Nakul Jamadagni announces Databricks Runtime 5.2: Delta Time TravelTime Travel, released as an Experimental feature, adds the ability to query a snapshot of a table using a timestamp string or a version, using SQL syntax as well as DataFrameReader options for timestamp expressions.Sample codeSELECT count() FROM events TIMESTAMP AS OF timestamp_expressionSELECT count() FROM events VERSION AS OF version Time travel looks a bit like temporal tables in SQL Server.

Read More

Kafka And The Differing Aims Of Data Professionals

Kai Waehner argues that there is an impedence mismatch between data engineers, data scientists, and ML production engineers: Data scientists love Python, period. Therefore, the majority of machine learning/deep learning frameworks focus on Python APIs. Both the stablest and most cutting edge APIs, as well as the majority of examples and tutorials use Python APIs. […]

Read More

Categories

July 2018
MTWTFSS
« Jun Aug »
 1
2345678
9101112131415
16171819202122
23242526272829
3031