Teena Vashist shows us a few of the functions available with Spark for working with key-value pairs:
1. Creating a Key/Value Pair RDD:
A pair RDD arranges each element into two parts. The first part is the key and the second part is the value. In the example below, I used the parallelize method to create an RDD, and then mapped each word using the length method to create a pair RDD. The key is the length of each word and the value is the word itself.
scala> val rdd = sc.parallelize(List("hello","world","good","morning"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val pairRdd = rdd.map(a => (a.length,a))
pairRdd: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at map at <console>:26

scala> pairRdd.collect().foreach(println)
(5,hello)
(5,world)
(4,good)
(7,morning)
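To give a taste of the kinds of key-value operations the article walks through, here's a minimal sketch of our own (not from the post itself), assuming the same spark-shell session and pairRdd as above; note that the order of collected results may vary from run to run:

scala> // Gather all words of the same length under a single key
scala> pairRdd.groupByKey().collect().foreach(println)
(4,CompactBuffer(good))
(5,CompactBuffer(hello, world))
(7,CompactBuffer(morning))

scala> // Count how many words have each length
scala> pairRdd.mapValues(_ => 1).reduceByKey(_ + _).collect().foreach(println)
(4,1)
(5,2)
(7,1)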
Click through for more operations. Spark is a bit less KV-centric than classic MapReduce jobs, but there are still plenty of places where you want to use them.