Working With Key-Value Pairs In Spark

Teena Vashist shows us a few of the functions available with Spark for working with key-value pairs:

1. Creating Key/Value Pair RDD: 
A pair RDD arranges the data of each row into two parts: the first part is the key and the second part is the value. In the example below, I used the parallelize method to create an RDD, and then I used the length method inside a map to create a pair RDD. The key is the length of each word and the value is the word itself.

scala> val rdd = sc.parallelize(List("hello","world","good","morning"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val pairRdd = rdd.map(a => (a.length,a))
pairRdd: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at map at <console>:26
scala> pairRdd.collect().foreach(println)
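
For readers who want to try this outside the shell, here is a minimal standalone sketch of the same pairing step, followed by one common key-based operation. It assumes Spark is on the classpath and runs in local mode. The object name PairRddSketch, the helper wordsByLength, and the choice of groupByKey are illustrative assumptions, not taken from the linked post.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical names: PairRddSketch and wordsByLength are for illustration only.
object PairRddSketch {

  // Builds (length, word) pairs as in the shell session, then groups
  // the words that share a length. Values are sorted so the result is stable.
  def wordsByLength(spark: SparkSession, words: Seq[String]): Map[Int, List[String]] = {
    val rdd = spark.sparkContext.parallelize(words)
    val pairRdd = rdd.map(w => (w.length, w))   // key = word length, value = word
    pairRdd
      .groupByKey()                             // RDD[(Int, Iterable[String])]
      .mapValues(_.toList.sorted)
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption made so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pair-rdd-sketch")
      .getOrCreate()
    try {
      val grouped = wordsByLength(spark, Seq("hello", "world", "good", "morning"))
      grouped.toSeq.sortBy(_._1).foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

With the four sample words, "hello" and "world" should end up under key 5, "good" under key 4, and "morning" under key 7. For aggregations over large data, reduceByKey is usually preferred to groupByKey because it combines values before the shuffle.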

Click through for more operations. Spark is a bit less KV-centric than classic MapReduce jobs, but there are still plenty of places where you want to use key-value pairs.

