Differences in Spark RDDs and DataSets

Brad Llewellyn looks at some of the differences between RDDs and DataSets in Spark:

We see that there are some differences between filtering RDDs, Data Frames and Datasets. The first major difference is the same one we keep seeing, RDDs reference by indices instead of column names. There’s also an interesting difference of using 2 =’s vs 3 =’s for equality operators. Simply put, “==” tries to directly equate two objects, whereas “===” tries to dynamically define what “equality” means. In the case of filter(), it’s typically used to determine whether the value in one column (income, in our case) is equal to the value of another column (string literal “<=50K”, in our case). In other words, if you want to compare values in one column to values in another column, “===” is the way to go.

Interestingly, there was another difference caused by the way we imported our data. Since we custom-built our RDD parsing algorithm to use <COMMA><SPACE> as the delimiter, we don’t need to trim our RDD values. However, we used the built-in sqlContext.read.csv() function for the Data Frame and Dataset, which doesn’t trim by default. So, we used the ltrim() function to remove the leading whitespace. This function can be imported from the org.apache.spark.sql.functions library.

Read on for more, including quite a few code samples.

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31