Ranking Data in Spark – Curated SQL

Landon Robinson continues the Spark Starter Guide:

Ranking is, fundamentally, ordering based on a condition. So, in essence, it’s like a combination of a where clause and order by clause, the exception being that data is not removed through ranking; it is, well, ranked instead. While ordering allows you to sort data based on a column, ranking allows you to allocate a number (e.g. row number or rank) to each row (based on a column or condition) so that you can utilize it in logical decision making, like selecting a top result, or applying further transformations.
One very common ranking function is row_number(), which allows you to assign a unique value or “rank” to each row or rows within a grouping based on a specification. That specification, at least in Spark, is controlled by partitioning and ordering a dataset. The result allows you, for example, to achieve “top n” analysis in Spark.
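
To make the “top n” idea concrete, here is a minimal sketch of the partition-then-order pattern. It is not code from the guide: it runs the SQL through Python's built-in sqlite3 module as a stand-in for a Spark session, and the `sales` table, its columns, and its rows are hypothetical. The `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` expression is standard window syntax that Spark SQL also accepts.

```python
import sqlite3

# Hypothetical sample data: (region, rep, amount). Not from the original post.
sales_rows = [
    ("east", "ann", 300),
    ("east", "bob", 200),
    ("east", "cat", 100),
    ("west", "dan", 500),
    ("west", "eve", 400),
]

# In-memory SQLite database used as a stand-in for Spark SQL.
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sales_rows)

# Number the rows within each region from highest to lowest amount,
# then keep only the first two rows per region ("top 2 per group").
top_n_sql = """
SELECT region, rep, amount, rn
FROM (
    SELECT region, rep, amount,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn
    FROM sales
) AS numbered
WHERE rn <= 2
ORDER BY region, rn
"""
top_two = conn.execute(top_n_sql).fetchall()
for row in top_two:
    print(row)
```

The filter on `rn` sits in an outer query because a window function's result cannot be referenced in the `WHERE` clause of the same `SELECT`.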

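One minor adjustment I’d make is not calling the output of ROW_NUMBER() “Rank”, because that makes me think of the output of the RANK() window function. In the event of ties, those two outputs will differ: ROW_NUMBER() still hands out distinct consecutive numbers, while RANK() gives tied rows the same value and then skips ahead.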
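
A small sketch of that difference, again using Python's built-in sqlite3 module as a stand-in for Spark SQL (the `scores` table and its rows are made up for illustration). Both window functions are computed over the same ordering, with two players tied on score.

```python
import sqlite3

# Hypothetical sample data with a tie at score 80. Not from the original post.
score_rows = [("a", 90), ("b", 80), ("c", 80), ("d", 70)]

# In-memory SQLite database used as a stand-in for Spark SQL.
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", score_rows)

# Compute ROW_NUMBER() and RANK() over the same descending order.
tie_sql = """
SELECT player,
       score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
       RANK()       OVER (ORDER BY score DESC) AS rnk
FROM scores
ORDER BY row_num
"""
tie_result = conn.execute(tie_sql).fetchall()

# Pull out the two numbering columns for comparison.
row_numbers = [r[2] for r in tie_result]
ranks = [r[3] for r in tie_result]
print(row_numbers)
print(ranks)
```

Which of the two tied players receives row number 2 and which receives 3 is not determined by this ordering, so anything that depends on it should add a tie-breaking column to the `ORDER BY`.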
