Hunter Kelly walks through a page ranking algorithm:

Once you have the adjacency matrix, you perform some straightforward matrix calculations to produce a vector of Hub scores and a vector of Authority scores, as follows:

- Sum across the columns and normalize; this becomes your Hub vector
- Multiply the Hub vector element-wise across the adjacency matrix
- Sum down the rows and normalize; this becomes your Authority vector
- Multiply the Authority vector element-wise down the adjacency matrix
- Repeat

An important thing to note is that the algorithm is iterative: you perform the steps above until you reach convergence (that is, the vectors stop changing), and you’re done. For our purposes, we just pick a set number of iterations, execute them, and then accept the results from that point. We’re mostly interested in the top entries, and those tend to stabilize pretty quickly.
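To make the steps concrete, here is a minimal Python sketch of the loop described above. It is not the author's implementation. It assumes that `adjacency[i][j] == 1` means page `i` links to page `j`, that "normalize" means scaling a vector so it sums to 1, and that each pass weights the original adjacency matrix, which is the standard hubs-and-authorities (HITS) update. The names `hits`, `normalize`, and `links` are hypothetical, and the four-page graph is made up for illustration. The element-wise multiply and the sum are folded into a single expression per score.

```python
def normalize(values):
    """Scale values so they sum to 1 (the choice of norm is an assumption)."""
    total = sum(values)
    if total == 0:
        return list(values)
    return [v / total for v in values]


def hits(adjacency, iterations=20):
    """Return (hubs, authorities) after a fixed number of iterations.

    adjacency[i][j] == 1 is assumed to mean "page i links to page j".
    """
    n = len(adjacency)
    authorities = [1.0] * n  # start with every page weighted equally
    hubs = [1.0] * n
    for _ in range(iterations):
        # Hub score of page i: sum across row i, weighting each outgoing
        # link by the authority score of the page it points to.
        hubs = normalize([
            sum(adjacency[i][j] * authorities[j] for j in range(n))
            for i in range(n)
        ])
        # Authority score of page j: sum down column j, weighting each
        # incoming link by the hub score of the page it comes from.
        authorities = normalize([
            sum(adjacency[i][j] * hubs[i] for i in range(n))
            for j in range(n)
        ])
    return hubs, authorities


# Hypothetical graph: pages 0 and 1 each link to pages 2 and 3,
# page 2 links to page 3, and page 3 links to nothing.
links = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
hubs, authorities = hits(links)
top_authority = max(range(len(authorities)), key=authorities.__getitem__)
```

Running a fixed number of iterations, as the quoted passage does, keeps the loop simple. A convergence check (stop once neither vector changes by more than a small tolerance) would be a natural extension if exact scores mattered more than the ranking of the top entries.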

This is an architectural-level post, so there’s no code, but there is a useful discussion of the algorithm.

Kevin Feasel

2017-10-20

Architecture, Hadoop, Streaming