Page Ranking With Kafka Streams

Hunter Kelly walks through a page ranking algorithm:

Once you have the adjacency matrix, you perform some straightforward matrix calculations to calculate a vector of Hub scores and a vector of Authority scores as follows:

  • Sum across the columns and normalize, this becomes your Hub vector
  • Multiply the Hub vector element-wise across the adjacency matrix
  • Sum down the rows and normalize, this becomes your Authority vector
  • Multiply the Authority vector element-wise down the the adjacency matrix
  • Repeat

An important thing to note is that the algorithm is iterative: you perform the steps above until  eventually you reach convergence—that is, the vectors stop changing—and you’re done. For our purposes, we just pick a set number of iterations, execute them, and then accept the results from that point.  We’re mostly interested in the top entries, and those tend to stabilize pretty quickly.

This is an architectural-level post, so there’s no code but there is a useful discussion of the algorithm.

Related Posts

Testing an Event-Driven System

Andy Chambers takes us through how to test an event-driven system: Each distinct service has a nice, pure data model with extensive unit tests, but now with new clients (and consequently new requirements) coming thick and fast, the number of these services is rapidly increasing. The testing guardian angel who sometimes visits your thoughts during […]

Read More

Processing Fixed-Width Files with Spark

Subhasish Guha shows how you can read a fixed-with file with Apache Spark: A fixed width file is a very common flat file format when working with SAP, Mainframe, and Web Logs. Converting the data into a dataframe using metadata is always a challenge for Spark Developers. This particular article talks about all kinds of […]

Read More

Categories

October 2017
MTWTFSS
« Sep Nov »
 1
2345678
9101112131415
16171819202122
23242526272829
3031