Pixie
Pixie: A System for Recommending 3+ Billion Items to200+ Million Users in Real-Time
Main paper can be found here and a more digestable version is presented in a blog format here and here.
Proposal
Provide real-time recommendations to the users via:
- Pixie random walk algorithm to generate recommendations whose runtime is constant wrt the size of the input graph.
- description of the implementation (in production) based on C++ on top of SNAP.
Details
Input to this system is a query set $$Q = {(q, w_q)}$$, which is a collection of pins with associated weights to represent its importance in current query set.
Pixie random walk
- Biased walks: walks are biased for a query set $$Q$$ based on the user. This helps provide personalized recommendations.
- multiple query pins with weights: weights to the input query pins are assigned based on the time since the user interacted with this pin and the interaction type.
- sample walk length: helps provide enough walks for pins with lower degree but also reduces runtime cost for pins with higher degree.
- multi-hit booster: weightage given to the candidate pins which have high visit counts for all the pins in the $$Q$$. $$V[p] = (\sum_{q \epsilon Q}\sqrt{V_q[p]})^2$$ where $$V_q$$ is visit count for the query $$q \epsilon Q$$.
- early stopping: stop the random walk when the top visited candidates have crossed certain threshold.
Computation of walk length for a given pin is as follows:
$$s_q = |E(q)|.(C - log|E(q)|)$$
$$Nq = N w_q \frac{s_q}{\sum_{r \epsilon Q} s_r}$$
- $$Nq$$ = walk length for random walk starting from pin q
- $$|E(q)|$$ = degree of pin q
- C = max degree of all pins
Algorithm for a single query is as follows:
PixieRandomWalk(q, E, U, alpha, Nq, nv, np):
totSteps = nHighVisited = 0
V = {0}
do
currPin = q
currSteps = SampleWalkLength(alpha)
for i in range(currSteps):
currBoard = E(currPin)[PersonalizedNeighbor(E, U)]
currPin = E(currBoard)[PersonalizedNeighbor(E, U)]
V[currPin]++
if V[currPin] == nv:
nHighVisited++
totSteps += currSteps
while totSteps < Nq and nHighVisited < np
- q = query pin
- E = edges in the graph
- U = user features
- alpha = helps determine the number of steps per random walk
- PersonalizedNeighbor(E, U) = randomized method to return neighbors, but edges being weighted by user features
- Nq = total number of steps taken in the random walk
- nv and np = used for early stopping. the random walk is terminated when there are np number of pins that have been visited atleast nv number of times
graph pruning
Prunes the graph to fit in RAM and thus do not have to distribute it.
- Removes duplicate pins
- Removes pins assigned to incorrect boards
- Removes boards with larger entropy: computes LDA on each pin description and computes entropy for a board by combining LDA vectors from all pins associated with this board.
- Prune the degree (or remove) pins with abnormally high degree: only keep the edges with highest cosine similarity. Threshold is determined by the pruning factor $$\delta$$.