GraphSAGE
Proposal
Main paper ("Inductive Representation Learning on Large Graphs", Hamilton et al., 2017) can be found here. SAGE = SAmple and aggreGatE
- Learns parametric functions that generate low-dimensional embeddings of the nodes in a graph
- These functions can also be used to generate embeddings for unseen nodes
- Can optionally use node features as starting points for the embeddings
- Configurable number of layers representing the number of hops for each node
- Custom loss function (based on negative-sampling) for unsupervised learning
- Also supports supervised learning
- Scales to large graphs through neighborhood sampling and minibatch techniques
forward propagation algo
- $$h_v^0 = x_v$$
- for $$k = 1 \dots K$$:
  - for $$v$$ in minibatch:
    - $$h_{N(v)}^k = \text{aggregator}_k(\{h_u^{k-1}, \forall u \in N(v)\})$$
    - $$h_v^k = \sigma(W^k \cdot \text{concat}(h_v^{k-1}, h_{N(v)}^k))$$
    - $$h_v^k = \frac{h_v^k}{\lVert h_v^k \rVert_2}$$
- for $$v$$ in minibatch:
  - $$z_v = h_v^K$$
Where:
- $$h_v^k$$ = hidden vector of node $$v$$ at layer $$k$$
- $$V$$ = set of nodes in the graph
- $$N(v)$$ = sampled neighborhood for each node
- $$K$$ = number of layers (hyperparameter)
- $$\sigma$$ = a non-linear activation function
- $$W^k$$ = learnable weights for each layer
- $$z$$ = output embedding
In the paper, the authors show the similarity of this algorithm to the popular Weisfeiler-Lehman graph isomorphism test.
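A minimal NumPy sketch of this forward pass, assuming a dict-of-lists adjacency structure and a mean aggregator (all function and variable names here are illustrative, not taken from the paper's code):

```python
import numpy as np

def l2_normalize(h, eps=1e-12):
    return h / (np.linalg.norm(h) + eps)

def graphsage_forward(x, neighbors, W, sample_sizes, rng, sigma=np.tanh):
    """x: dict node -> input feature vector (h^0)
    neighbors: dict node -> list of neighbor ids
    W: list of K weight matrices, W[k] with shape (d_out, 2 * d_in)
    sample_sizes: per-layer neighborhood sample sizes S_1 .. S_K"""
    h = dict(x)                                              # h_v^0 = x_v
    for Wk, Sk in zip(W, sample_sizes):                      # k = 1 .. K
        h_next = {}
        for v in h:
            # uniformly sample a fixed-size neighborhood N(v)
            nbrs = rng.choice(neighbors[v], size=Sk, replace=True)
            h_nv = np.mean([h[u] for u in nbrs], axis=0)     # aggregator_k (mean of sampled neighbors)
            h_cat = np.concatenate([h[v], h_nv])             # concat(h_v^{k-1}, h_{N(v)}^k)
            h_next[v] = l2_normalize(sigma(Wk @ h_cat))      # sigma(W^k ...) followed by L2 normalization
        h = h_next
    return h                                                 # z_v = h_v^K

# toy usage with K = 2 layers
rng = np.random.default_rng(0)
x = {v: rng.normal(size=8) for v in range(4)}
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
W = [rng.normal(size=(16, 16)), rng.normal(size=(16, 32))]
z = graphsage_forward(x, neighbors, W, sample_sizes=[2, 2], rng=rng)
```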
neighborhood sampling
- performs uniform sampling of a fixed-size set of neighbors instead of using full neighborhoods (sketched below)
- a different sample is drawn at each layer, with a per-layer sample size $$S_i$$
- the authors recommend keeping $$\prod_i S_i \le 500$$
- sampling technique is pretty much the same as seen in [PinSage]({{ site.baseurl }}{% post_url 2020-06-10-pinsage %})
- They recommend $$K = 2, S_1 = 25, S_2 = 10$$
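A small sketch of the fixed-size uniform sampling step, assuming the same dict-of-lists adjacency as above (names are illustrative):

```python
import numpy as np

def sample_neighborhood(adj, v, sample_size, rng):
    """Uniformly sample a fixed number of neighbors of v,
    sampling with replacement when v has fewer neighbors than requested."""
    nbrs = adj[v]
    return rng.choice(nbrs, size=sample_size, replace=len(nbrs) < sample_size)

rng = np.random.default_rng(0)
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
# With the recommended K = 2, S_1 = 25, S_2 = 10, each batch node expands into
# at most 25 * 10 = 250 sampled nodes, within the prod_i S_i <= 500 budget.
print(sample_neighborhood(adj, 0, 2, rng))
```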
learning
- trained with minibatch stochastic gradient descent, using the Adam optimizer
- loss function for unsupervised mode is based on negative sampling and its equation is as follows: $$J(z_u) = -\log(\text{sigmoid}(z_u^T z_v)) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)}[\log(\text{sigmoid}(-z_u^T z_{v_n}))]$$
- $$Q$$ = number of negative samples
- $$v_n$$ = negative sample, drawn from a negative-sampling distribution $$P_n(v)$$
- $$v$$ = node that co-occurs near $$u$$ (e.g. on a fixed-length random walk), i.e. a positive example
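A sketch of this loss in NumPy, assuming the expectation over negatives is approximated by averaging the $$Q$$ drawn samples (names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(z_u, z_v, z_neg):
    """z_u, z_v: embeddings of node u and of a co-occurring (positive) node v
    z_neg: (Q, d) array with the embeddings of Q negative samples drawn from P_n(v)"""
    pos_term = -np.log(sigmoid(z_u @ z_v))                   # pull co-occurring nodes together
    Q = z_neg.shape[0]
    neg_term = -Q * np.mean(np.log(sigmoid(-z_neg @ z_u)))   # push negative samples apart
    return pos_term + neg_term

rng = np.random.default_rng(0)
print(unsupervised_loss(rng.normal(size=16), rng.normal(size=16), rng.normal(size=(5, 16))))
```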
aggregators
These must be permutation invariant, since a node's neighbors have no natural ordering. The authors test the following aggregators:
mean aggregator
- $$h_v^k = \sigma(W^k \cdot \text{mean}(\{h_v^{k-1}\} \cup \{h_u^{k-1}, \forall u \in N(v)\}))$$
- note that this folds $$h_v^{k-1}$$ into the mean itself and drops the concat step of the general algorithm
- the paper calls this the "convolutional" variant; the concatenation it omits can be viewed as a simple form of "skip connection" between layers
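A sketch of a single mean-aggregator layer under these assumptions (NumPy; function and variable names are illustrative):

```python
import numpy as np

def mean_aggregator_layer(h_v, h_neighbors, W, sigma=np.tanh):
    """h_v: (d,) previous-layer vector of node v
    h_neighbors: (S, d) previous-layer vectors of the sampled neighbors N(v)
    W: (d_out, d) weight matrix for this layer"""
    # h_v^{k-1} is folded into the mean itself; there is no separate concat step
    stacked = np.vstack([h_v[None, :], h_neighbors])
    return sigma(W @ stacked.mean(axis=0))
```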
LSTM aggregator
- LSTMs are not permutation invariant
- the authors adapt them to unordered neighbor sets by applying the LSTM to a random permutation of the node's neighbors (see the sketch below)
- this seems to show the best accuracy in their experimentation
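A sketch of the random-permutation idea, assuming a PyTorch LSTM as the aggregator (module and names here are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn

d = 16
lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

def lstm_aggregate(h_neighbors):
    """h_neighbors: (S, d) tensor of previous-layer neighbor vectors."""
    perm = torch.randperm(h_neighbors.shape[0])          # random permutation of the neighbor set
    _, (h_n, _) = lstm(h_neighbors[perm].unsqueeze(0))   # run the LSTM over the shuffled sequence
    return h_n[-1, 0]                                    # final hidden state as the aggregate
```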
pooling aggregator
- $$h_{N(v)}^k = \max_{u \in N(v)}(\sigma(W_{pool}^k h_u^k + b^k))$$
- aggregator can be either max or mean, both seem to provide similar accuracy
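A sketch of the pooling aggregator under these assumptions (NumPy; $$W_{pool}$$, $$b$$ and function names are illustrative):

```python
import numpy as np

def pooling_aggregate(h_neighbors, W_pool, b, sigma=np.tanh, pool="max"):
    """h_neighbors: (S, d) neighbor vectors for node v
    W_pool: (d_pool, d) weights and b: (d_pool,) bias of the per-neighbor MLP"""
    transformed = sigma(h_neighbors @ W_pool.T + b)     # sigma(W_pool h_u + b) for every neighbor
    if pool == "max":
        return transformed.max(axis=0)                  # element-wise max over the neighborhood
    return transformed.mean(axis=0)                     # mean pooling performs similarly
```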