Heterogeneous Graph Transformer
Proposal
Main paper: [Heterogeneous Graph Transformer (Hu et al., WWW 2020)](https://arxiv.org/abs/2003.01332).
- uses meta-relations over node and edge types to model heterogeneous graphs
- proposes the HGT network for learning on such graphs
- proposes HGSampling, a sampling algorithm tailor-made for this network, for minibatch training
- proposes Relative Temporal Encoding (RTE) to handle dynamic graphs
Summary
Heterogeneous mutual attention

$$H^{(l)}(t) = Agg_{s \in N(t)}(Att(s, e, t) \cdot Msg(s, e, t))$$
where:
- $$t$$ = target node
- $$s$$ = source node
- $$l$$ = index of the current layer
- $$H^{(l)}$$ = output node embeddings of the $$l$$-th layer
- $$Agg$$ = heterogeneous aggregation function, $$Agg(.) = \sigma(Mean(.))$$
- $$N(t)$$ = neighborhood of the target node
- $$Att$$ = heterogeneous edge-wise attention weights
- $$Msg$$ = heterogeneous message passing from the source nodes, e.g. $$Msg(s) = W H^{(l-1)}(s)$$
- $$L$$ = number of layers in the network
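A minimal sketch of this update for a single target node, with `att` and `msg` as stand-in callables and $$\sigma$$ taken as a sigmoid for concreteness (all names here are illustrative, not from the paper's code):

```python
import torch

def hgt_layer_update(neighbors, att, msg):
    """One heterogeneous update for a single target node t.

    neighbors: list of (source_embedding, edge_type) pairs covering N(t)
    att: callable (s, e) -> scalar attention weight Att(s, e, t)
    msg: callable (s, e) -> message vector Msg(s, e, t)
    """
    weighted = torch.stack([att(s, e) * msg(s, e) for s, e in neighbors])
    # Agg(.) = sigma(Mean(.)), matching the definition above
    return torch.sigmoid(weighted.mean(dim=0))
```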
Heterogeneous attention

- $$Att(s, e, t) = softmax_{s \in N(t)}(concat_{i \in [1,h]}(AttHead^i(s, e, t)))$$
- $$h$$ = number of attention heads
- $$AttHead^i(s, e, t) = (K^i(s) \, W^{ATT}_{\phi(e)} \, Q^i(t)^T) \cdot \frac{\mu_{\langle \tau(s), \phi(e), \tau(t) \rangle}}{\sqrt{d}}$$
- $$\phi$$ = edge types
- $$\tau$$ = node types
- $$\mu$$ = learnable significance weights for each meta-relation triplet $$\langle \tau(s), \phi(e), \tau(t) \rangle$$
- $$d$$ = dimensionality of each attention head's key/query vectors
- $$K^i(s) = KLinear^i_{\tau(s)}(H^{(l-1)}(s))$$
- $$Q^i(t) = QLinear^i_{\tau(t)}(H^{(l-1)}(t))$$
- $$W^{ATT}_{\phi(e)}$$ = edge-based attention weights
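A hedged sketch of one attention head under an assumed setup: two node types (`"paper"`, `"author"`), one edge type (`"writes"`), and per-type parameters stored in `ModuleDict`/`ParameterDict`s. The type names and module layout are illustrative, not the paper's released code:

```python
import math
import torch
import torch.nn as nn

d = 16  # per-head key/query dimension

# one K/Q projection per node type, one W^ATT per edge type,
# and one scalar mu per meta-relation triplet (all illustrative)
k_linear = nn.ModuleDict({"paper": nn.Linear(d, d), "author": nn.Linear(d, d)})
q_linear = nn.ModuleDict({"paper": nn.Linear(d, d), "author": nn.Linear(d, d)})
w_att = nn.ParameterDict({"writes": nn.Parameter(torch.eye(d))})
mu = nn.ParameterDict({"author,writes,paper": nn.Parameter(torch.tensor(1.0))})

def att_head(h_s, tau_s, h_t, tau_t, phi_e):
    """Unnormalized score of one attention head for a (s, e, t) triple."""
    k = k_linear[tau_s](h_s)              # K^i(s)
    q = q_linear[tau_t](h_t)              # Q^i(t)
    triplet = f"{tau_s},{phi_e},{tau_t}"  # meta-relation key
    score = k @ w_att[phi_e] @ q          # K^i(s) W^ATT Q^i(t)^T
    return score * mu[triplet] / math.sqrt(d)
```

At the layer level, the $$h$$ head scores are concatenated and softmax-normalized over $$N(t)$$, as in the $$Att$$ equation above.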
Heterogeneous message passing
- $$Msg(s, e, t) = concat_{i \in [1,h]}(MsgHead^i(s, e, t))$$
- $$MsgHead^i(s, e, t) = MLinear^i_{\tau(s)}(H^{(l-1)}(s)) \, W^{MSG}_{\phi(e)}$$
- $$W^{MSG}_{\phi(e)}$$ = edge-based message weights
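A matching sketch of one message head, with the same assumed type names and dict layout as the attention sketch:

```python
import torch
import torch.nn as nn

d = 16  # per-head message dimension

# per-source-type projection MLinear and per-edge-type weights W^MSG (illustrative)
m_linear = nn.ModuleDict({"paper": nn.Linear(d, d), "author": nn.Linear(d, d)})
w_msg = nn.ParameterDict({"writes": nn.Parameter(torch.eye(d))})

def msg_head(h_s, tau_s, phi_e):
    """MsgHead^i(s, e, t) = MLinear_{tau(s)}(H^{l-1}(s)) W^MSG_{phi(e)}"""
    return m_linear[tau_s](h_s) @ w_msg[phi_e]

# Msg(s, e, t) concatenates all h heads along the feature dimension:
# msg = torch.cat([head(h_s, tau_s, phi_e) for head in msg_heads], dim=-1)
```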
Target-specific aggregation

- $$\tilde{H}^{(l)}(t) = \sum_{s \in N(t)} Att(s, e, t) \cdot Msg(s, e, t)$$
- $$H^{(l)}(t) = ALinear_{\tau(t)}(\sigma(\tilde{H}^{(l)}(t))) + H^{(l-1)}(t)$$
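A minimal sketch of this aggregation for one target node; `a_linear` stands in for $$ALinear$$, and GELU for $$\sigma$$ is an assumption (the notes leave the activation unspecified):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 32

# one output projection per target node type (illustrative)
a_linear = nn.ModuleDict({"paper": nn.Linear(d_model, d_model)})

def aggregate(att_scores, msgs, h_t_prev, tau_t):
    """Softmax over N(t), weighted sum of messages, then ALinear + residual.

    att_scores: (|N(t)|,) unnormalized attention scores for the target t
    msgs:       (|N(t)|, d_model) messages from the source nodes
    h_t_prev:   (d_model,) previous-layer embedding H^{l-1}(t)
    """
    w = F.softmax(att_scores, dim=0)               # Att(s, e, t)
    h_tilde = (w.unsqueeze(-1) * msgs).sum(dim=0)  # sum over s in N(t)
    return a_linear[tau_t](F.gelu(h_tilde)) + h_t_prev
```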
Relative Temporal Encoding (RTE)

- handles dynamic graphs whose nodes/edges carry timestamps
- based on the Transformer's positional encoding technique
- $$\Delta(t, s) = T(t) - T(s)$$
- $$T$$ = timestamp associated with the nodes/edges
- $$Base(\Delta(t, s), 2i) = sin(\frac{\Delta(t, s)}{10000^{\frac{2i}{d}}})$$
- $$Base(\Delta(t, s), 2i+1) = cos(\frac{\Delta(t, s)}{10000^{\frac{2i+1}{d}}})$$
- $$RTE(\Delta(t, s)) = TLinear(Base(\Delta(t, s)))$$
- Finally, the source-node embeddings fed into the $$l$$-th layer become:
- $$\tilde{H}^{(l-1)}(s) = H^{(l-1)}(s) + RTE(\Delta(t, s))$$
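A sketch of RTE as a small module; the even/odd interleaving follows the $$Base$$ equations above, while the module layout and names are assumptions:

```python
import torch
import torch.nn as nn

class RTE(nn.Module):
    """Relative Temporal Encoding: fixed sinusoidal basis + learnable TLinear."""

    def __init__(self, d):
        super().__init__()
        self.d = d
        self.t_linear = nn.Linear(d, d)  # TLinear

    def forward(self, delta):
        # delta: Delta(t, s) = T(t) - T(s), a float or scalar tensor
        i = torch.arange(self.d)
        angle = delta / 10000 ** (i / self.d)
        # sin at even positions 2i, cos at odd positions 2i+1
        base = torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))
        return self.t_linear(base)

# usage: h_s_input = h_s_prev + RTE(d=32)(torch.tensor(5.0))
```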
HGSampling
The sampling method needs to be aware of the different node and edge types, so that it neither skews the sampled subgraph toward any one type nor inflates the sampling variance.
AddInBudget(B, t, A, NS)

- for each source node type $$\tau$$ and edge type $$\phi$$:
  - compute the normalized node degree $$D_t = \frac{1}{len(A_{\tau(s), \phi(e), \tau(t)}(t))}$$
  - for each source node $$s \in A_{\tau(s), \phi(e), \tau(t)}(t)$$ that is not yet sampled:
    - if $$s$$ has no timestamp, set $$s.time = t.time$$
    - update the budget: $$B(\tau)(s) += D_t$$
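A Python sketch of AddInBudget under assumed dictionary-based data structures (the paper does not mandate this representation):

```python
from collections import defaultdict

def add_in_budget(B, t, A, NS, node_time):
    """Add t's un-sampled neighbors to the budget B.

    B:         budget, B[node_type][node_id] -> accumulated normalized degree
    t:         id of the node just added to the sample
    A:         adjacency, A[(src_type, edge_type, dst_type)][node_id] -> neighbor ids
    NS:        per-type sets of already-sampled node ids
    node_time: node_time[node_id] -> timestamp, or None if the node has none
    """
    for (tau_s, phi_e, tau_t), adj in A.items():
        neighbors = adj.get(t, [])
        if not neighbors:
            continue
        D_t = 1.0 / len(neighbors)          # normalized node degree
        for s in neighbors:
            if s not in NS[tau_s]:          # s is not sampled yet
                if node_time[s] is None:    # inherit t's timestamp
                    node_time[s] = node_time[t]
                B[tau_s][s] += D_t          # accumulate budget for s

# a fresh budget can be created as: B = defaultdict(lambda: defaultdict(float))
```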
Main loop:

- for each $$t \in NS$$:
  - AddInBudget(B, t, A, NS)
- for $$l \in [1, L]$$:
  - for each source node type $$\tau \in B$$:
    - for each source node $$s \in B(\tau)$$:
      - compute the sampling probability $$p^{l-1}(\tau)(s) = \frac{B(\tau)(s)^2}{L2Norm(B(\tau))^2}$$
    - sample $$n$$ nodes $$\{t_i\}$$ from $$B(\tau)$$ using $$p^{l-1}(\tau)$$
    - for each sampled node $$t$$:
      - $$OS(\tau).add(t)$$
      - AddInBudget(B, t, A, NS)
      - $$B(\tau).pop(t)$$
- finally, construct the adjacency matrix for the subsampled graph

where:
- $$A$$ = adjacency matrix of the full graph
- $$B$$ = budget info used for computing the sampling probabilities
- $$NS$$ = the initially sampled (input) node set
- $$OS$$ = output sampled node set
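Putting it together, a hedged end-to-end sketch of the sampling loop, reusing the `add_in_budget` sketch above; the with-replacement draw and the dict-based graph layout are simplifications of the algorithm, not the authors' implementation:

```python
import random
from collections import defaultdict

def hg_sample(A, NS, node_time, L, n):
    """Run L rounds of budget-based importance sampling; return OS.

    A:  adjacency dict as in add_in_budget
    NS: initial node set, NS[node_type] -> iterable of node ids
    L:  sampling depth (number of layers); n: nodes sampled per type per round
    """
    B = defaultdict(lambda: defaultdict(float))
    OS = defaultdict(set)                   # OS doubles as the sampled set here
    for tau, nodes in NS.items():
        OS[tau] |= set(nodes)
    for tau, nodes in NS.items():
        for t in nodes:
            add_in_budget(B, t, A, OS, node_time)
    for _ in range(L):
        for tau in list(B.keys()):          # snapshot: the budget grows as we sample
            ids = list(B[tau].keys())
            if not ids:
                continue
            norm = sum(B[tau][s] ** 2 for s in ids)
            probs = [B[tau][s] ** 2 / norm for s in ids]  # p = B^2 / L2Norm(B)^2
            # with-replacement draw deduped for brevity; the paper draws n distinct nodes
            picked = set(random.choices(ids, weights=probs, k=min(n, len(ids))))
            for t in picked:
                OS[tau].add(t)
                add_in_budget(B, t, A, OS, node_time)
                B[tau].pop(t)               # remove t from the budget
    return OS  # finally, build the adjacency matrix among the OS nodes
```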