Heterogeneous Graph Transformer
Proposal
Main paper: [Heterogeneous Graph Transformer (Hu et al., WWW 2020)](https://arxiv.org/abs/2003.01332).
- uses meta-relations over node and edge types to model heterogeneous graphs
- proposes the HGT network for learning on such graphs
- proposes HGSampling, a sampling algorithm tailor-made for this network, for minibatch training
- proposes Relative Temporal Encoding (RTE) to handle dynamic graphs
Summary
Heterogeneous mutual attention

$$H^{(l)}(t) = Agg_{s \in N(t)}(Att(s, e, t) \cdot Msg(s, e, t))$$
where:
- $$t$$ = target node
- $$s$$ = source node
- $$l$$ = index of the current layer
- $$H^{(l)}$$ = output node embeddings of the $$l$$-th layer
- $$Agg$$ = heterogeneous aggregation function, $$Agg(.) = \sigma(Mean(.))$$
- $$N(t)$$ = neighborhood of the target node
- $$Att$$ = heterogeneous edge-wise attention weights
- $$Msg$$ = heterogeneous message passing from the source nodes, e.g. $$Msg(s) = W H^{(l-1)}(s)$$
- $$L$$ = number of layers in the network
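A minimal sketch of this update for a single target node, with `att` and `msg` as stand-in callables and $$\sigma$$ taken as a sigmoid for concreteness (all names here are illustrative, not from the paper's code):

```python
import torch

def hgt_layer_update(neighbors, att, msg):
    """One heterogeneous update for a single target node t.

    neighbors: list of (source_embedding, edge_type) pairs covering N(t)
    att: callable (s, e) -> scalar attention weight Att(s, e, t)
    msg: callable (s, e) -> message vector Msg(s, e, t)
    """
    weighted = torch.stack([att(s, e) * msg(s, e) for s, e in neighbors])
    # Agg(.) = sigma(Mean(.)), matching the definition above
    return torch.sigmoid(weighted.mean(dim=0))
```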
Heterogeneous attention

- $$Att(s, e, t) = softmax_{s \in N(t)}(concat_{i \in [1,h]}(AttHead^i(s, e, t)))$$
- $$h$$ = number of attention heads
- $$AttHead^i(s, e, t) = (K^i(s) \, W^{ATT}_{\phi(e)} \, Q^i(t)^T) \cdot \frac{\mu_{\langle \tau(s), \phi(e), \tau(t) \rangle}}{\sqrt{d}}$$
- $$\phi$$ = edge types
- $$\tau$$ = node types
- $$\mu$$ = learnable significance weights for each meta-relation triplet $$\langle \tau(s), \phi(e), \tau(t) \rangle$$
- $$d$$ = dimensionality of each attention head's key/query vectors
- $$K^i(s) = KLinear^i_{\tau(s)}(H^{(l-1)}(s))$$
- $$Q^i(t) = QLinear^i_{\tau(t)}(H^{(l-1)}(t))$$
- $$W^{ATT}_{\phi(e)}$$ = edge-based attention weights
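A hedged sketch of one attention head under an assumed setup: two node types (`"paper"`, `"author"`), one edge type (`"writes"`), and per-type parameters stored in `ModuleDict`/`ParameterDict`s. The type names and module layout are illustrative, not the paper's released code:

```python
import math
import torch
import torch.nn as nn

d = 16  # per-head key/query dimension

# one K/Q projection per node type, one W^ATT per edge type,
# and one scalar mu per meta-relation triplet (all illustrative)
k_linear = nn.ModuleDict({"paper": nn.Linear(d, d), "author": nn.Linear(d, d)})
q_linear = nn.ModuleDict({"paper": nn.Linear(d, d), "author": nn.Linear(d, d)})
w_att = nn.ParameterDict({"writes": nn.Parameter(torch.eye(d))})
mu = nn.ParameterDict({"author,writes,paper": nn.Parameter(torch.tensor(1.0))})

def att_head(h_s, tau_s, h_t, tau_t, phi_e):
    """Unnormalized score of one attention head for a (s, e, t) triple."""
    k = k_linear[tau_s](h_s)              # K^i(s)
    q = q_linear[tau_t](h_t)              # Q^i(t)
    triplet = f"{tau_s},{phi_e},{tau_t}"  # meta-relation key
    score = k @ w_att[phi_e] @ q          # K^i(s) W^ATT Q^i(t)^T
    return score * mu[triplet] / math.sqrt(d)
```

At the layer level, the $$h$$ head scores are concatenated and softmax-normalized over $$N(t)$$, as in the $$Att$$ equation above.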
Heterogeneous message passing
- $$Msg(s, e, t) = concat_{i \in [1,h]}(MsgHead^i(s, e, t))$$
- $$MsgHead^i(s, e, t) = MLinear^i_{\tau(s)}(H^{(l-1)}(s)) \, W^{MSG}_{\phi(e)}$$
- $$W^{MSG}_{\phi(e)}$$ = edge-based message weights
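A matching sketch of one message head, with the same assumed type names and dict layout as the attention sketch:

```python
import torch
import torch.nn as nn

d = 16  # per-head message dimension

# per-source-type projection MLinear and per-edge-type weights W^MSG (illustrative)
m_linear = nn.ModuleDict({"paper": nn.Linear(d, d), "author": nn.Linear(d, d)})
w_msg = nn.ParameterDict({"writes": nn.Parameter(torch.eye(d))})

def msg_head(h_s, tau_s, phi_e):
    """MsgHead^i(s, e, t) = MLinear_{tau(s)}(H^{l-1}(s)) W^MSG_{phi(e)}"""
    return m_linear[tau_s](h_s) @ w_msg[phi_e]

# Msg(s, e, t) concatenates all h heads along the feature dimension:
# msg = torch.cat([head(h_s, tau_s, phi_e) for head in msg_heads], dim=-1)
```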
Target-specific aggregation

- $$\tilde{H}^{(l)}(t) = \sum_{s \in N(t)} Att(s, e, t) \cdot Msg(s, e, t)$$
- $$H^{(l)}(t) = ALinear_{\tau(t)}(\sigma(\tilde{H}^{(l)}(t))) + H^{(l-1)}(t)$$
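A minimal sketch of this aggregation for one target node; `a_linear` stands in for $$ALinear$$, and GELU for $$\sigma$$ is an assumption (the notes leave the activation unspecified):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 32

# one output projection per target node type (illustrative)
a_linear = nn.ModuleDict({"paper": nn.Linear(d_model, d_model)})

def aggregate(att_scores, msgs, h_t_prev, tau_t):
    """Softmax over N(t), weighted sum of messages, then ALinear + residual.

    att_scores: (|N(t)|,) unnormalized attention scores for the target t
    msgs:       (|N(t)|, d_model) messages from the source nodes
    h_t_prev:   (d_model,) previous-layer embedding H^{l-1}(t)
    """
    w = F.softmax(att_scores, dim=0)               # Att(s, e, t)
    h_tilde = (w.unsqueeze(-1) * msgs).sum(dim=0)  # sum over s in N(t)
    return a_linear[tau_t](F.gelu(h_tilde)) + h_t_prev
```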
Relative Temporal Encoding (RTE)

- handles dynamic graphs whose nodes/edges carry timestamps
- based on the Transformer's positional encoding technique
- $$\Delta(t, s) = T(t) - T(s)$$
- $$T$$ = timestamp associated with the nodes/edges
- $$Base(\Delta(t, s), 2i) = sin(\frac{\Delta(t, s)}{10000^{\frac{2i}{d}}})$$
- $$Base(\Delta(t, s), 2i+1) = cos(\frac{\Delta(t, s)}{10000^{\frac{2i+1}{d}}})$$
- $$RTE(\Delta(t, s)) = TLinear(Base(\Delta(t, s)))$$
- Finally, the source-node embeddings fed into the $$l$$-th layer become:
- $$\tilde{H}^{(l-1)}(s) = H^{(l-1)}(s) + RTE(\Delta(t, s))$$
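A sketch of RTE as a small module; the even/odd interleaving follows the $$Base$$ equations above, while the module layout and names are assumptions:

```python
import torch
import torch.nn as nn

class RTE(nn.Module):
    """Relative Temporal Encoding: fixed sinusoidal basis + learnable TLinear."""

    def __init__(self, d):
        super().__init__()
        self.d = d
        self.t_linear = nn.Linear(d, d)  # TLinear

    def forward(self, delta):
        # delta: Delta(t, s) = T(t) - T(s), a float or scalar tensor
        i = torch.arange(self.d)
        angle = delta / 10000 ** (i / self.d)
        # sin at even positions 2i, cos at odd positions 2i+1
        base = torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))
        return self.t_linear(base)

# usage: h_s_input = h_s_prev + RTE(d=32)(torch.tensor(5.0))
```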
HGSampling
The sampling method needs to be aware of the different node and edge types, so that it neither skews the sampled subgraph toward any one type nor inflates the sampling variance.
AddInBudget(B, t, A, NS)

- for each source node type $$\tau$$ and edge type $$\phi$$:
  - compute the normalized node degree $$D_t = \frac{1}{len(A_{\tau(s), \phi(e), \tau(t)}(t))}$$
  - for each source node $$s \in A_{\tau(s), \phi(e), \tau(t)}(t)$$ that is not yet sampled:
    - if $$s$$ has no timestamp, set $$s.time = t.time$$
    - update the budget: $$B(\tau)(s) += D_t$$
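A Python sketch of AddInBudget under assumed dictionary-based data structures (the paper does not mandate this representation):

```python
from collections import defaultdict

def add_in_budget(B, t, A, NS, node_time):
    """Add t's un-sampled neighbors to the budget B.

    B:         budget, B[node_type][node_id] -> accumulated normalized degree
    t:         id of the node just added to the sample
    A:         adjacency, A[(src_type, edge_type, dst_type)][node_id] -> neighbor ids
    NS:        per-type sets of already-sampled node ids
    node_time: node_time[node_id] -> timestamp, or None if the node has none
    """
    for (tau_s, phi_e, tau_t), adj in A.items():
        neighbors = adj.get(t, [])
        if not neighbors:
            continue
        D_t = 1.0 / len(neighbors)          # normalized node degree
        for s in neighbors:
            if s not in NS[tau_s]:          # s is not sampled yet
                if node_time[s] is None:    # inherit t's timestamp
                    node_time[s] = node_time[t]
                B[tau_s][s] += D_t          # accumulate budget for s

# a fresh budget can be created as: B = defaultdict(lambda: defaultdict(float))
```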
Main loop:

- for each $$t \in NS$$:
  - AddInBudget(B, t, A, NS)
- for $$l \in [1, L]$$:
  - for each source node type $$\tau \in B$$:
    - for each source node $$s \in B(\tau)$$:
      - compute the sampling probability $$p^{l-1}(\tau)(s) = \frac{B(\tau)(s)^2}{L2Norm(B(\tau))^2}$$
    - sample $$n$$ nodes $$\{t_i\}$$ from $$B(\tau)$$ using $$p^{l-1}(\tau)$$
    - for each sampled node $$t$$:
      - $$OS(\tau).add(t)$$
      - AddInBudget(B, t, A, NS)
      - $$B(\tau).pop(t)$$
- finally, construct the adjacency matrix for the subsampled graph

where:
- $$A$$ = adjacency matrix of the full graph
- $$B$$ = budget info used for computing the sampling probabilities
- $$NS$$ = the initially sampled (input) node set
- $$OS$$ = output sampled node set
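Putting it together, a hedged end-to-end sketch of the sampling loop, reusing the `add_in_budget` sketch above; the with-replacement draw and the dict-based graph layout are simplifications of the algorithm, not the authors' implementation:

```python
import random
from collections import defaultdict

def hg_sample(A, NS, node_time, L, n):
    """Run L rounds of budget-based importance sampling; return OS.

    A:  adjacency dict as in add_in_budget
    NS: initial node set, NS[node_type] -> iterable of node ids
    L:  sampling depth (number of layers); n: nodes sampled per type per round
    """
    B = defaultdict(lambda: defaultdict(float))
    OS = defaultdict(set)                   # OS doubles as the sampled set here
    for tau, nodes in NS.items():
        OS[tau] |= set(nodes)
    for tau, nodes in NS.items():
        for t in nodes:
            add_in_budget(B, t, A, OS, node_time)
    for _ in range(L):
        for tau in list(B.keys()):          # snapshot: the budget grows as we sample
            ids = list(B[tau].keys())
            if not ids:
                continue
            norm = sum(B[tau][s] ** 2 for s in ids)
            probs = [B[tau][s] ** 2 / norm for s in ids]  # p = B^2 / L2Norm(B)^2
            # with-replacement draw deduped for brevity; the paper draws n distinct nodes
            picked = set(random.choices(ids, weights=probs, k=min(n, len(ids))))
            for t in picked:
                OS[tau].add(t)
                add_in_budget(B, t, A, OS, node_time)
                B[tau].pop(t)               # remove t from the budget
    return OS  # finally, build the adjacency matrix among the OS nodes
```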