Proposal

The main paper (PairNorm: Tackling Over-smoothing in GNNs, Zhao & Akoglu, ICLR 2020) can be found here.

  • a new pairnorm layer that normalizes the intermediate embeddings to avoid over-smoothing (OS)
  • makes deeper GNN architectures viable
  • introduces no extra learnable parameters (only a single hyperparameter)

Summary

  • deeper GNNs show a gradual loss in accuracy due to
    • vanishing gradients
    • overfitting due to the increased number of learnable parameters
    • OS
  • what is OS?
    • the phenomenon where node embeddings become very similar to each other
    • it is a form of Laplacian smoothing
    • for shallow nets this is fine, since nodes within a cluster correctly get similar embeddings
    • for deeper nets, however, there is inter-cluster mixing (aka node-wise smoothing)
    • also, repeatedly applying convolutions (i.e., Laplacian smoothing) washes out all the signal in the features (feature-wise smoothing)
  • in order to isolate the effect of OS, the authors run the following experiment (a propagation sketch follows this list)
    • take an SGC and strip out all feature-transformation layers, leaving only graph propagation
    • thus neither overfitting nor vanishing gradients can explain any drop in performance
    • they plot the following 2 metrics to expose the OS behavior (a sketch implementation also follows this list)
    • $$\mathrm{rowdiff}(H^{(k)}) = \frac{1}{n^2} \sum_{i,j} \sum_p \left(H_{ip}^{(k)} - H_{jp}^{(k)}\right)^2$$
    • $$\mathrm{coldiff}(H^{(k)}) = \frac{1}{d^2} \sum_{i,j} \sum_p \left(\frac{H_{pi}^{(k)}}{\sum_q |H_{qi}^{(k)}|} - \frac{H_{pj}^{(k)}}{\sum_q |H_{qj}^{(k)}|}\right)^2$$
    • $$n$$ = number of nodes
    • $$d$$ = feature dimension
    • $$H^{(k)}$$ = node embeddings after the $$k$$-th propagation hop
  • the authors then show the similarity of GNNs to the Graph-Regularized Least Squares (GRLS) problem (a schematic objective is sketched after this list)
  • they then extend the GRLS loss function with a penalty term against inter-cluster mixing, in order to minimize the effect of oversmoothing
  • they propose pairnorm, a layer that keeps the Total Pairwise Squared Distance (TPSD) of the embeddings constant across layers
  • pairnorm (a sketch implementation of both variants follows this list)
    • centering: $$x_{ik}^c = x_{ik} - \frac{1}{n} \sum_j x_{jk}$$
    • rescaling: $$x'_{ik} = \frac{s \, x_{ik}^c}{\sqrt{\frac{1}{n} \sum_j \sum_p (x_{jp}^c)^2}}$$
    • $$s$$ = hyperparameter controlling the TPSD
    • works well for SGC
    • similar to a batch-norm layer, but without the learned scale and bias
  • pairnorm-SI (Scale Individually)
    • $$x'_{ik} = \frac{s \, x_{ik}^c}{\sqrt{\sum_p (x_{ip}^c)^2}}$$
    • works well for SGC, GAT and GCN
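
A minimal sketch of the parameter-free SGC propagation used in the oversmoothing experiment above. The symmetric normalization with self-loops is the standard SGC recipe; the dense-tensor setup and all names here are illustrative, not the authors' code.

```python
import torch

def sgc_propagate(adj: torch.Tensor, x: torch.Tensor, k: int) -> torch.Tensor:
    """k hops of parameter-free SGC propagation (no transformation layers).

    adj: (n, n) dense float 0/1 adjacency matrix; x: (n, d) node features.
    """
    n = adj.size(0)
    a_hat = adj + torch.eye(n)                        # add self-loops
    d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)           # degree^(-1/2) of A + I
    a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    h = x
    for _ in range(k):                                # each hop smooths features
        h = a_norm @ h
    return h
```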
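And a direct translation of the two diagnostics into code, following the definitions above (function names are mine):

```python
def row_diff(h: torch.Tensor) -> torch.Tensor:
    """Mean pairwise squared distance between node (row) embeddings."""
    n = h.size(0)
    return torch.cdist(h, h).pow(2).sum() / (n * n)

def col_diff(h: torch.Tensor) -> torch.Tensor:
    """Mean pairwise squared distance between L1-normalized feature columns."""
    d = h.size(1)
    cols = (h / h.abs().sum(dim=0, keepdim=True)).t()  # (d, n), L1-normalized
    return torch.cdist(cols, cols).pow(2).sum() / (d * d)
```

Evaluating both on `sgc_propagate(adj, x, k)` for growing `k` reproduces the oversmoothing picture: both metrics decay towards zero as the hop count increases.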
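For reference, the GRLS connection mentioned above can be written schematically (a simplified sketch of the paper's setup; its exact degree-weighted norms are omitted here):

$$\min_{\bar{Z}} \; \lVert \bar{Z} - \tilde{X} \rVert_F^2 + \sum_{(i,j) \in \mathcal{E}} \lVert \bar{z}_i - \bar{z}_j \rVert_2^2$$

The first term keeps embeddings close to the input features, the second pulls neighbors together. The extended loss adds a term of the form $$-\lambda \sum_{(i,j) \notin \mathcal{E}} \lVert \bar{z}_i - \bar{z}_j \rVert_2^2$$, pushing unconnected (likely inter-cluster) pairs apart, which is the effect pairnorm then achieves cheaply via normalization.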
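Finally, a compact sketch of both pairnorm variants as defined above (a hypothetical implementation written from the formulas, not the authors' released code):

```python
def pairnorm(x: torch.Tensor, s: float = 1.0,
             scale_individually: bool = False) -> torch.Tensor:
    """Center embeddings across nodes, then rescale to keep TPSD constant.

    x: (n, d) intermediate node embeddings; s: scale hyperparameter.
    """
    x_c = x - x.mean(dim=0, keepdim=True)           # per-feature centering
    if scale_individually:                          # pairnorm-SI: each row gets norm s
        return s * x_c / x_c.norm(dim=1, keepdim=True)
    # vanilla pairnorm: divide by sqrt of the mean squared row norm
    return s * x_c / x_c.pow(2).sum(dim=1).mean().sqrt()
```

Inserting this between graph convolutions (e.g. `h = pairnorm(conv(h, adj))`) is how the layer is meant to be used; the SI variant is the one reported to work across SGC, GCN, and GAT.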