Proposal

The main paper can be found here.

  • Efficient full-graph GNN training on a single node
  • Scalable full-graph GNN training across multiple nodes
  • A backend to DGL for optimized CPU training

Summary

  • Rearranges the aggregation loop to reuse neighbor vertex features (see the aggregation sketch after this list)
  • Uses libxsmm to exploit SIMD units during neighborhood aggregation
  • Uses vertex-cut graph partitioning to minimize communication cost (see the partitioning sketch below)
  • Proposes three ways to scale training to multiple nodes, all data-parallel (see the communication-scheme sketch below):
    • 0c - ignore the aggregation of split vertices from other sockets/nodes entirely
    • cd-0 - in each epoch, wait until the partial aggregates of all split vertices are available
    • cd-r - overlap communication with computation via a Hogwild-like delayed aggregation of split-vertex embeddings (partial aggregates arrive r epochs stale)
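
A minimal sketch of the aggregation-loop rearrangement, assuming a CSR graph with row-major vertex features; the function and variable names are hypothetical, not the paper's. The naive pull-style loop re-reads a neighbor's feature row once per incident destination, while the rearranged push-style loop loads each source row once and reuses it across all of its out-edges; the unit-stride inner loop over the feature dimension is the part that SIMD units (via libxsmm kernels, in the paper) can vectorize.

```cpp
#include <cstddef>
#include <vector>

// Naive pull-style aggregation: for every destination v, gather and sum
// the feature rows of its in-neighbors. A popular source row is fetched
// from memory once per out-edge, with no reuse.
void aggregate_pull(const std::vector<std::size_t>& indptr,   // CSR offsets
                    const std::vector<std::size_t>& indices,  // in-neighbors
                    const std::vector<float>& feat,           // |V| x dim
                    std::vector<float>& out, std::size_t dim) {
  const std::size_t n = indptr.size() - 1;
  for (std::size_t v = 0; v < n; ++v)
    for (std::size_t e = indptr[v]; e < indptr[v + 1]; ++e)
      for (std::size_t d = 0; d < dim; ++d)
        out[v * dim + d] += feat[indices[e] * dim + d];
}

// Rearranged push-style aggregation: iterate over sources, load each
// source feature row once, and scatter it to all of its destinations.
// The row stays hot in cache across its out-edges, and the unit-stride
// loop over d maps onto SIMD / small-GEMM kernels such as libxsmm's.
void aggregate_push(const std::vector<std::size_t>& out_ptr,  // CSR by source
                    const std::vector<std::size_t>& out_dst,  // out-neighbors
                    const std::vector<float>& feat,
                    std::vector<float>& out, std::size_t dim) {
  const std::size_t n = out_ptr.size() - 1;
  for (std::size_t u = 0; u < n; ++u) {
    const float* src = &feat[u * dim];                        // loaded once
    for (std::size_t e = out_ptr[u]; e < out_ptr[u + 1]; ++e) {
      float* dst = &out[out_dst[e] * dim];
      for (std::size_t d = 0; d < dim; ++d)                   // unit stride
        dst[d] += src[d];
    }
  }
}
```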
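For the partitioning step, here is a toy greedy vertex-cut heuristic, illustrative only and not the paper's actual partitioner: edges (not vertices) are assigned to partitions, so a vertex whose edges land on several partitions gets replicated there as a "split" vertex, and only the partial aggregates of those split vertices ever cross socket/node boundaries.

```cpp
#include <cstddef>
#include <cstdio>
#include <unordered_set>
#include <vector>

struct Edge { int u, v; };

// Greedily place each edge on the partition already owning most of its
// endpoints, breaking ties by load. Vertices copied to >1 partition are
// split; communication volume grows with the number of split vertices.
std::vector<int> vertex_cut(const std::vector<Edge>& edges, int k,
                            std::vector<std::unordered_set<int>>& owned) {
  owned.assign(k, {});
  std::vector<int> part(edges.size());
  std::vector<std::size_t> load(k, 0);
  for (std::size_t i = 0; i < edges.size(); ++i) {
    int best = 0, best_score = -1;
    for (int p = 0; p < k; ++p) {
      int score = (int)owned[p].count(edges[i].u) +
                  (int)owned[p].count(edges[i].v);
      if (score > best_score ||
          (score == best_score && load[p] < load[best])) {
        best = p;
        best_score = score;
      }
    }
    part[i] = best;
    ++load[best];
    owned[best].insert(edges[i].u);
    owned[best].insert(edges[i].v);
  }
  return part;
}

int main() {
  std::vector<Edge> edges{{0, 1}, {1, 2}, {2, 0}, {2, 3}, {3, 4}, {4, 2}};
  std::vector<std::unordered_set<int>> owned;
  vertex_cut(edges, 2, owned);
  int split = 0;  // vertices replicated on both partitions
  for (int v = 0; v <= 4; ++v)
    if (owned[0].count(v) && owned[1].count(v)) ++split;
  std::printf("split vertices: %d\n", split);
}
```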
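The three multi-node schemes differ only in when (or whether) the partial aggregates of split vertices are exchanged. The sketch below is a single-process simulation of that control flow; post_exchange, wait_exchange, and the other stubs are hypothetical stand-ins for MPI-style non-blocking collectives, not the paper's implementation.

```cpp
#include <cstddef>
#include <cstdio>
#include <map>
#include <vector>

using Vec = std::vector<float>;
static std::map<int, Vec> in_flight;  // posted messages, keyed by epoch tag

// Hypothetical stand-ins for the real compute/communication layer.
Vec local_partial_aggregate() { return Vec(4, 1.0f); }  // this rank's split-vertex sums
void post_exchange(const Vec& p, int tag) { in_flight[tag] = p; }  // non-blocking send
Vec wait_exchange(int tag) { return in_flight.at(tag); }           // completes exchange
void combine(Vec& feat, const Vec& remote) {
  for (std::size_t i = 0; i < feat.size(); ++i) feat[i] += remote[i];
}
void forward_backward_update(Vec&) { /* one epoch of GNN compute */ }

enum class Scheme { ZeroComm, CD0, CDR };

void train(Vec& feat, int epochs, Scheme s, int r = 2) {
  for (int e = 0; e < epochs; ++e) {
    if (s == Scheme::CD0) {
      // cd-0: exchange and wait for everyone's partial aggregates of the
      // split vertices before this epoch's computation starts.
      post_exchange(local_partial_aggregate(), e);
      combine(feat, wait_exchange(e));
    } else if (s == Scheme::CDR) {
      // cd-r: post this epoch's partial aggregates but consume the ones
      // posted r epochs ago; the staleness hides communication behind
      // computation, in the spirit of Hogwild-style asynchrony.
      post_exchange(local_partial_aggregate(), e);
      if (e >= r) combine(feat, wait_exchange(e - r));
    }
    // 0c falls through: split-vertex contributions from other
    // sockets/nodes are simply ignored.
    forward_backward_update(feat);
  }
}

int main() {
  Vec feat(4, 0.0f);
  train(feat, 5, Scheme::CDR);
  std::printf("feat[0] after cd-r training: %.1f\n", feat[0]);
}
```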