Proposal

The main paper can be found here.

  • Efficient full-graph GNN training on a single node
  • Scalable full-graph GNN training across multiple nodes
  • A backend to DGL for optimized CPU training

Summary

  • Rearranges the aggregation loop to reuse neighbor vertex features (see the aggregation sketch after this list)
  • Uses libxsmm to exploit SIMD units during neighborhood aggregation
  • Uses vertex-cut graph partitioning to minimize communication cost (see the partitioning sketch below)
  • Proposes three ways to scale training to multiple nodes, all data-parallel (see the communication-scheme sketch below):
    • 0c - ignore the aggregation of split vertices from other sockets/nodes entirely
    • cd-0 - in each epoch, wait until the partial aggregates of all split vertices are available
    • cd-r - overlap communication with computation via a Hogwild-like delayed aggregation of split-vertex embeddings (partial aggregates arrive r epochs stale)
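
A minimal sketch of the aggregation-loop rearrangement, assuming a CSR graph with row-major vertex features; the function and variable names are hypothetical, not the paper's. The naive pull-style loop re-reads a neighbor's feature row once per incident destination, while the rearranged push-style loop loads each source row once and reuses it across all of its out-edges; the unit-stride inner loop over the feature dimension is the part that SIMD units (via libxsmm kernels, in the paper) can vectorize.

```cpp
#include <cstddef>
#include <vector>

// Naive pull-style aggregation: for every destination v, gather and sum
// the feature rows of its in-neighbors. A popular source row is fetched
// from memory once per out-edge, with no reuse.
void aggregate_pull(const std::vector<std::size_t>& indptr,   // CSR offsets
                    const std::vector<std::size_t>& indices,  // in-neighbors
                    const std::vector<float>& feat,           // |V| x dim
                    std::vector<float>& out, std::size_t dim) {
  const std::size_t n = indptr.size() - 1;
  for (std::size_t v = 0; v < n; ++v)
    for (std::size_t e = indptr[v]; e < indptr[v + 1]; ++e)
      for (std::size_t d = 0; d < dim; ++d)
        out[v * dim + d] += feat[indices[e] * dim + d];
}

// Rearranged push-style aggregation: iterate over sources, load each
// source feature row once, and scatter it to all of its destinations.
// The row stays hot in cache across its out-edges, and the unit-stride
// loop over d maps onto SIMD / small-GEMM kernels such as libxsmm's.
void aggregate_push(const std::vector<std::size_t>& out_ptr,  // CSR by source
                    const std::vector<std::size_t>& out_dst,  // out-neighbors
                    const std::vector<float>& feat,
                    std::vector<float>& out, std::size_t dim) {
  const std::size_t n = out_ptr.size() - 1;
  for (std::size_t u = 0; u < n; ++u) {
    const float* src = &feat[u * dim];                        // loaded once
    for (std::size_t e = out_ptr[u]; e < out_ptr[u + 1]; ++e) {
      float* dst = &out[out_dst[e] * dim];
      for (std::size_t d = 0; d < dim; ++d)                   // unit stride
        dst[d] += src[d];
    }
  }
}
```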
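For the partitioning step, here is a toy greedy vertex-cut heuristic, illustrative only and not the paper's actual partitioner: edges (not vertices) are assigned to partitions, so a vertex whose edges land on several partitions gets replicated there as a "split" vertex, and only the partial aggregates of those split vertices ever cross socket/node boundaries.

```cpp
#include <cstddef>
#include <cstdio>
#include <unordered_set>
#include <vector>

struct Edge { int u, v; };

// Greedily place each edge on the partition already owning most of its
// endpoints, breaking ties by load. Vertices copied to >1 partition are
// split; communication volume grows with the number of split vertices.
std::vector<int> vertex_cut(const std::vector<Edge>& edges, int k,
                            std::vector<std::unordered_set<int>>& owned) {
  owned.assign(k, {});
  std::vector<int> part(edges.size());
  std::vector<std::size_t> load(k, 0);
  for (std::size_t i = 0; i < edges.size(); ++i) {
    int best = 0, best_score = -1;
    for (int p = 0; p < k; ++p) {
      int score = (int)owned[p].count(edges[i].u) +
                  (int)owned[p].count(edges[i].v);
      if (score > best_score ||
          (score == best_score && load[p] < load[best])) {
        best = p;
        best_score = score;
      }
    }
    part[i] = best;
    ++load[best];
    owned[best].insert(edges[i].u);
    owned[best].insert(edges[i].v);
  }
  return part;
}

int main() {
  std::vector<Edge> edges{{0, 1}, {1, 2}, {2, 0}, {2, 3}, {3, 4}, {4, 2}};
  std::vector<std::unordered_set<int>> owned;
  vertex_cut(edges, 2, owned);
  int split = 0;  // vertices replicated on both partitions
  for (int v = 0; v <= 4; ++v)
    if (owned[0].count(v) && owned[1].count(v)) ++split;
  std::printf("split vertices: %d\n", split);
}
```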
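The three multi-node schemes differ only in when (or whether) the partial aggregates of split vertices are exchanged. The sketch below is a single-process simulation of that control flow; post_exchange, wait_exchange, and the other stubs are hypothetical stand-ins for MPI-style non-blocking collectives, not the paper's implementation.

```cpp
#include <cstddef>
#include <cstdio>
#include <map>
#include <vector>

using Vec = std::vector<float>;
static std::map<int, Vec> in_flight;  // posted messages, keyed by epoch tag

// Hypothetical stand-ins for the real compute/communication layer.
Vec local_partial_aggregate() { return Vec(4, 1.0f); }  // this rank's split-vertex sums
void post_exchange(const Vec& p, int tag) { in_flight[tag] = p; }  // non-blocking send
Vec wait_exchange(int tag) { return in_flight.at(tag); }           // completes exchange
void combine(Vec& feat, const Vec& remote) {
  for (std::size_t i = 0; i < feat.size(); ++i) feat[i] += remote[i];
}
void forward_backward_update(Vec&) { /* one epoch of GNN compute */ }

enum class Scheme { ZeroComm, CD0, CDR };

void train(Vec& feat, int epochs, Scheme s, int r = 2) {
  for (int e = 0; e < epochs; ++e) {
    if (s == Scheme::CD0) {
      // cd-0: exchange and wait for everyone's partial aggregates of the
      // split vertices before this epoch's computation starts.
      post_exchange(local_partial_aggregate(), e);
      combine(feat, wait_exchange(e));
    } else if (s == Scheme::CDR) {
      // cd-r: post this epoch's partial aggregates but consume the ones
      // posted r epochs ago; the staleness hides communication behind
      // computation, in the spirit of Hogwild-style asynchrony.
      post_exchange(local_partial_aggregate(), e);
      if (e >= r) combine(feat, wait_exchange(e - r));
    }
    // 0c falls through: split-vertex contributions from other
    // sockets/nodes are simply ignored.
    forward_backward_update(feat);
  }
}

int main() {
  Vec feat(4, 0.0f);
  train(feat, 5, Scheme::CDR);
  std::printf("feat[0] after cd-r training: %.1f\n", feat[0]);
}
```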