Summary of the BlazingText paper.

  • parallelizing word2vec SGD on GPUs
  • the typical way of parallelizing SGD is the Hogwild approach
    • threads update the shared weights concurrently and simply ignore any read/write conflicts
    • since the updates are sparse, conflicts are rare and convergence is usually unaffected (a minimal sketch follows this item)
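A minimal sketch of the Hogwild idea in CUDA terms (names and layout here are hypothetical, not from the paper): many threads apply sparse SGD updates to a shared weight table with no locks or atomics, and the occasional lost update is tolerated.

```cuda
// Hypothetical Hogwild-style update: one training example per thread,
// all threads writing into the same weight table without synchronization.
__global__ void hogwild_update(float *weights,      // shared model, [vocab x dim]
                               const int *word_ids, // word updated by each example
                               const float *grads,  // per-example gradient, [n x dim]
                               int n, int dim, float lr) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float *row     = weights + (size_t)word_ids[i] * dim; // unguarded shared row
    const float *g = grads   + (size_t)i * dim;
    for (int d = 0; d < dim; ++d)
        row[d] -= lr * g[d]; // plain read-modify-write; races are simply ignored
}
```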
  • they also adopt Intel's minibatching technique of sharing negative samples across a minibatch, which batches many vector-vector products into matrix-matrix multiplies (see the GEMM sketch below)
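A sketch of the shared-negative-samples idea, assuming column-major cuBLAS conventions; the function name and shapes are illustrative, not from the paper. Sharing one set of k negative samples across the whole minibatch turns batch*(k+1) independent dot products into a single small GEMM.

```cuda
#include <cublas_v2.h>

// in:  d_inputs  [dim x batch]   input vectors of the minibatch
//      d_negs    [dim x (k+1)]   shared negative samples + the positive target
// out: d_scores  [(k+1) x batch] dot product for every (sample, word) pair
void batched_scores(cublasHandle_t h, const float *d_inputs,
                    const float *d_negs, float *d_scores,
                    int dim, int batch, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    // scores = negs^T * inputs : one GEMM replaces batch*(k+1) dot products
    cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_N,
                k + 1, batch, dim,
                &alpha, d_negs, dim, d_inputs, dim,
                &beta, d_scores, k + 1);
}
```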
  • they implement and compare the following two kernel mappings
    • one CTA (thread block) per word
      • each thread maps to one vector dimension
      • peak parallelism and throughput
      • but reduced accuracy, since many words being updated concurrently raises the probability of conflicts
    • one CTA per sentence (sketched after this list)
      • each thread again maps to one vector dimension
      • medium throughput
      • fewer concurrent updates mean fewer conflicts, which gives better accuracy
      • but note: processing more sentences at the same time again increases the chance of conflicts, so throughput trades off against accuracy
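A much-simplified sketch of the block-per-sentence mapping. Hypothetical simplifications not in the paper: positive pairs only (no negative sampling), no precomputed sigmoid table, dim <= blockDim.x with blockDim.x a power of two. One thread block walks one sentence, and each thread owns one dimension of the vectors.

```cuda
// Launch as: sentence_kernel<<<n_sentences, 128, 128 * sizeof(float)>>>(...)
__global__ void sentence_kernel(float *w_in, float *w_out,
                                const int *sents,   // flattened word ids
                                const int *offsets, // sentence boundaries
                                int dim, int window, float lr) {
    extern __shared__ float sh[];          // blockDim.x floats for the reduction
    int s   = blockIdx.x;                  // one sentence per thread block
    int d   = threadIdx.x;                 // one vector dimension per thread
    int beg = offsets[s], end = offsets[s + 1];

    for (int c = beg; c < end; ++c) {                      // center position
        for (int t = max(beg, c - window);
             t < min(end, c + window + 1); ++t) {          // context position
            if (t == c) continue;
            float *vi = w_in  + (size_t)sents[c] * dim;
            float *vo = w_out + (size_t)sents[t] * dim;

            // block-wide dot product vi . vo via shared-memory reduction
            sh[d] = (d < dim) ? vi[d] * vo[d] : 0.0f;
            __syncthreads();
            for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
                if (d < stride) sh[d] += sh[d + stride];
                __syncthreads();
            }
            // gradient scale for a positive pair: lr * (1 - sigmoid(dot))
            float g = lr * (1.0f - 1.0f / (1.0f + expf(-sh[0])));
            __syncthreads();               // all threads have read sh[0]

            if (d < dim) {                 // Hogwild-style unsynchronized writes
                float tmp = vi[d];
                vi[d] += g * vo[d];
                vo[d] += g * tmp;
            }
            __syncthreads();
        }
    }
}
```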
  • distributed training
    • data parallelism only: each GPU holds a full copy of the model and trains on its own shard of the corpus
    • model weights are averaged across GPUs with ncclAllReduce (sketch below)
    • synchronization happens at the end of each epoch
    • they observe reduced accuracy as more GPUs are added (noticeably beyond 4), presumably because the per-GPU models drift between the infrequent synchronizations
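A sketch of the epoch-end synchronization, assuming one process per GPU and an already-initialized NCCL communicator; function names and the averaging kernel are illustrative, not from the paper. Each embedding matrix is summed across ranks with ncclAllReduce and then rescaled to an average.

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

// Grid-stride elementwise scaling, used to turn the summed weights into an average.
__global__ void scale(float *w, size_t n, float s) {
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        w[i] *= s;
}

// Called by every rank at the end of an epoch (hypothetical wrapper).
void sync_epoch(ncclComm_t comm, cudaStream_t stream,
                float *d_w_in, float *d_w_out,
                size_t n_elems, int n_gpus) {
    // in-place sum across all ranks; same model shape on every GPU
    ncclAllReduce(d_w_in,  d_w_in,  n_elems, ncclFloat, ncclSum, comm, stream);
    ncclAllReduce(d_w_out, d_w_out, n_elems, ncclFloat, ncclSum, comm, stream);

    // average: divide the summed weights by the number of GPUs
    float inv = 1.0f / n_gpus;
    scale<<<1024, 256, 0, stream>>>(d_w_in,  n_elems, inv);
    scale<<<1024, 256, 0, stream>>>(d_w_out, n_elems, inv);
    cudaStreamSynchronize(stream);
}
```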