Summary of this paper on parallelizing word2vec model training on GPUs.

  • fast skip-gram hierarchical softmax (SGHS) and skip-gram negative sampling (SGNS) implementations
  • allocates one threadblock per sentence in the batch
  • runs a single kernel for the forward and backward passes, with __syncthreads() calls separating the two
  • shares the negative samples across all word pairs in the current window (see the kernel sketch after this list)
  • uses a custom 4x8 tiled matrix multiplication implementation (see the tiling sketch below)
  • word pre-processing happens entirely on the multi-threaded CPU
  • they assume the CPU pre-processing overlaps completely with the GPU computation, so it adds no wall-clock cost (see the overlap sketch below)
  • code can be found here
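
A minimal sketch (not the authors' code) of the kernel structure described above: one threadblock per sentence, forward and backward passes in a single kernel separated by __syncthreads(), and negative samples shared across the window. All names (`words`, `sent_off`, `negs`, `Win`, `Wout`) and the constants are assumptions, as is the convention that blockDim.x == EMB so each thread owns one embedding component:

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define EMB    128  // embedding dimension (power of two for the reduction); assumed
#define NEG    5    // negative samples shared per target word; assumed
#define WINDOW 5    // half-window size; assumed

__device__ float block_dot(const float *a, const float *b, float *scratch) {
    // each of the EMB threads contributes one product, then tree-reduce
    scratch[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    __syncthreads();
    for (int s = EMB / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) scratch[threadIdx.x] += scratch[threadIdx.x + s];
        __syncthreads();
    }
    float v = scratch[0];
    __syncthreads();  // scratch is safe to reuse after this
    return v;
}

__global__ void sgns_sentence_kernel(const int *words, const int *sent_off,
                                     const int *negs, float *Win, float *Wout,
                                     float lr) {
    __shared__ float in_vec[EMB];    // input embedding of the target word
    __shared__ float in_grad[EMB];   // accumulated gradient for that embedding
    __shared__ float scratch[EMB];   // reduction scratch space

    int s = blockIdx.x;                          // one threadblock per sentence
    for (int t = sent_off[s]; t < sent_off[s + 1]; ++t) {
        int target = words[t];
        in_vec[threadIdx.x]  = Win[target * EMB + threadIdx.x];
        in_grad[threadIdx.x] = 0.f;
        __syncthreads();

        // the NEG negatives are drawn once per target word (here assumed
        // pre-sampled into negs[]) and reused for every pair in its window
        for (int c = t - WINDOW; c <= t + WINDOW; ++c) {
            if (c == t || c < sent_off[s] || c >= sent_off[s + 1]) continue;
            for (int k = 0; k < NEG + 1; ++k) {
                int   out = (k == 0) ? words[c] : negs[t * NEG + k - 1];
                float lbl = (k == 0) ? 1.f : 0.f;
                // ---- forward: score the (input, output) pair ----
                float dot = block_dot(in_vec, &Wout[out * EMB], scratch);
                float g   = lr * (lbl - 1.f / (1.f + expf(-dot)));
                __syncthreads();                 // forward done before backward
                // ---- backward: update output row, accumulate input grad ----
                in_grad[threadIdx.x]          += g * Wout[out * EMB + threadIdx.x];
                Wout[out * EMB + threadIdx.x] += g * in_vec[threadIdx.x];
                __syncthreads();
            }
        }
        // cross-block updates race Hogwild-style, as is usual for word2vec
        Win[target * EMB + threadIdx.x] += in_grad[threadIdx.x];
        __syncthreads();
    }
}
```

A launch would use one block per sentence, e.g. `sgns_sentence_kernel<<<num_sentences, EMB>>>(...)`.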
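
For the 4x8 tiling, here is an illustrative register-tiled GEMM, not the authors' implementation (which likely also stages operands through shared memory): each thread computes a 4-row x 8-column tile of C = A * B, keeping the 32 partial sums in registers. Shapes and names are assumptions:

```cuda
#include <cuda_runtime.h>

#define TM 4   // rows of C per thread
#define TN 8   // cols of C per thread

__global__ void gemm_4x8(const float *A, const float *B, float *C,
                         int M, int N, int K) {
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * TM;  // first C row
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * TN;  // first C col
    if (row0 >= M || col0 >= N) return;

    float acc[TM][TN] = {};            // 4x8 accumulator held in registers
    for (int k = 0; k < K; ++k) {
        float a[TM], b[TN];
        for (int i = 0; i < TM; ++i)   // a 4-element column slice of A
            a[i] = (row0 + i < M) ? A[(row0 + i) * K + k] : 0.f;
        for (int j = 0; j < TN; ++j)   // an 8-element row slice of B
            b[j] = (col0 + j < N) ? B[k * N + col0 + j] : 0.f;
        for (int i = 0; i < TM; ++i)   // rank-1 update of the register tile
            for (int j = 0; j < TN; ++j)
                acc[i][j] += a[i] * b[j];
    }
    for (int i = 0; i < TM; ++i)       // write the finished 4x8 tile
        for (int j = 0; j < TN; ++j)
            if (row0 + i < M && col0 + j < N)
                C[(row0 + i) * N + col0 + j] = acc[i][j];
}
```

The per-thread tile amortizes each loaded element of A over 8 multiplies and each element of B over 4, which is the usual motivation for this kind of tiling.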
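
And a sketch of how the assumed CPU/GPU overlap could look on the host side: double-buffered pinned memory and streams, with the CPU preprocessing batch i+1 while the GPU trains on batch i. `preprocess_batch()` and `launch_train()` are hypothetical placeholders for the multi-threaded CPU pipeline and the kernel launch:

```cuda
#include <cuda_runtime.h>

void preprocess_batch(int i, int *h_words, int *h_off);              // hypothetical
void launch_train(const int *d_words, const int *d_off, cudaStream_t s);  // hypothetical

void train(int num_batches, size_t words_sz, size_t off_sz) {
    int *h_words[2], *h_off[2], *d_words[2], *d_off[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {                  // double buffers
        cudaMallocHost(&h_words[b], words_sz);     // pinned, for async copies
        cudaMallocHost(&h_off[b], off_sz);
        cudaMalloc(&d_words[b], words_sz);
        cudaMalloc(&d_off[b], off_sz);
        cudaStreamCreate(&stream[b]);
    }
    preprocess_batch(0, h_words[0], h_off[0]);     // prime the pipeline
    for (int i = 0; i < num_batches; ++i) {
        int b = i & 1;
        cudaMemcpyAsync(d_words[b], h_words[b], words_sz,
                        cudaMemcpyHostToDevice, stream[b]);
        cudaMemcpyAsync(d_off[b], h_off[b], off_sz,
                        cudaMemcpyHostToDevice, stream[b]);
        launch_train(d_words[b], d_off[b], stream[b]);   // async on stream[b]
        if (i + 1 < num_batches)                   // CPU works while GPU trains
            preprocess_batch(i + 1, h_words[1 - b], h_off[1 - b]);
        cudaStreamSynchronize(stream[b]);          // batch i finished
    }
}
```

If preprocessing a batch takes no longer than training one, the CPU cost is fully hidden, which is exactly the paper's stated assumption.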