Summary of this paper on parallelizing word2vec model training on GPUs.

  • fast skip-gram hierarchical softmax (SGHS) and skip-gram negative sampling (SGNS) implementations
  • allocates one threadblock per sentence in the batch
  • runs a single kernel for the forward and backward passes, with __syncthreads() calls separating the two
  • shares the negative samples across all word pairs in the current window (see the kernel sketch after this list)
  • uses a custom 4x8 tiled matrix multiplication implementation (see the tiling sketch below)
  • word pre-processing happens entirely on the multi-threaded CPU
  • they assume the CPU pre-processing overlaps completely with the GPU computation, so it adds no wall-clock cost (see the overlap sketch below)
  • code can be found here
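
A minimal sketch (not the authors' code) of the kernel structure described above: one threadblock per sentence, forward and backward passes in a single kernel separated by __syncthreads(), and negative samples shared across the window. All names (`words`, `sent_off`, `negs`, `Win`, `Wout`) and the constants are assumptions, as is the convention that blockDim.x == EMB so each thread owns one embedding component:

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define EMB    128  // embedding dimension (power of two for the reduction); assumed
#define NEG    5    // negative samples shared per target word; assumed
#define WINDOW 5    // half-window size; assumed

__device__ float block_dot(const float *a, const float *b, float *scratch) {
    // each of the EMB threads contributes one product, then tree-reduce
    scratch[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    __syncthreads();
    for (int s = EMB / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) scratch[threadIdx.x] += scratch[threadIdx.x + s];
        __syncthreads();
    }
    float v = scratch[0];
    __syncthreads();  // scratch is safe to reuse after this
    return v;
}

__global__ void sgns_sentence_kernel(const int *words, const int *sent_off,
                                     const int *negs, float *Win, float *Wout,
                                     float lr) {
    __shared__ float in_vec[EMB];    // input embedding of the target word
    __shared__ float in_grad[EMB];   // accumulated gradient for that embedding
    __shared__ float scratch[EMB];   // reduction scratch space

    int s = blockIdx.x;                          // one threadblock per sentence
    for (int t = sent_off[s]; t < sent_off[s + 1]; ++t) {
        int target = words[t];
        in_vec[threadIdx.x]  = Win[target * EMB + threadIdx.x];
        in_grad[threadIdx.x] = 0.f;
        __syncthreads();

        // the NEG negatives are drawn once per target word (here assumed
        // pre-sampled into negs[]) and reused for every pair in its window
        for (int c = t - WINDOW; c <= t + WINDOW; ++c) {
            if (c == t || c < sent_off[s] || c >= sent_off[s + 1]) continue;
            for (int k = 0; k < NEG + 1; ++k) {
                int   out = (k == 0) ? words[c] : negs[t * NEG + k - 1];
                float lbl = (k == 0) ? 1.f : 0.f;
                // ---- forward: score the (input, output) pair ----
                float dot = block_dot(in_vec, &Wout[out * EMB], scratch);
                float g   = lr * (lbl - 1.f / (1.f + expf(-dot)));
                __syncthreads();                 // forward done before backward
                // ---- backward: update output row, accumulate input grad ----
                in_grad[threadIdx.x]          += g * Wout[out * EMB + threadIdx.x];
                Wout[out * EMB + threadIdx.x] += g * in_vec[threadIdx.x];
                __syncthreads();
            }
        }
        // cross-block updates race Hogwild-style, as is usual for word2vec
        Win[target * EMB + threadIdx.x] += in_grad[threadIdx.x];
        __syncthreads();
    }
}
```

A launch would use one block per sentence, e.g. `sgns_sentence_kernel<<<num_sentences, EMB>>>(...)`.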
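
For the 4x8 tiling, here is an illustrative register-tiled GEMM, not the authors' implementation (which likely also stages operands through shared memory): each thread computes a 4-row x 8-column tile of C = A * B, keeping the 32 partial sums in registers. Shapes and names are assumptions:

```cuda
#include <cuda_runtime.h>

#define TM 4   // rows of C per thread
#define TN 8   // cols of C per thread

__global__ void gemm_4x8(const float *A, const float *B, float *C,
                         int M, int N, int K) {
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * TM;  // first C row
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * TN;  // first C col
    if (row0 >= M || col0 >= N) return;

    float acc[TM][TN] = {};            // 4x8 accumulator held in registers
    for (int k = 0; k < K; ++k) {
        float a[TM], b[TN];
        for (int i = 0; i < TM; ++i)   // a 4-element column slice of A
            a[i] = (row0 + i < M) ? A[(row0 + i) * K + k] : 0.f;
        for (int j = 0; j < TN; ++j)   // an 8-element row slice of B
            b[j] = (col0 + j < N) ? B[k * N + col0 + j] : 0.f;
        for (int i = 0; i < TM; ++i)   // rank-1 update of the register tile
            for (int j = 0; j < TN; ++j)
                acc[i][j] += a[i] * b[j];
    }
    for (int i = 0; i < TM; ++i)       // write the finished 4x8 tile
        for (int j = 0; j < TN; ++j)
            if (row0 + i < M && col0 + j < N)
                C[(row0 + i) * N + col0 + j] = acc[i][j];
}
```

The per-thread tile amortizes each loaded element of A over 8 multiplies and each element of B over 4, which is the usual motivation for this kind of tiling.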
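
And a sketch of how the assumed CPU/GPU overlap could look on the host side: double-buffered pinned memory and streams, with the CPU preprocessing batch i+1 while the GPU trains on batch i. `preprocess_batch()` and `launch_train()` are hypothetical placeholders for the multi-threaded CPU pipeline and the kernel launch:

```cuda
#include <cuda_runtime.h>

void preprocess_batch(int i, int *h_words, int *h_off);              // hypothetical
void launch_train(const int *d_words, const int *d_off, cudaStream_t s);  // hypothetical

void train(int num_batches, size_t words_sz, size_t off_sz) {
    int *h_words[2], *h_off[2], *d_words[2], *d_off[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {                  // double buffers
        cudaMallocHost(&h_words[b], words_sz);     // pinned, for async copies
        cudaMallocHost(&h_off[b], off_sz);
        cudaMalloc(&d_words[b], words_sz);
        cudaMalloc(&d_off[b], off_sz);
        cudaStreamCreate(&stream[b]);
    }
    preprocess_batch(0, h_words[0], h_off[0]);     // prime the pipeline
    for (int i = 0; i < num_batches; ++i) {
        int b = i & 1;
        cudaMemcpyAsync(d_words[b], h_words[b], words_sz,
                        cudaMemcpyHostToDevice, stream[b]);
        cudaMemcpyAsync(d_off[b], h_off[b], off_sz,
                        cudaMemcpyHostToDevice, stream[b]);
        launch_train(d_words[b], d_off[b], stream[b]);   // async on stream[b]
        if (i + 1 < num_batches)                   // CPU works while GPU trains
            preprocess_batch(i + 1, h_words[1 - b], h_off[1 - b]);
        cudaStreamSynchronize(stream[b]);          // batch i finished
    }
}
```

If preprocessing a batch takes no longer than training one, the CPU cost is fully hidden, which is exactly the paper's stated assumption.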