Summary of this paper on parallelizing word2vec model training on Intel CPUs.

  • convert level-1 BLAS operations (vector-vector dot products) to level-3 BLAS (matrix-matrix multiplies) by
    • sharing negative samples
    • grouping multiple input context words for a given target word
  • for scaling to multiple nodes
    • model update frequency is tied to word frequency
    • reduce the starting learning rate as the number of nodes increases
    • m-weighted sampling updates
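
The level-1 to level-3 conversion in the first bullet can be sketched in NumPy. This is only an illustration under stated assumptions, not the paper's implementation: the function name batched_sgns_step, the matrices W_in and W_out, and the choice to compute all gradients from one snapshot of the weights are hypothetical. The idea shown is that when B context words share the same target word and the same K negative samples, the B x (K+1) individual dot products collapse into a single matrix-matrix product.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batched_sgns_step(W_in, W_out, context_ids, target_id, negative_ids, lr):
    """One skip-gram negative-sampling update for a group of context words.

    Hypothetical sketch: all B context words share one target word and the
    same K negative samples, so scores and gradients become matrix products.
    W_in and W_out are updated in place.
    """
    # Output-side rows: the true target first, then the shared negatives.
    out_ids = np.concatenate(([target_id], negative_ids))  # shape (K+1,)
    X = W_in[context_ids]                                   # (B, D) copy
    Y = W_out[out_ids]                                      # (K+1, D) copy

    # Label 1 for the target column, 0 for every negative sample.
    labels = np.zeros(len(out_ids))
    labels[0] = 1.0

    scores = X @ Y.T                                 # (B, K+1), one GEMM
    G = lr * (labels[None, :] - sigmoid(scores))     # scaled error terms

    dX = G @ Y                                       # (B, D), second GEMM
    dY = G.T @ X                                     # (K+1, D), third GEMM

    # np.add.at accumulates correctly even if an index appears twice.
    np.add.at(W_in, context_ids, dX)
    np.add.at(W_out, out_ids, dY)
```

Without sharing, each (context word, sample) pair would need its own dot product and its own vector update; here the three products map onto matrix-matrix routines that make better use of cache and vector units. Note that this sketch applies every update from the same snapshot of the weights, which differs slightly from updating after each pair.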