A summary of this and this tutorial on word2vec, plus comments on the original C implementation of the word2vec model

Summary

  • learning task
    • input is a one-hot vector of the vocabulary
    • hidden layer dimension = 300 (based on Google's popular word2vec model trained on a 1B-word dataset)
    • output is a softmax layer with the same dimension as the input
      • predicts the input word's neighboring words
  • after training, just keep the weights leading into the hidden layer: those rows are the word vectors!
  • but training this naively would be computationally expensive
  • and it would also overfit on small datasets
  • hence, the paper suggests
    • treating common word-pairs as single words
    • subsampling frequently occurring words during training
    • negative sampling, so that only a small fraction of the weights are updated during backprop
  • the subsampling probability is a function of each word's corpus frequency
  • negative samples are also drawn according to word frequency, raised to the 3/4 power (see the sketch below)
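
The two frequency-based formulas, as described in the tutorials, look roughly like this in C (the function and variable names are mine, not from the original code):

    #include <math.h>

    /* Probability of KEEPING a word during subsampling. z is the word's
     * fraction of the total corpus, and sample is the threshold (e.g. 0.001). */
    double keep_probability(double z, double sample) {
      return (sqrt(z / sample) + 1.0) * (sample / z);
    }

    /* Unnormalized negative-sampling weight: the raw word count raised to 3/4.
     * A word is drawn as a negative sample proportionally to this value,
     * normalized over the whole vocabulary. */
    double negative_sampling_weight(long long count) {
      return pow((double)count, 0.75);
    }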

Code annotations

word2phrase.c

  • code
  • params
    • threshold - bigram score above which two adjacent words are merged into a phrase (see the sketch after this list)
    • min_count - discard words appearing fewer than this many times
  • has a custom hash function to efficiently look up vocabulary
  • if the vocab size reaches 70% of the hash-table size, infrequent words are pruned from the vocab to reduce hash collisions
  • finally writes out the "grouped" version of the input dataset, with merged word pairs joined by "_"
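
Roughly, the bigram score that word2phrase.c computes for a pair "a b" looks like the sketch below; this is a paraphrase of the original code, not a verbatim copy:

    /* pa, pb      - unigram counts of the two words
     * pab         - count of the bigram "a b"
     * min_count   - also acts as a discount on the bigram count
     * train_words - total number of tokens in the corpus */
    double phrase_score(long long pa, long long pb, long long pab,
                        long long min_count, long long train_words) {
      if (pa < min_count || pb < min_count) return 0.0;  /* rare words are skipped */
      return (double)(pab - min_count) / (double)pa / (double)pb * (double)train_words;
    }

The pair is then rewritten as the single token "a_b" whenever this score exceeds threshold.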

word2vec.c

  • code
  • most of the basic primitives are shared with word2phrase.c
  • the core routine is TrainModelThread, which TrainModel launches on multiple POSIX threads
  • the learning rate decays linearly over the course of training, updated every few thousand processed words
  • subsampling of frequent words is done through an empirical equation evaluated for every token
    • this could be computed and cached per vocabulary word before the training loop starts (see the sketch after this list)
  • for hierarchical softmax, CreateBinaryTree builds a Huffman tree over the vocabulary, so frequent words get shorter codes
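
Regarding the caching note above: a minimal sketch of precomputing the subsampling keep-probability once per vocabulary word (precompute_keep_prob and keep_prob are hypothetical names, not part of the original code):

    #include <math.h>

    /* counts[i] corresponds to vocab[i].cn; sample and train_words are the
     * same parameters word2vec.c uses. */
    void precompute_keep_prob(double *keep_prob, const long long *counts,
                              long long vocab_size, double sample,
                              long long train_words) {
      for (long long i = 0; i < vocab_size; i++) {
        double cn = (double)counts[i];
        keep_prob[i] = (sqrt(cn / (sample * train_words)) + 1.0)
                       * (sample * train_words) / cn;
      }
    }

Inside the training loop, a token would then be skipped when keep_prob[word] < (next_random & 0xFFFF) / 65536.0, which mirrors the original per-token check.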