A summary of this and this tutorial on word2vec, plus comments on the original C implementation of the word2vec model

Summary

  • learning task
    • input is a one-hot vector of the vocabulary
    • hidden layer dimension = 300 (based on Google's popular word2vec model trained on a 1B-word dataset)
    • output is a softmax layer with the same dimension as the input
      • predicts the input word's neighboring words
  • after training, just keep the weights leading into the hidden layer: those rows are the word vectors!
  • but training this naively would be computationally expensive
  • and it would also overfit on small datasets
  • hence, the paper suggests
    • treating common word-pairs as single words
    • subsampling frequently occurring words during training
    • negative sampling, so that only a small fraction of the weights are updated during backprop
  • the subsampling probability is a function of each word's corpus frequency
  • negative samples are also drawn according to word frequency, raised to the 3/4 power (see the sketch below)
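
The two frequency-based formulas, as described in the tutorials, look roughly like this in C (the function and variable names are mine, not from the original code):

    #include <math.h>

    /* Probability of KEEPING a word during subsampling. z is the word's
     * fraction of the total corpus, and sample is the threshold (e.g. 0.001). */
    double keep_probability(double z, double sample) {
      return (sqrt(z / sample) + 1.0) * (sample / z);
    }

    /* Unnormalized negative-sampling weight: the raw word count raised to 3/4.
     * A word is drawn as a negative sample proportionally to this value,
     * normalized over the whole vocabulary. */
    double negative_sampling_weight(long long count) {
      return pow((double)count, 0.75);
    }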

Code annotations

word2phrase.c

  • code
  • params
    • threshold - bigram score above which two adjacent words are merged into a phrase (see the sketch after this list)
    • min_count - discard words appearing fewer than this many times
  • has a custom hash function to efficiently look up vocabulary
  • if the vocab size reaches 70% of the hash-table size, infrequent words are pruned from the vocab to reduce hash collisions
  • finally writes out the "grouped" version of the input dataset, with merged word pairs joined by "_"
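
Roughly, the bigram score that word2phrase.c computes for a pair "a b" looks like the sketch below; this is a paraphrase of the original code, not a verbatim copy:

    /* pa, pb      - unigram counts of the two words
     * pab         - count of the bigram "a b"
     * min_count   - also acts as a discount on the bigram count
     * train_words - total number of tokens in the corpus */
    double phrase_score(long long pa, long long pb, long long pab,
                        long long min_count, long long train_words) {
      if (pa < min_count || pb < min_count) return 0.0;  /* rare words are skipped */
      return (double)(pab - min_count) / (double)pa / (double)pb * (double)train_words;
    }

The pair is then rewritten as the single token "a_b" whenever this score exceeds threshold.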

word2vec.c

  • code
  • most of the basic primitives are shared with word2phrase.c
  • the core routine is TrainModelThread, which TrainModel launches on multiple POSIX threads
  • the learning rate decays linearly over the course of training, updated every few thousand processed words
  • subsampling of frequent words is done through an empirical equation evaluated for every token
    • this could be computed and cached per vocabulary word before the training loop starts (see the sketch after this list)
  • for hierarchical softmax, CreateBinaryTree builds a Huffman tree over the vocabulary, so frequent words get shorter codes
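
Regarding the caching note above: a minimal sketch of precomputing the subsampling keep-probability once per vocabulary word (precompute_keep_prob and keep_prob are hypothetical names, not part of the original code):

    #include <math.h>

    /* counts[i] corresponds to vocab[i].cn; sample and train_words are the
     * same parameters word2vec.c uses. */
    void precompute_keep_prob(double *keep_prob, const long long *counts,
                              long long vocab_size, double sample,
                              long long train_words) {
      for (long long i = 0; i < vocab_size; i++) {
        double cn = (double)counts[i];
        keep_prob[i] = (sqrt(cn / (sample * train_words)) + 1.0)
                       * (sample * train_words) / cn;
      }
    }

Inside the training loop, a token would then be skipped when keep_prob[word] < (next_random & 0xFFFF) / 65536.0, which mirrors the original per-token check.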