MLE and KLD
- Assume that $$P_r(x)$$ is the real distribution to be approximated
- MLE = $$\max_\Theta \frac{1}{m} \sum_{i=1}^m \log P_\Theta(x_i)$$
- in the limit $$m \to \infty$$, this is the same as minimizing $$KL(P_r \| P_\Theta)$$! (sketched numerically below)
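A minimal numerical sketch of this claim, assuming $$P_r$$ is Gaussian so that both the MLE and the KL divergence have closed forms (all variable names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_r, sigma_r = 2.0, 1.5          # parameters of the "real" distribution P_r

def kl_gauss(mu1, s1, mu2, s2):
    # closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

for m in [10, 100, 10_000, 1_000_000]:
    x = rng.normal(mu_r, sigma_r, size=m)    # x_i ~ P_r
    # for a Gaussian model P_Theta, the MLE is the sample mean / sample std
    mu_hat, sigma_hat = x.mean(), x.std()
    print(m, kl_gauss(mu_r, sigma_r, mu_hat, sigma_hat))
# KL(P_r || P_Theta*) shrinks toward 0 as m grows
```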
Equations
- $$\lim_{m \to \infty} \max_\Theta \frac{1}{m} \sum_{i=1}^m \log P_\Theta(x_i)$$
- $$\max_\Theta \int P_r(x) \log P_\Theta(x)\, dx$$ (by the law of large numbers, the sample average converges to $$\mathbb{E}_{x \sim P_r}[\log P_\Theta(x)]$$)
- $$\min_\Theta \left( -\int P_r(x) \log P_\Theta(x)\, dx \right)$$
- $$\min_\Theta \left( \int P_r(x) \log P_r(x)\, dx - \int P_r(x) \log P_\Theta(x)\, dx \right)$$ (adding $$\int P_r(x) \log P_r(x)\, dx$$, a term independent of $$\Theta$$, does not change the minimizer)
- $$\min_\Theta \int P_r(x) \log \frac{P_r(x)}{P_\Theta(x)}\, dx$$
- The last expression is exactly $$\min_\Theta KL(P_r \,\|\, P_\Theta)$$, which is the same as minimizing KLD!
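A quick Monte Carlo check of the key identity behind this chain, $$\frac{1}{m}\sum_i \log P_\Theta(x_i) \to \mathbb{E}_{P_r}[\log P_\Theta(x)] = -H(P_r) - KL(P_r \| P_\Theta)$$, for two Gaussians (the parameter values are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu_r, s_r = 0.0, 1.0       # P_r
mu_t, s_t = 1.0, 2.0       # P_Theta (arbitrary, not the optimum)

x = rng.normal(mu_r, s_r, size=1_000_000)          # x_i ~ P_r
avg_loglik = norm.logpdf(x, mu_t, s_t).mean()      # (1/m) sum_i log P_Theta(x_i)

entropy_r = 0.5 * np.log(2 * np.pi * np.e * s_r**2)     # H(P_r) for a Gaussian
kl = np.log(s_t / s_r) + (s_r**2 + (mu_r - mu_t)**2) / (2 * s_t**2) - 0.5

print(avg_loglik, -entropy_r - kl)   # the two agree (up to Monte Carlo noise)
```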
Issues with KLD
- KLD has numerical stability issues when $$P_\Theta(x)$$ is close to zero while $$P_r(x)$$ is not: the integrand $$P_r(x) \log \frac{P_r(x)}{P_\Theta(x)}$$ blows up (see the sketch after this list)
- This is typically solved by adding noise to $$P_\Theta$$, which spreads its support so the density ratio stays finite
- Sampling from $$P_\Theta$$ is computationally expensive
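A small sketch of the blow-up and the noise fix, using two narrow Gaussians with nearly disjoint support on a discrete grid (the smoothing kernel width is an arbitrary choice):

```python
import numpy as np

xs = np.linspace(-5, 5, 1001)
dx = xs[1] - xs[0]

def gauss(xs, mu, s):
    p = np.exp(-0.5 * ((xs - mu) / s) ** 2)
    return p / (p.sum() * dx)          # normalize on the grid

p_r     = gauss(xs, -2.0, 0.3)         # P_r concentrated on the left
p_theta = gauss(xs,  2.0, 0.3)         # P_Theta concentrated on the right

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]) * dx)

print(kl(p_r, p_theta))      # enormous: P_Theta ~ 0 wherever P_r has mass

# the "fix": convolve P_Theta with Gaussian noise so it has mass everywhere
kernel = gauss(xs, 0.0, 1.0)
p_smooth = np.convolve(p_theta, kernel, mode="same") * dx
p_smooth /= p_smooth.sum() * dx
print(kl(p_r, p_smooth))     # finite and well-behaved
```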
Conclusion
- Thus, it is better to learn a function $$g_\Theta$$ which can transform a given (simple) distribution, e.g. $$z \sim N(0, I)$$, into $$P_\Theta$$, i.e. $$P_\Theta \approx g_\Theta(z)$$
- This is the basis of GANs :)
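A toy version of this pushforward idea, with a hand-picked affine $$g_\Theta$$ standing in for the neural network a GAN would learn: samples from a simple $$z$$ distribution are passed through $$g_\Theta$$, and the outputs are distributed as $$P_\Theta$$ without ever evaluating a density.

```python
import numpy as np

rng = np.random.default_rng(0)

# g_Theta: here just an affine map; in a GAN it would be a learned network
theta = {"w": 1.5, "b": 2.0}

def g(z, theta):
    return theta["w"] * z + theta["b"]

z = rng.normal(0.0, 1.0, size=100_000)   # z ~ N(0, 1), the "given" distribution
x = g(z, theta)                           # x ~ P_Theta = N(2.0, 1.5^2)

print(x.mean(), x.std())   # ~2.0, ~1.5: sampling is one cheap forward pass
```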