Main paper can be found here.

there are multiple approaches given here
the simplest in terms of cuda is the one-pass sampling
- generate exponentially distributed random numbers
- scale them wrt the input weights of the data to be sampled
- sort this scaled array
- pick the items corresponding to the indices of the top-k items in this sorted array
the issue is that this thing cannot be directly used for permuting arrays!