GPU Multisplit
GPU Multisplit
Main paper can be found here and their reference implementation is here.
- efficiently bucketizing arrays based on keys
- only small number of buckets are considered here (2 - 64)
- for these cases, such an implementation will be much faster than full sort
- Idea is to split the algo into 3 phases
- local - compute bucket histograms and their scans
- global - compute global location of each element
- local - compute local location of each element and perform scatter
- They also perform local multisplit to reduce memory divergence!