GPU Multisplit

Main paper can be found here and their reference implementation is here.

  • efficiently bucketizing arrays based on keys
  • only small number of buckets are considered here (2 - 64)
  • for these cases, such an implementation will be much faster than full sort
  • Idea is to split the algo into 3 phases
    • local - compute bucket histograms and their scans
    • global - compute global location of each element
    • local - compute local location of each element and perform scatter
  • They also perform local multisplit to reduce memory divergence!