Single-pass Parallel Prefix Scan with Decoupled Look-back
Single-pass Parallel Prefix Scan with Decoupled Look-back
Main paper can be found here.
- requires only ~2n memory accesses
- achieves SOL on memory BW
- launch as many CTAs on the chip as the occupancy limits
- spread the work on these CTAs so as to cover the whole input array
- have each block compute upsweep phase of the prefix scan
- store the aggregate as well as inclusive prefix sum of these aggregates from preceding CTAs
- use a 3-valued ternary flag to indicate invalid/aggregate-done/prefix-done
- use this flag from preceding CTAs to compute the prefix-sum for this CTA to finish the task