Main paper can be found here.

proposal to parallelize 2-opt and 3-opt swaps on GPUs, over-and-above the parallelization strategy discussed in [this]({{ site.baseurl }}{% post_url 2020-12-06-parallel-tsp %}) paper.
they also further note that due to global memory access latencies, it is faster to recompute the distances by each thread, instead of pre-computing them.

make each block work on a particular climber and have each thread inside it work on multiple possible pairs (or triplets).
then finally do a block-wide reduction to compute the best swap.
this approach helps them to utilize shared memory for storing the city co-ordinates, as well as for storing the best path
they also note that pre-computing the distance matrix and accessing them while doing 2-opt/3-opt swaps leads to random accesses and thereby causing perf degradation.
thus, instead they propose to recompute this value from the input coordinates by all threads during the process of swap
the number of cities they can support with this approach is only limited by the available shared memory capacity on the chip.
the authors however do not talk about the number of climbers used in their code or even the algorithm used by them (I think it is IHC as used in the base paper mentioned above).