The key idea is to transform global synchronization into global communication so that conflicts are serialized at the thread block level.

GPUs do not have coherent caches, so under massive lock contention every contended lock access must be resolved at the last-level cache.

This paper views the GPU chip as a distributed system, where each thread block (TB) is a node equipped with a fast private memory (i.e., scratchpad memory), and the nodes share a communication medium (i.e., global memory). Accordingly, our architecture splits a baseline GPU kernel into two types of thread blocks that run concurrently on the GPU.

Designating one or more threads as servers is a widely used technique on multi-core and many-core CPUs, as well as in databases. These solutions share the principle of transforming synchronization into communication.
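To make the server principle concrete, the following is a minimal CPU-side sketch in C++, not the GPU design described in this paper. It assumes a single server thread that owns a shared counter and one single-slot mailbox per client. The names `Mailbox` and `run_delegation` are hypothetical and chosen only for illustration. Clients never lock the counter; they post a request and wait for the server to acknowledge it, so all conflicting updates are serialized by the server.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-client mailbox: one writer (the client), one reader (the server).
struct Mailbox {
    std::atomic<bool> full{false};  // true while a request is waiting for the server
    std::int64_t delta = 0;         // request payload: amount to add to the counter
};

// Starts `num_clients` client threads that each delegate `ops_per_client`
// increments to a single server thread, and returns the final counter value.
std::int64_t run_delegation(int num_clients, int ops_per_client) {
    std::vector<Mailbox> boxes(num_clients);
    std::atomic<int> clients_done{0};
    std::int64_t counter = 0;  // touched only by the server thread, so no lock is needed

    std::thread server([&] {
        for (;;) {
            bool served = false;
            for (auto& box : boxes) {
                if (box.full.load(std::memory_order_acquire)) {
                    counter += box.delta;  // apply the request on the client's behalf
                    box.full.store(false, std::memory_order_release);  // acknowledge
                    served = true;
                }
            }
            if (!served) {
                // A client counts itself done only after its last request was
                // acknowledged, so no request can still be pending here.
                if (clients_done.load(std::memory_order_acquire) == num_clients) break;
                std::this_thread::yield();
            }
        }
    });

    std::vector<std::thread> clients;
    for (int id = 0; id < num_clients; ++id) {
        clients.emplace_back([&, id] {
            Mailbox& box = boxes[id];
            for (int i = 0; i < ops_per_client; ++i) {
                box.delta = 1;                                    // write the message
                box.full.store(true, std::memory_order_release);  // publish it
                while (box.full.load(std::memory_order_acquire)) {
                    std::this_thread::yield();                    // wait for the acknowledgement
                }
            }
            clients_done.fetch_add(1, std::memory_order_release);
        });
    }

    for (auto& c : clients) c.join();
    server.join();
    return counter;
}
```

Calling `run_delegation(4, 1000)` should return 4000, since each of the four clients delegates 1000 unit increments. The only shared state the clients touch is their own mailbox, which is the sense in which synchronization on the counter is replaced by communication with the server.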