This paper addresses the critical bottleneck of Non-Uniform Memory Access (NUMA) in disaggregated AI GPUs for large-scale attention workloads. It introduces 'Swizzled Head-first Mapping,' a spatially-aware scheduling strategy that aligns attention heads with GPU NUMA domains to exploit cache reuse, achieving up to 50% higher performance on AMD MI300X.
The approach significantly improves the efficiency and performance of attention on such disaggregated AI hardware, enabling faster training and inference for large models and reducing operational costs for AI deployments.
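To illustrate the idea, below is a minimal CUDA-style sketch (HIP is nearly identical) of a head-first index swizzle. The names and constants here (NUM_XCDS, head_first_swizzle, tiles_per_head) are illustrative assumptions rather than identifiers from the paper; it assumes eight XCDs, each with its own cache partition, and that workgroups are dispatched round-robin across them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NUM_XCDS = 8;  // assumed: eight XCDs (NUMA domains) on MI300X

// Map the hardware-assigned linear block id to a (head, tile) pair so that
// every block landing on one XCD processes the same attention head(s),
// keeping that head's K/V tiles resident in the XCD-local cache.
// Assumes round-robin dispatch (block i -> XCD i % NUM_XCDS) and grid sizes
// that divide evenly; a production version would handle remainders.
__host__ __device__ void head_first_swizzle(int block_id, int tiles_per_head,
                                            int* head, int* tile) {
    int xcd  = block_id % NUM_XCDS;  // XCD this block is dispatched to
    int slot = block_id / NUM_XCDS;  // n-th block landing on that XCD
    *head = (slot / tiles_per_head) * NUM_XCDS + xcd;  // heads spread across XCDs
    *tile = slot % tiles_per_head;                     // query tile within the head
}

__global__ void attention_stub(int tiles_per_head) {
    int head, tile;
    head_first_swizzle(blockIdx.x, tiles_per_head, &head, &tile);
    // A real kernel would now load query tile `tile` of `head` and stream
    // that head's K/V blocks, which stay hot in this XCD's cache.
    if (threadIdx.x == 0)
        printf("block %3d -> XCD %d : head %2d, tile %d\n",
               (int)blockIdx.x, (int)blockIdx.x % NUM_XCDS, head, tile);
}

int main() {
    const int num_heads = 16, tiles_per_head = 4;  // toy problem size
    attention_stub<<<num_heads * tiles_per_head, 64>>>(tiles_per_head);
    cudaDeviceSynchronize();
    return 0;
}
```

Under this mapping, all blocks dispatched to a given XCD work on the same attention head (or the same small set of heads), so that head's K and V tiles can be reused from the XCD-local cache instead of being refetched across the die-to-die fabric, which is the cache-reuse effect the scheduling strategy targets.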