Beating an All-GPU DSA Kernel by Offloading Top-K to AVX-512
PyTorch's CPU top-K isn't actually vectorized. An 80-line AVX-512 kernel runs up to 34x faster, and offloading the DSA selection stage to the CPU mid-pipeline beats an all-GPU fused Triton kernel by 1.2x end-to-end on an RTX PRO 6000 Blackwell paired with a Zen 5 CPU.