Flash Attention 4 Released
TG AI News · March 12, 2026, 6:59 AM
Flash Attention 4 has been released. Where FA3 targeted Hopper, this release is optimized specifically for the new Blackwell architecture (B200 and GB200); consumer RTX 5090 cards see no additional gain. For BF16 it delivers up to 1.3x speedup over cuDNN 9.13 and up to 2.7x over Triton, reaching a solid 1.6 PFLOPS (71% of the B200's theoretical peak). Recent cuDNN versions have picked up some of these optimizations as well.

Among the key tricks: software emulation of exp, conditional softmax rescaling, and, in the backward pass, the use of tensor memory and 2-CTA MMA, which significantly reduces pressure on shared memory. In addition, all kernel code is now written in Python (CuTe-DSL) rather than rigid C++ templates, making compilation 20-30x faster.

Tensor memory is a new ultra-fast on-chip buffer on Blackwell, sitting next to the tensor cores, where intermediate results can be kept, reducing trips to shared memory. The 2-CTA MMA mode lets a single matmul be computed by a pair of CTAs (thread blocks) rather than one, enabling larger tiles and sharply cutting shared-memory traffic, which makes the backward pass more efficient.

For enthusiasts, I highly recommend this YouTube analysis of how FA4 works.
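To see why conditional softmax rescaling is safe, note that in the online softmax used by Flash Attention the rescale of the running accumulator is mathematically a no-op (softmax is shift-invariant): it exists purely for numerical stability, so it can be skipped whenever the running max barely moves. Below is a minimal single-query NumPy sketch of this idea; the function name, the `tol` threshold, and the blocking scheme are illustrative assumptions, not FA4's actual API or tile logic:

```python
import numpy as np

def online_softmax_attention(q, k, v, block=64, tol=0.0):
    """Single-query attention via online softmax over key blocks.

    Rescaling of the accumulator is performed only when the block max
    exceeds the running max by more than `tol` (conditional rescaling).
    Skipping the rescale is exact math; it only affects fp stability.
    Hypothetical sketch, not FA4's real kernel structure.
    """
    d = q.shape[-1]
    m = -np.inf                                   # running max of logits
    l = 0.0                                       # running softmax denominator
    acc = np.zeros(v.shape[-1], dtype=np.float64) # output accumulator
    for start in range(0, k.shape[0], block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = kb @ q / np.sqrt(d)                   # logits for this block
        block_max = s.max()
        if block_max > m + tol:                   # rescale only when the max moves
            new_m = block_max
            scale = np.exp(m - new_m) if np.isfinite(m) else 0.0
            acc *= scale
            l *= scale
            m = new_m
        p = np.exp(s - m)                         # unnormalized probabilities
        l += p.sum()
        acc += p @ vb
    return acc / l
```

With `tol=0.0` this matches the classic online softmax; with a positive `tol`, rescales become rarer (fewer multiplies over the accumulator) at the cost of letting intermediate `p` values grow up to `exp(tol)`.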
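On the exp-emulation trick: the idea is to replace the special-function-unit exponential with a short polynomial that runs on the regular FMA pipes. A common shape for this is range reduction for `2**x`: split off the integer part, approximate `2**f` on the fractional range with a low-degree polynomial, and reattach the exponent with `ldexp`. The sketch below uses an illustrative least-squares cubic; the actual polynomial, degree, and accuracy target in FA4 are not taken from the source:

```python
import numpy as np

# Fit a cubic to 2**f on [0, 1) once, offline. The degree and the
# least-squares fit are illustrative assumptions, not FA4's coefficients.
_f = np.linspace(0.0, 1.0, 1024)
_coeffs = np.polyfit(_f, np.exp2(_f), 3)

def exp2_poly(x):
    """Approximate 2**x via range reduction plus a cubic polynomial.

    x = n + f with integer n and f in [0, 1); then
    2**x = 2**n * poly(f), where 2**n is an exact ldexp scaling.
    """
    n = np.floor(x)
    f = x - n                                  # fractional part in [0, 1)
    return np.ldexp(np.polyval(_coeffs, f), n.astype(int))
```

The exponent scaling is exact, so the error of the whole approximation is just the cubic's error on [0, 1), a fraction of a percent here; a real kernel would tune the coefficients to the precision the softmax actually needs.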