NanoGPT now speedruns in 116 seconds
TG AI News · December 27, 2025 at 8:14 PM
The last time I wrote about speedrunning was just over a year ago; back then the record had just crossed the 8-minute mark on 8xH100. Since then, a huge number of optimizations have landed in the repository, speeding up training roughly fourfold. The single largest gain came from switching to Flex Attention, which cut the time from 7 minutes to 5.
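For readers who have not used it: below is a minimal sketch of what a Flex Attention swap looks like in PyTorch. The shapes, the sliding-window size, and the mask function are my own illustrative assumptions, not the speedrun's actual settings.

```python
# Sketch: replacing standard causal attention with PyTorch FlexAttention.
# All sizes below (B, H, S, D, WINDOW) are hypothetical, chosen for illustration.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 8, 12, 1024, 64   # batch, heads, sequence length, head dim (assumed)
WINDOW = 256                   # sliding-window size (assumed)

def causal_sliding_window(b, h, q_idx, kv_idx):
    # A key is visible if it lies in the past and within the local window.
    return (kv_idx <= q_idx) & (q_idx - kv_idx < WINDOW)

# The block mask is built once; the fused kernel can then skip fully-masked blocks.
block_mask = create_block_mask(causal_sliding_window, B=None, H=None, Q_LEN=S, KV_LEN=S)

q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)

# In practice flex_attention is wrapped in torch.compile to get the fused kernel;
# the eager call here is just to show the API.
out = flex_attention(q, k, v, block_mask=block_mask)
```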
Such speedruns provide an accessible, standard baseline for testing training optimizations that anyone can reproduce. As a result, models train faster and cheaper. Not every optimization scales to larger models, of course, but this is still a marked improvement over the current state of reproducibility. It is also from these speedruns that Muon emerged, now the leading candidate to replace Adam as the standard optimizer (see the sketch below).
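Muon's core idea is easy to state: keep an SGD momentum buffer for each 2-D weight matrix, but orthogonalize it with a few Newton-Schulz iterations before applying the update. The rough sketch below uses the coefficients from the public Muon write-up; the function names and the step routine are illustrative, not the maintained implementation.

```python
# Sketch of the core Muon update: orthogonalize the momentum buffer via Newton-Schulz,
# then apply it as the step. Illustrative only; hyperparameters are assumptions.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that pushes the singular values of G toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)          # Frobenius-normalize so the spectral norm is <= 1
    if G.size(0) > G.size(1):          # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95):
    # Standard momentum accumulation, then replace the raw update with its
    # orthogonalized form before the parameter step.
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    param.data.add_(update, alpha=-lr)
```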
How long do you think it will be before such a model trains in under a minute?