Sber Updates GigaChat and Shares Interesting Engineering Details on How They Did It.
TG AI News · March 24, 2026 at 4:36 PM
In November they released a preview of Dense models; now comes the full release on a MoE architecture (MoE combined with multi-token prediction, MTP, and multi-head latent attention, MLA). There are two models: Ultra with 702B parameters (36B active) and Lightning with 10B (1.8B active). Both are under the MIT license, and both were trained from scratch without starting from foreign weights.
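The gap between total and active parameters (702B vs 36B for Ultra) comes from sparse expert routing: each token activates only a few experts out of many. A toy sketch of top-k gating, with made-up sizes and function names that are illustrative only, not GigaChat's actual router:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, gate_w, experts, k=2):
    """Route one token vector x to its top-k experts; only those experts run."""
    scores = x @ gate_w                        # one score per expert
    top = np.argsort(scores)[-k:]              # indices of the k best experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                               # softmax over the selected experts only
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

d, num_experts = 16, 8
gate_w = rng.normal(size=(d, num_experts))
# each "expert" here is just a fixed random linear map, standing in for an FFN
mats = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda x, m=m: x @ m for m in mats]

y = moe_layer(rng.normal(size=d), gate_w, experts, k=2)
print(y.shape)  # (16,) — only 2 of the 8 experts did any work for this token
```

With k=2 of 8 experts active, only a quarter of the expert weights participate per token, which is the same mechanism behind GigaChat's roughly 5% active-parameter ratio.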
The most valuable part of the release is not the weights themselves but the Habr write-up on how they got there. The transition from Dense to MoE surfaced many problems that the theory does not describe. The main pain point was generation looping: the model would repeat fragments endlessly, and standard approaches did not help. In the end they wrote their own cycle-detection metric and rebuilt the entire post-training pipeline. The DPO stage was moved to native FP8, with a surprise: quality turned out higher than in bf16, at half the memory consumption. They also found a critical bug in SGLang with dp > 1 that was silently corrupting benchmark results.
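Sber has not published its cycle-detection metric, but the core idea of such metrics is to find a short period that keeps repeating at the tail of the generated sequence. A minimal sketch, assuming token-level matching and hypothetical thresholds:

```python
def detect_cycle(tokens, max_period=32, min_repeats=3):
    """Return the shortest period p whose trailing block repeats at least
    min_repeats times at the end of `tokens`, or 0 if no loop is found."""
    for p in range(1, max_period + 1):
        if len(tokens) < p * min_repeats:
            continue  # not enough tokens to see min_repeats copies of period p
        tail = tokens[-p:]
        # check that the last min_repeats blocks of length p are all identical
        if all(tokens[-(r + 1) * p: -r * p or None] == tail
               for r in range(min_repeats)):
            return p
    return 0

print(detect_cycle([5, 7, "the", "cat", "the", "cat", "the", "cat"]))  # → 2
print(detect_cycle([1, 2, 3, 4, 5, 6]))                                # → 0, no loop
```

A production version would run this over a sliding window during decoding and feed the signal into training or sampling penalties; the exact thresholds and where the signal is used are choices the Habr post details, not something this sketch reproduces.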
By the numbers: Ultra outperforms DeepSeek-V3-0324 and Qwen3-235B in mathematics and reasoning. Lightning is comparable on benchmarks to a similarly sized Qwen model, and in arena evaluations it reaches GPT-4o level. For local deployment it is a very competitive option.