Gemma 4 12B

Gemma 4 12B accepts text, audio, image, and video input. Video is limited to 30 seconds and audio to 60 seconds. It is a reasoning model with a 256k context window and an Apache 2.0 license. The most interesting aspect of the release is how multimodality is structured. Multimodal models typically rely on a separate encoder for each non-text modality, but here the inputs are mapped into the model with simple linear projections, which need fewer parameters and less computation. Unfortunately, there is no technical report, so how they managed to train it is currently unclear. I hope the report, like the larger Gemma 4 124B, will eventually be released.
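To make the idea of a linear projection concrete, here is a minimal sketch in plain Python. It is not Gemma's actual architecture, which is undocumented without a technical report; it only illustrates the general technique of cutting an image into patches and mapping each flattened patch to a token embedding with a single weight matrix. The patch size, embedding width, and the function names `patchify` and `linear_project` are all assumptions made for illustration.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append the C channel values
            patches.append(vec)
    return patches


def linear_project(vectors, weight, bias):
    """Apply one linear map, out = W @ v + b, to every input vector v."""
    return [
        [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b
         for row, b in zip(weight, bias)]
        for v in vectors
    ]


# Hypothetical sizes, chosen only to keep the example small.
PATCH, CHANNELS, D_MODEL = 4, 3, 8
rng = random.Random(0)

in_dim = PATCH * PATCH * CHANNELS  # length of one flattened patch
weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

# A random 8 x 8 RGB "image" stands in for real pixel data.
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(8)]
         for _ in range(8)]

tokens = linear_project(patchify(image, PATCH), weight, bias)

# The whole "vision front end" here is one matrix plus one bias vector.
n_params = D_MODEL * in_dim + D_MODEL
print(len(tokens), len(tokens[0]), n_params)
```

In this sketch the entire mapping from pixels to tokens is one matrix and one bias vector, so its parameter count grows only with the patch size and embedding width. A separate encoder would instead stack many layers of its own before handing features to the language model, which is where the savings described above would come from.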
