Gemma 3 Model: Multimodal & Efficient LLM

Updated 9 November 2025
  • Gemma 3 is a third-generation lightweight, open LLM with integrated vision understanding and extended multilingual support.
  • It employs an interleaved local/global attention mechanism with RoPE rescaling, reducing KV-cache memory overhead significantly for 128K token contexts.
  • The model integrates a frozen SigLIP vision encoder with advanced distillation and quantization methods to deliver superior instruction-following and multimodal performance.

Gemma 3 is the third-generation model in the Gemma family of lightweight, open LLMs, notable for its integration of a multimodal vision-understanding module, extended multilingual coverage, and highly efficient long-context handling in model sizes spanning 1 to 27 billion parameters. Designed to operate on commodity hardware, Gemma 3 advances both architectural and training methodologies to deliver improved instruction-following, zero-shot, and vision-language performance across diverse benchmarks. The model and its quantized artifacts are publicly released, facilitating broad research and application.

1. Model Architecture and Scaling

Gemma 3 adopts a decoder-only Transformer architecture at four principal scales: 1B, 4B, 12B, and 27B parameters. Each model's parameter count breaks down into embedding parameters, non-embedding Transformer parameters, and (for models ≥4B) a vision encoder, as summarized below:

Model    Vision encoder    Embedding params    Non-embedding params
1B       0                 302M                698M
4B       417M              675M                3,209M
12B      417M              1,012M              10,759M
27B      417M              1,416M              25,600M

The 1B model supports a 32K-token context and is text-only, while the larger variants (≥4B) provide a 128K-token context and multimodal capabilities via a frozen 417M-parameter SigLIP vision encoder. The variants are further differentiated by their scaled compute and data budgets (1B: 2T training tokens; 27B: 14T).

The attention stack interleaves local and global layers at a 5:1 ratio, using local sliding-window attention (window size 1024) in most layers, with every sixth layer being global (full attention). This design sharply reduces the key-value (KV) cache memory that grows with context length at inference. For context length $T$ and model dimension $d_{\text{model}}$, the KV-cache memory requirement satisfies

$$\text{KV}_{\text{mem}} \approx \left(\frac{N_{\text{global}}}{N_{\text{local}} + N_{\text{global}}}\right) T\, d_{\text{model}},$$

yielding a roughly 6-fold reduction compared to a fully global stack (empirically reducing the 32K-context KV-cache overhead from ~60% to under 15%).
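
A minimal sketch of this memory accounting, assuming illustrative layer, head, and dimension values (not the published Gemma 3 configuration) and a simple bytes-per-cached-position formula; local layers cache at most the window size while global layers cache the full context:

```python
def kv_cache_bytes(context_len, n_layers=62, n_kv_heads=16, head_dim=128,
                   local_ratio=5, window=1024, dtype_bytes=2):
    """Rough KV-cache size for an interleaved local/global attention stack.

    Illustrative sketch: the layer count, head count, and head_dim are
    placeholder values, not the published Gemma 3 configuration.
    """
    n_global = n_layers // (local_ratio + 1)            # every sixth layer is global
    n_local = n_layers - n_global
    per_pos = 2 * n_kv_heads * head_dim * dtype_bytes   # keys + values per cached position
    global_bytes = n_global * context_len * per_pos              # full-context cache
    local_bytes = n_local * min(context_len, window) * per_pos   # capped at the window
    return global_bytes + local_bytes

# Compare against a hypothetical fully global stack at a 32K context:
full = kv_cache_bytes(32_768, local_ratio=0, window=10**9)
mixed = kv_cache_bytes(32_768)
print(f"fully global: {full / 2**30:.1f} GiB, interleaved 5:1: {mixed / 2**30:.1f} GiB")
```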

2. Long-Context and Positional Encoding

Gemma 3 extends context windows to 128K tokens (except for the 1B model), employing a two-stage method:

  • Pre-training uses 32K token sequences.
  • Global-attention layers are then "upscaled" to 128K by linearly rescaling the Rotary Positional Embedding (RoPE) base frequency from 10K to 1M, following the methodology of Chen et al. (2023).

Local layers retain the original 10K RoPE base, preserving fine-grained short-range positional behavior. The rescaling enables generalization to substantially longer contexts with only mild degradation in perplexity.
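
The sketch below illustrates the effect of the base change under the standard rotary formulation, in which each channel pair rotates at a frequency derived from the base; the dimensions and positions are illustrative rather than Gemma 3's exact implementation:

```python
import numpy as np

def rope_angles(positions, dim=128, base=10_000.0):
    """Rotation angles for rotary positional embeddings at a given base.

    Sketch only: real implementations apply these angles as pairwise rotations
    of the query/key vectors; here we only inspect the resulting frequencies.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per channel pair
    return np.outer(positions, inv_freq)              # shape: (n_positions, dim // 2)

positions = np.array([0, 1_024, 32_768, 131_072])
short = rope_angles(positions, base=10_000.0)     # base kept by local layers
long = rope_angles(positions, base=1_000_000.0)   # rescaled base for global layers
# Raising the base lowers every channel's frequency, so 128K positions map onto
# angle ranges comparable to what 32K positions produced at the original base.
```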

3. Multimodal Vision Understanding

The ≥4B Gemma 3 variants integrate a frozen SigLIP Vision Transformer (ViT) encoder. The encoder accepts 896 × 896 pixel images and outputs a patch-embedding grid that is average-pooled to 256 "soft" image tokens per image. At input, these image tokens are concatenated directly with the textual token embeddings.

To accommodate variable image aspect ratios and resolutions at inference, a Pan & Scan algorithm divides each image into tiled windows, resizes them, feeds them through the encoder, and finally truncates/pools the resulting embeddings to fit model input constraints.
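
A hedged sketch of the soft-token pooling step, assuming a square 64 × 64 patch grid (e.g. a 896-pixel input with 14-pixel patches) average-pooled 4 × 4 down to 256 tokens; the grid shape, embedding width, and pooling details are assumptions rather than the released preprocessing code:

```python
import numpy as np

def pool_to_soft_tokens(patch_embeddings, target_side=16):
    """Average-pool a square grid of ViT patch embeddings into soft image tokens.

    Assumes a square patch grid whose side is a multiple of target_side, e.g.
    64 x 64 patches pooled 4 x 4 into 16 x 16 = 256 tokens.
    """
    n_patches, dim = patch_embeddings.shape
    side = int(round(n_patches ** 0.5))       # e.g. 64
    factor = side // target_side              # e.g. 4
    grid = patch_embeddings.reshape(side, side, dim)
    pooled = grid.reshape(target_side, factor, target_side, factor, dim).mean(axis=(1, 3))
    return pooled.reshape(target_side * target_side, dim)

# Toy usage: a 64 x 64 grid of 1152-dimensional patch embeddings (width is illustrative).
soft_tokens = pool_to_soft_tokens(np.random.randn(64 * 64, 1152))
print(soft_tokens.shape)  # (256, 1152)
```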

4. Training Methodology and Post-Training

Gemma 3 pre-training combines standard next-token prediction (cross-entropy) with large-scale knowledge distillation from a larger teacher model. Key components include:

  • A multi-trillion-token training set of text and image data, with expanded multilingual coverage from monolingual and parallel corpora rebalanced via UniMax sampling.
  • Distillation from sampled teacher logits (S = 256 per token), which are renormalized before minimizing the student's cross-entropy loss (see the sketch after this list).
  • Data decontamination, instance-level quality reweighting, and in-training safety filtering.
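
The sketch below illustrates distillation against a truncated, renormalized teacher distribution; taking the teacher's top-k logits is an assumption standing in for Gemma 3's actual per-token sampling of 256 logits:

```python
import numpy as np

def sampled_distillation_loss(teacher_logits, student_logits, k=256):
    """Student cross-entropy against a teacher distribution restricted to the
    teacher's top-k logits per token and renormalized over that subset.

    Hedged sketch: the top-k restriction is an illustrative stand-in for the
    per-token logit sampling described in the text.
    """
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        keep = np.argsort(t_row)[-k:]                        # kept vocabulary entries
        t_exp = np.exp(t_row[keep] - t_row[keep].max())
        t_prob = t_exp / t_exp.sum()                         # renormalized teacher probs
        lse = s_row.max() + np.log(np.exp(s_row - s_row.max()).sum())  # logsumexp
        s_logprob = s_row[keep] - lse                        # student log-probs on the subset
        losses.append(-(t_prob * s_logprob).sum())           # per-token cross-entropy
    return float(np.mean(losses))

# Toy usage with a random 1,000-entry "vocabulary" and 4 token positions:
loss = sampled_distillation_loss(np.random.randn(4, 1_000), np.random.randn(4, 1_000))
```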

Quantization-aware training (QAT) follows, producing int4 and SFP8 checkpoints by additional fine-tuning (5K steps), allowing large models to run on-device at long contexts. For example, the 27B model requires 32.8 GB (int4+KV) for 32K context.

Instruction finetuning adopts a composite approach: initial distillation from an instruction-tuned teacher, augmented by a brief RLHF phase combining (a) weight-averaged reward models (WARM), (b) best-of-N distillation (BOND), (c) on-policy distillation, and (d) code-execution and math-specific rewards. The total loss is:

$$L = \mathbb{E}_{\text{data}}\!\left[-\sum_t \log p_\theta(x_t \mid x_{<t})\right] + \lambda_{\text{RL}}\, \mathbb{E}_\pi[R]$$
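
A compact sketch of this composite objective as a weighted sum of the supervised token loss and a plain REINFORCE-style estimate of the reward term; the weighting coefficient and the estimator are illustrative stand-ins for the WARM/BOND/on-policy-distillation recipe rather than its implementation:

```python
import numpy as np

def composite_loss(data_token_logprobs, sampled_logprobs, sampled_rewards, lambda_rl=0.1):
    """Supervised NLL on reference data plus a policy-gradient reward term.

    Sketch only: lambda_rl and the plain REINFORCE estimator are assumptions,
    not the published finetuning recipe.
    """
    nll = -np.mean(data_token_logprobs)        # E_data[-sum_t log p_theta(x_t | x_<t)]
    # Maximizing E_pi[R] corresponds to minimizing -E[R * log-prob of sampled completions].
    rl_term = -np.mean(sampled_rewards * sampled_logprobs)
    return nll + lambda_rl * rl_term
```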

Post-processing of the finetuning data filters personal information, encourages appropriate refusals, and promotes robustness and factuality.

5. Multilingualism and Benchmark Evaluation

The pre-training corpus spans dozens of languages, with increased representation of low-resource languages. Gemma 3 reuses the Gemini 2.0 SentencePiece tokenizer (262k-entry vocabulary), which is better balanced for non-English text.

Multilingual sampling frequency is managed using UniMax, enhancing representation for under-sampled languages. Evaluation on MGSM, Global-MMLU-Lite, WMT24++, FLoRes, XQuAD, ECLeKTic, and IndicGenBench shows that Gemma 3-27B achieves 76.8 F1 on XQuAD and 59.5 F1 on FLoRes-Indic (outperforming Gemma 2-27B by 2–5 points).
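
As a rough illustration of UniMax-style budgeting, the sketch below allocates a token budget uniformly across languages while capping each language at a fixed number of epochs over its corpus; the epoch cap and corpus sizes are invented for the example and are not Gemma 3's actual sampling weights:

```python
def unimax_budgets(corpus_tokens, total_budget, max_epochs=4):
    """UniMax-style allocation: give each language an equal share of the budget,
    but never more than max_epochs passes over its corpus, redistributing any
    leftover to the larger languages.
    """
    langs = sorted(corpus_tokens, key=corpus_tokens.get)  # smallest corpora first
    remaining_budget, remaining_langs = total_budget, len(langs)
    budgets = {}
    for lang in langs:
        uniform_share = remaining_budget / remaining_langs
        budgets[lang] = min(uniform_share, max_epochs * corpus_tokens[lang])
        remaining_budget -= budgets[lang]
        remaining_langs -= 1
    return budgets

# Toy corpora (token counts) and budget; the smallest language is capped at 4 epochs.
print(unimax_budgets({"en": 1_000, "fr": 200, "sw": 10}, total_budget=900))
# {'sw': 40, 'fr': 430.0, 'en': 430.0}
```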

6. Empirical Performance and Comparison

The following summarizes Gemma 3’s empirical results:

  • On commonsense and reading-comprehension benchmarks (HellaSwag, BoolQ, PIQA), Gemma 3 matches Gemma 2.
  • STEM/code benchmarks show marked improvements: GSM8K (12B) rises from 70.2% to 71.0%, MBPP from 51.2% to 60.4%.
  • Multimodal performance as indicated by COCO Caption CIDEr increases across scales: 102 (4B), 111 (12B), 116 (27B).
  • Instruction-tuned Gemma 3-4B-IT is competitive with Gemma 2-27B-IT; Gemma 3-27B-IT closely approaches Gemini-1.5-Pro.
  • The LMSYS Chatbot Arena Elo for Gemma 3-27B-IT is 1338, above Gemma 2-27B-IT (1220) and below GPT-4.5 and Grok-3.

On MMLU-Pro, Gemma 3-27B-IT yields 67.5% versus Gemini 2.0-Pro’s 79.1% and Gemma 2-27B-IT’s 56.9%. On MATH, Gemma 3-27B-IT achieves 89.0% compared to Gemini 2.0-Pro at 91.8%, and Gemma 2-27B-IT at 55.6%.

7. Scientific Use Case: Modular "Internal World" in Wildfire Prediction

Gemma 3 mid-layer sub-blocks (layers 8–9 of Gemma 3-1B) serve as pretrained, frozen modules encoding generalized representations (an "internal world"). In (Jadouli et al., 20 Apr 2025), a modular architecture for Moroccan wildfire prediction projects tabular features into the Gemma hidden space and routes them through these frozen Transformer layers before a task-specific output MLP. The approach achieves high recall (0.9433) and F1 (0.8838) with only ~5M trainable parameters, exceeding standard baselines and favoring generalization over overfitting in limited-data regimes. Notably, fine-tuning the frozen Gemma block led to instability, whereas the frozen approach delivered a +6.7% recall gain over a fully trained feed-forward network.

This suggests that fixed pretrained Transformer sub-layers can serve as reusable adapters for domain-specific tasks, exploiting broad world knowledge while minimizing the need for large annotated datasets.
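
A hedged sketch of this modular pattern, projecting tabular features into the LLM hidden space, passing them through a frozen mid-layer block, and attaching a small trainable head; the placeholder block, hidden size, and head widths are illustrative and not the authors' code:

```python
import torch
import torch.nn as nn

class FrozenLLMAdapter(nn.Module):
    """Tabular features -> LLM hidden space -> frozen mid-layer block -> task head.

    frozen_block stands in for the extracted Gemma 3-1B mid-layers (e.g.
    layers 8-9); the hidden size and head widths are illustrative.
    """
    def __init__(self, frozen_block: nn.Module, n_features: int, hidden_size: int = 1152):
        super().__init__()
        self.proj = nn.Linear(n_features, hidden_size)   # trainable input projection
        self.frozen_block = frozen_block
        for p in self.frozen_block.parameters():         # keep the pretrained block frozen
            p.requires_grad_(False)
        self.head = nn.Sequential(                       # trainable task-specific MLP
            nn.Linear(hidden_size, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, x):                                # x: (batch, n_features)
        h = self.proj(x).unsqueeze(1)                    # treat each row as a 1-token sequence
        h = self.frozen_block(h)                         # frozen "internal world" block
        return self.head(h.squeeze(1))                   # task logit (e.g. wildfire risk)

# Stand-in for the real frozen Gemma layers, only to make the sketch runnable:
dummy_block = nn.TransformerEncoderLayer(d_model=1152, nhead=8, batch_first=True)
model = FrozenLLMAdapter(dummy_block, n_features=20)
logits = model(torch.randn(4, 20))
```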

8. Key Contributions and Significance

Gemma 3 introduces several innovations:

  1. Interleaved local/global attention (5:1) enabling tractable 128K-token autoregressive context with substantial memory reduction.
  2. RoPE base-frequency rescaling that extends global-attention layers to 128K-token contexts after pre-training.
  3. Multimodal fusion via a lightweight SigLIP encoder, supporting strong VQA and captioning capabilities.
  4. Unified pretrain and post-train pipeline, leveraging distillation and composite policy-gradient RL with advanced data filtering.
  5. Expanded and balanced multilingual data leveraging the UniMax strategy.

Collectively, these advances democratize access to long-context, multilingual, and multimodal LLMs in a form factor operable on consumer hardware and support diverse applications ranging from instruction following to scientific predictive modeling.
