Gemma-3-4B: Efficient Multimodal Transformer
- Gemma 3-4B is a 4.3B-parameter multimodal transformer designed for efficient vision-language reasoning and extended context processing.
- It employs innovative interleaved local and global attention mechanisms to reduce memory usage while supporting up to 128K-token contexts.
- The model integrates advanced techniques such as distillation, quantization-aware training, and RLHF, powering state-of-the-art medical vision-language applications via MedGemma 4B.
Gemma 3-4B is a 4.3 billion-parameter multimodal LLM within the Gemma 3 family of lightweight open models, designed to provide strong vision–language reasoning, extended context, and expanded multilingual competency in a computationally efficient transformer architecture. It serves as the core backbone for the MedGemma 4B model, which demonstrates state-of-the-art medical vision-language task performance relative to its scale. Gemma 3-4B incorporates innovations in attention structure, distillation pathways, quantization support, and reinforcement learning–driven instruction tuning to achieve performance competitive with much larger models on mathematics, multilingual, and multimodal benchmarks (Team et al., 25 Mar 2025, Sellergren et al., 7 Jul 2025).
1. Architecture and Attention Mechanisms
Gemma 3-4B employs a standard decoder-only transformer backbone, integrating several architectural enhancements for efficiency and scalability:
- Parameterization: 4,301M total parameters, partitioned into 417M vision encoder (frozen SigLIP ViT), 675M embedding, and 3,209M non-embedding parameters (Team et al., 25 Mar 2025).
- Attention Layout: Utilizes interleaved “local” and “global” attention. Local sliding-window attention layers alternate with global full-context layers in a 5:1 ratio. Each local layer attends only to a 1,024-token sliding window, significantly reducing KV-cache memory growth.
- Grouped-Query Attention and Norms: Adopts Grouped-Query Attention (GQA), replaces soft-capping of attention logits with QK-norm, and uses RMSNorm in both pre-norm and post-norm positions for stabilization.
- KV-Cache Efficiency: Total key/value memory scales as $M_{\text{KV}} \propto n_{\text{global}} \cdot C + n_{\text{local}} \cdot \min(C, W)$, where $C$ is the context length and $W$ the local window size. With the 5:1 interleaving, only one layer in six caches the full context, yielding <15% KV-cache overhead at 32,768-token contexts, compared to roughly 60% in all-global attention models. This structure allows context extension to 128K tokens with manageable memory usage (Team et al., 25 Mar 2025); a numerical sketch follows this list.
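The relationship above can be made concrete with a short back-of-envelope script. The layer count, KV-head configuration, and 1,024-token window below are taken from the publicly released Gemma 3 4B configuration and should be read as illustrative assumptions, not as the report's exact accounting.

```python
# Back-of-envelope KV-cache comparison: interleaved local/global attention vs.
# an all-global layout. Figures (34 layers, 4 KV heads, head dim 256, 1,024-token
# window) follow the public Gemma 3 4B config and are illustrative only.

def kv_cache_bytes(context_len: int,
                   n_layers: int = 34,
                   n_kv_heads: int = 4,
                   head_dim: int = 256,
                   bytes_per_elem: int = 2,      # bf16 keys/values
                   local_per_global: int = 5,    # 5 local layers per global layer
                   window: int = 1024,
                   interleaved: bool = True) -> int:
    """Approximate KV-cache size in bytes for a single sequence."""
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem   # keys + values
    if not interleaved:
        return n_layers * context_len * per_token
    n_global = n_layers // (local_per_global + 1)
    n_local = n_layers - n_global
    cached = n_global * context_len + n_local * min(context_len, window)
    return cached * per_token

for ctx in (8_192, 32_768, 131_072):
    inter = kv_cache_bytes(ctx)
    full = kv_cache_bytes(ctx, interleaved=False)
    print(f"{ctx:>7} tokens: interleaved {inter / 2**30:5.2f} GiB "
          f"vs. all-global {full / 2**30:5.2f} GiB ({inter / full:.0%})")
```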
2. Contextual Capacity and Positional Encoding
Gemma 3-4B supports a 128K-token input window in its released pretrained and instruction-tuned checkpoints:
- RoPE Implementation: Pretraining uses a RoPE base frequency of 10K over 32K-token sequences. For long-context deployment, the base frequency on global layers is raised to 1M and positions are rescaled with an interpolation factor of 8, following “Extending context window of LLMs via Positional Interpolation” (Chen et al., 2023); see the sketch after this list.
- Local Layer Stability: Local attention layers retain the 10K RoPE base, preserving stable short-range positional behavior.
- Empirical Robustness: Perplexity remains competitive up to 128K tokens, with modest degradation; accuracy declines beyond this threshold.
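A minimal sketch of the positional-encoding arrangement described above, assuming a 256-dimensional attention head; the function and constants are illustrative. It contrasts the local-layer setting (10K base, no rescaling) with the global-layer setting (1M base, interpolation factor 8).

```python
import numpy as np

def rope_angles(position: int, head_dim: int = 256, base: float = 10_000.0,
                interpolation_factor: float = 1.0) -> np.ndarray:
    """Per-dimension RoPE rotation angles for a single token position."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return (position / interpolation_factor) * inv_freq

pos = 100_000  # a position well beyond the 32K pretraining length

local_angles = rope_angles(pos, base=10_000.0)                                 # local layers
global_angles = rope_angles(pos, base=1_000_000.0, interpolation_factor=8.0)   # global layers

# The slowest-rotating dimensions show how the larger base plus interpolation
# keeps global-layer angles far smaller than a naive 10K-base extrapolation.
print("local  slowest angle :", local_angles[-1])
print("global slowest angle :", global_angles[-1])
```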
3. Training Methodology
The training and post-training pipeline combines a scaled-up data mixture, distillation from a larger teacher, and reward-driven fine-tuning:
- Pretraining and Distillation: Trained on 4T tokens of mixed text and image data, with a deliberately heightened share of multilingual data via UniMax sampling. Teacher–student distillation samples 256 teacher logits per token, renormalizes the teacher distribution over that subset, and applies a cross-entropy loss against it (a sketch follows this list).
- Quantization-Aware Training (QAT): A short fine-tune of the final checkpoint (5,000 steps) prepares weights for per-channel int4, per-block int4, and switched-fp8 quantized deployment while matching the bfloat16 checkpoint's output distributions.
- Instruction and RLHF Fine-Tuning: Successive “improved” distillation (from a large instruction-tuned teacher) and reinforcement learning (BOND, WARM, WARP), with reward signals combining human feedback, code execution, and mathematical ground-truth objectives.
- Data and Output Filtering: Aggressive filtration eliminates toxic/personal/self-referential content. Sequences are explicitly marked with [BOS] and <end_of_turn> tokens.
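The distillation step above can be illustrated with a short numpy sketch. The vocabulary size, the top-k selection (a stand-in for the actual sampling rule), and all names are assumptions for illustration; the report states only that 256 logits are sampled per token and renormalized before the cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, k = 262_144, 256              # assumed vocabulary size; 256 sampled logits

teacher_logits = rng.normal(size=vocab)   # placeholder teacher outputs for one token
student_logits = rng.normal(size=vocab)   # placeholder student outputs for one token

# Select a 256-logit subset (top-k of the teacher here, as a stand-in for the
# pipeline's actual sampling rule).
subset = np.argpartition(teacher_logits, -k)[-k:]

def renormalized_probs(logits: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Softmax restricted to the sampled subset, i.e. renormalized over it."""
    z = logits[idx] - logits[idx].max()
    p = np.exp(z)
    return p / p.sum()

p_teacher = renormalized_probs(teacher_logits, subset)
p_student = renormalized_probs(student_logits, subset)

# Cross-entropy of the student against the renormalized teacher distribution.
distill_loss = -np.sum(p_teacher * np.log(p_student + 1e-12))
print(f"per-token distillation loss over {k} sampled logits: {distill_loss:.3f}")
```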
4. Performance on Benchmarks
Gemma 3-4B-IT achieves competitive results on mathematics, reasoning, factuality, and vision-language benchmarks. Selected zero/few-shot results:
| Benchmark | Gemma 3-4B-IT | CoT / Shots | Reference |
|---|---|---|---|
| MATH (acc.) | 75.6% | Yes (4-shot) | (Team et al., 25 Mar 2025) |
| GSM8K (acc.) | 89.2% | Yes (8-shot) | (Team et al., 25 Mar 2025) |
| MMLU-Pro (acc.) | 43.6% | 0-shot | (Team et al., 25 Mar 2025) |
| FACTS Grounding | 70.1% | n/a | (Team et al., 25 Mar 2025) |
| MMMU (val) | 48.8% | n/a | (Team et al., 25 Mar 2025) |
| Global MMLU-Lite | 54.5% | n/a | (Team et al., 25 Mar 2025) |
- On mathematics (MATH, GSM8K), Gemma 3-4B-IT outperforms previous 27B-parameter Gemma 2 models (e.g., MATH: 75.6% vs 49.4%).
- For factual QA and vision-language grounding, performance is robust, though trailing larger models on some knowledge benchmarks.
- Multilingual coverage exceeds 100 languages; for example, 68.0 F1 on the XQuAD Indic subset of IndicGenBench.
5. Multimodality and Language Coverage
Gemma 3-4B is natively multimodal via a frozen 400M-parameter SigLIP ViT (896×896 input), which outputs a sequence of 256 “soft tokens” per image. Non-square or high-resolution images are handled with Pan-and-Scan (P&S) cropping, which increases VQA and DocVQA accuracy. Fusion is achieved by interleaving the visual soft tokens into the text token stream (a schematic sketch follows the list below). Supported benchmarks include COCO Caption, DocVQA, InfoVQA, MMMU, and others.
- MedGemma 4B extends the base model with a MedSigLIP encoder (SigLIP fine-tuned on 33M medical image-caption pairs), enhancing medical vision-language capabilities (Sellergren et al., 7 Jul 2025).
- The MedGemma 4B pipeline interleaves MedSigLIP-generated image token embeddings and processes them jointly with text using Gemma’s transformer, supporting tasks such as chest X-ray classification, medical VQA (e.g., SLAKE, VQA-RAD), and radiology report generation.
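The fusion step referenced above can be sketched schematically as follows. The hidden size, the single-placeholder handling, and the helper names are assumptions for illustration, and the stand-in encoders simply return random embeddings of the correct shape rather than calling the actual SigLIP or Gemma weights.

```python
import numpy as np

HIDDEN = 2560          # assumed LM hidden size for the 4B model (illustrative)
N_SOFT_TOKENS = 256    # soft tokens emitted per image

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for the frozen SigLIP encoder plus projection: (256, HIDDEN)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(N_SOFT_TOKENS, HIDDEN)).astype(np.float32)

def embed_text(token_ids: list[int]) -> np.ndarray:
    """Stand-in for the LM's token-embedding lookup: (len(ids), HIDDEN)."""
    rng = np.random.default_rng(1)
    return rng.normal(size=(len(token_ids), HIDDEN)).astype(np.float32)

def fuse(token_ids: list[int], image: np.ndarray, image_token_id: int) -> np.ndarray:
    """Splice the image's soft-token embeddings in place of the <image> placeholder."""
    text_emb = embed_text(token_ids)
    pos = token_ids.index(image_token_id)
    soft_tokens = encode_image(image)
    return np.concatenate([text_emb[:pos], soft_tokens, text_emb[pos + 1:]], axis=0)

IMAGE_TOKEN = 999                        # hypothetical placeholder id
prompt = [2, 10, 11, IMAGE_TOKEN, 12]    # [BOS] ... <image> ...
fused = fuse(prompt, np.zeros((896, 896, 3)), IMAGE_TOKEN)
print(fused.shape)                       # (4 + 256, 2560): text tokens + soft tokens
```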
6. Practical Usage, Scalability, and Known Limitations
- Hardware/Deployment: FP16 weights require ~8GB and the KV cache 12.7GB at 32K context; int4 quantization further reduces RAM needs to ~2.6GB (weights) + 7.3GB (KV cache). A single 24GB GPU suffices for full-context inference with aggressive quantization (see the arithmetic sketch after this list).
- Efficiency: Because only the global layers cache the full context, KV-cache growth with context length is greatly reduced, enabling Gemma 3-4B to serve 128K-token contexts without prohibitive memory growth.
- Limitations: Model performance and perplexity deteriorate for contexts longer than 128K tokens. Despite RLHF and filtering, hallucinations and exposure to sensitive content remain possible. Vision Pan-and-Scan yields increased inference latency. Memorization audits reveal decreased, but non-zero, approximate memorization versus earlier Gemma generations (Team et al., 25 Mar 2025).
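The quoted weight footprints follow from simple arithmetic on the parameter count, as in the sketch below; the gap between the raw 4-bit figure (~2.0 GiB) and the reported ~2.6 GB presumably reflects quantization scales and components kept at higher precision, which is an assumption rather than a statement from the report.

```python
# Back-of-envelope weight memory for 4.3B parameters at bf16 (2 bytes/param)
# vs. int4 (0.5 bytes/param). Ignores quantization scales and any tensors kept
# at higher precision, so the int4 estimate undershoots the reported ~2.6 GB.

GIB = 2**30
N_PARAMS = 4.301e9

for name, bytes_per_param in [("bf16", 2.0), ("int4", 0.5)]:
    print(f"{name}: ~{N_PARAMS * bytes_per_param / GIB:.1f} GiB of weights")
```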
7. Role in Medical Vision-Language Modeling
As the backbone of MedGemma 4B, Gemma 3-4B demonstrates strong out-of-distribution and zero-shot medical performance with 500× less compute than Gemini/o3-class models:
| Benchmark/Task | Gemma 3 4B | MedGemma 4B | Gemini v2.5 Pro | SOTA VLM |
|---|---|---|---|---|
| MedQA (text QA, 0-shot) | 50.7% | 64.4% | n/a | n/a |
| MIMIC-CXR F1 (5 cond.) | 81.2 | 88.9 | 85.8 | 90.7 |
| SLAKE F1 (Med VQA) | 40.2 | 72.3 | 53.1 | 55.5 |
| CXR Report (RadGraph F1) | — | 29.5 | — | 29.5 |
While trade-offs remain against much larger models on broad text QA and out-of-distribution generalization, Gemma 3-4B serves as a “sweet spot” for resource-constrained, on-device, or cost-sensitive deployments, with competitive results in both general and specialized multimodal tasks (Sellergren et al., 7 Jul 2025).