Gemma 2B: Open 2B Language Models
- Gemma 2B is an open, state-of-the-art 2-billion parameter language model family utilizing a decoder-only Transformer with rotary embeddings and GeGLU activation.
- It spans variants including CodeGemma 2B for code completion and infilling, encoder-decoder adaptations, and multimodal implementations, all optimized for efficiency.
- The model leverages knowledge distillation from a 7B teacher, grouped-query attention, and interleaved local-global attention to achieve competitive performance at reduced latency.
Gemma 2B is a family of open, state-of-the-art LLMs occupying the 2 billion parameter regime, developed as part of the Gemma and Gemma 2 research lines. The architecture underpins a range of compact transformer-based models, including the primary decoder-only LLM, specialized code models such as CodeGemma 2B, and variants for multimodal, in-context learning, and interpretability research. Gemma 2B and its derivatives are designed for high efficiency, fast inference, and broad academic and industrial applicability, with open weights and comprehensive interpretability tooling.
1. Architectural Foundations and Model Variants
Gemma 2B is fundamentally a decoder-only Transformer but integrates several architectural innovations aimed at maximizing performance at practical model sizes (Team et al., 2024). The prevailing configuration for Gemma 2B in Gemma 2 is:
- Model dimension: 2304
- Number of layers: 26
- Feed-forward hidden dimension: 9216
- Attention heads: 8 query heads, 4 key-value heads
- Head dimension: 256
- Context length: up to 8192 tokens
- Positional encoding: Rotary embeddings (RoPE)
- Normalization: RMSNorm before and after each sublayer
- Feed-forward activation: GeGLU
The non-embedding parameter count is approximately 2 billion; with the 256k-vocabulary embedding tables included, the total rises to roughly 2.6 billion.
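For intuition, a back-of-the-envelope parameter tally from the hyperparameters above (a rough sketch assuming tied embeddings and no bias terms, not the official accounting):

```python
# Rough parameter count from the configuration listed above
# (weights only, no biases; illustrative, not the official tally).
d_model, n_layers, d_ffn = 2304, 26, 9216
n_q, n_kv, d_head = 8, 4, 256
vocab = 256_000

attn = 2 * d_model * (n_q * d_head)     # W_q and W_o projections
attn += 2 * d_model * (n_kv * d_head)   # W_k and W_v (grouped-query: fewer KV heads)
ffn = 3 * d_model * d_ffn               # GeGLU: gate + up projections, then down
per_layer = attn + ffn
embeddings = vocab * d_model            # tied input/output embedding table

total = n_layers * per_layer + embeddings
print(f"non-embedding ≈ {n_layers * per_layer / 1e9:.2f}B, "
      f"total ≈ {total / 1e9:.2f}B")    # ≈ 2.02B non-embedding, ≈ 2.61B total
```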
Five primary variants and applications derived from the Gemma 2B backbone appear in the literature:
- Gemma 2B decoder-only: Core language modeling and reasoning (Team et al., 2024).
- CodeGemma 2B: Specialized code completion and infilling model (Team et al., 2024).
- Encoder-Decoder adaptation: Balanced (2B-2B) and unbalanced (e.g., 9B-2B) encoder-decoder hybrids optimized for quality–efficiency tradeoff (Zhang et al., 8 Apr 2025).
- Gemma Scope: Suite of JumpReLU sparse autoencoders trained on all layers for interpretability (Lieberum et al., 2024).
- Other specialized studies: Include analysis of symbolic hallucination (Lamba et al., 9 Sep 2025), in-context learning circuits (Bakalova et al., 31 Mar 2025), and multimodal adaptation (e.g., LLaVA-Gemma-2B (Hinck et al., 2024)).
2. Training Methodologies and Objectives
Core LLM (Gemma 2B)
The principal Gemma 2B model in Gemma 2 is trained via knowledge distillation from a 7B teacher, rather than standard next-token prediction. The distillation loss is

$$\mathcal{L} = -\sum_{x} P_T(x \mid c)\, \log P_S(x \mid c),$$

where $P_T(\cdot \mid c)$ is the teacher's output distribution over the next token given context $c$, and $P_S$ is the student's (Gemma 2B's) prediction (Team et al., 2024). This approach enriches the training signal and leads to substantial performance gains over next-token cross-entropy, especially for compact models.
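A minimal PyTorch sketch of this token-level distillation objective (illustrative; the function and argument names are ours, not from the Gemma training stack):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the student against the teacher's full next-token
    distribution, averaged over batch and sequence positions.
    Both logits tensors: (batch, seq_len, vocab_size)."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # -sum_x P_T(x|c) log P_S(x|c) at each position, then the mean
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```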
Data: 2 trillion tokens (web, code, scientific text), tokenized with a 256k-entry SentencePiece vocabulary; the corpus is filtered for unsafe content and decontaminated against academic evaluation sets.
Compute: 512 TPUv5e chips, 450k steps with a batch size of 512 contexts.
CodeGemma 2B
Designed exclusively for code completion, CodeGemma 2B leverages aggressive fill-in-the-middle (FIM) objectives and pure-code pretraining. Two versions exist: v1.0 (500 B tokens, 80% FIM rate) and v1.1 (1 T tokens, 90% FIM rate) (Team et al., 2024). The dataset is strictly filtered and de-duplicated for code, with multi-file context packing.
Instruction tuning: Not applied to CodeGemma 2B; only the 7B variant is supervised-finetuned.
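As an illustration of the FIM setup, the sketch below formats a prefix-suffix-middle (PSM) completion request with the sentinel tokens documented for CodeGemma (the helper function is ours):

```python
def make_fim_prompt(prefix: str, suffix: str) -> str:
    """Format a fill-in-the-middle request in prefix-suffix-middle (PSM)
    order with CodeGemma's sentinel tokens; the model is expected to
    generate the missing middle after <|fim_middle|>."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Example: request the body of a function given its signature and return line.
prompt = make_fim_prompt(
    prefix="def mean(xs):\n    ",
    suffix="\n    return total / len(xs)\n",
)
```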
Encoder-Decoder Adaptation
This adaptation converts the 2B decoder into a dual-tower encoder-decoder architecture (yielding 4B parameters) by cloning and bidirectionalizing the original layers and introducing cross-attention. Key objectives:
- PrefixLM: Causal generation conditioned on a prefix, optionally with KL distillation from the base decoder (see the mask sketch after this list).
- UL2: Mixed denoising tasks for richer bidirectional representations.
Adaptation enables substantially improved instruction-tuned performance and bidirectional contextualization with no increase in FLOPs over the decoder-only baseline (Zhang et al., 8 Apr 2025).
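For concreteness, a minimal sketch of the PrefixLM attention pattern (assuming a single combined sequence; in the adapted encoder-decoder the prefix instead lives in the encoder):

```python
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Attention mask for a PrefixLM objective: prefix positions attend
    bidirectionally among themselves, later positions attend causally.
    Entry [i, j] is True when position i may attend to position j."""
    mask = torch.ones(seq_len, seq_len).tril().bool()  # causal baseline
    mask[:prefix_len, :prefix_len] = True              # bidirectional prefix
    return mask
```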
3. Benchmark Performance and Empirical Trade-offs
General NLP and Reasoning
Gemma 2B exhibits a strong size-adjusted performance profile. On tasks such as MMLU, ARC-C, GSM8K, HellaSwag, BoolQ, and code benchmarks (HumanEval, MBPP), it narrows or closes the gap with models 2–3× larger (e.g., Mistral 7B, LLaMA-3 8B) (Team et al., 2024):
| Task | Metric | Gemma 2B | Mistral 7B | LLaMA-3 8B |
|---|---|---|---|---|
| MMLU | 5-shot | 52.2 | 62.5 | 66.6 |
| GSM8K | 5-shot | 24.3 | 39.6 | 45.7 |
| HumanEval | pass@1 | 20.1 | 26.2 | — |
| MBPP | pass@1 | 29.2 | 40.2 | — |
CodeGemma 2B attains 78–79% pass@1 on single-line HumanEval infilling, matching or surpassing open 2B-class models in accuracy while delivering 1.5–2× faster inference for latency-sensitive code completion (Team et al., 2024).
Encoder-Decoder Improvement
Instruction-tuned encoder-decoder Gemma 2B-2B models yield an absolute +7.4-point gain in average instruction-tuned benchmark scores over the decoder-only baseline, with SuperGLUE scores after RLHF of up to 90.5% (UL2) vs. 88.3% (PrefixLM) (Zhang et al., 8 Apr 2025).
4. Efficiency, Latency, and Deployment
Gemma 2B’s architectural focus is inference efficiency:
- Grouped-Query Attention (GQA) reduces the number of KV heads, cutting KV-cache memory and improving latency by 15–25% versus standard multi-head attention (see the sketch after this list) (Team et al., 2024).
- Interleaved local-global attention optimally balances long and short-range dependency modeling while minimizing overhead (Team et al., 2024).
- Sliding-window context reduction (4096→2048 tokens) increases decoding speed with negligible effect on perplexity (Team et al., 2024).
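The sketch below shows the core GQA computation referenced above (causal masking omitted for brevity; shapes and names are illustrative):

```python
import torch

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor,
                            v: torch.Tensor) -> torch.Tensor:
    """n_q query heads share a smaller set of n_kv key/value heads,
    shrinking the KV cache by n_q / n_kv (8 / 4 = 2x for Gemma 2B).
    q: (batch, n_q, seq, d_head); k, v: (batch, n_kv, seq, d_head)."""
    n_q, n_kv = q.shape[1], k.shape[1]
    assert n_q % n_kv == 0, "query heads must split evenly into KV groups"
    k = k.repeat_interleave(n_q // n_kv, dim=1)  # broadcast each KV head
    v = v.repeat_interleave(n_q // n_kv, dim=1)  # across its query group
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```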
CodeGemma 2B is specifically engineered for sub-second, low-latency IDE plugin and edge-deployment use cases by pretraining exclusively on code and maximizing FIM (Team et al., 2024).
In real-world benchmarks, CodeGemma 2B completes a 128-token single-line infilling request in ≈0.24 s on average, 1.5–2× faster than competitive models of similar size.
Encoder-decoder adaptation matches decoder-only latency (FLOPs parity) while delivering improved quality, making it especially useful for on-device and input-heavy applications (Zhang et al., 8 Apr 2025).
5. Hallucination and Symbolic Processing Limitations
Empirical analysis reveals a pronounced vulnerability of Gemma-2-2B to symbolic triggers. Across HaluEval and TruthfulQA, overall hallucination rates average 79.0%, with sub-categories such as modifiers and named entities the most error-prone (up to 94.98% per property) (Lamba et al., 9 Sep 2025). While scaling to 9B and 27B models reduces overall hallucination (an improvement of roughly 15 points), even the largest Gemma 27B model displays a symbolic hallucination rate above 63%.
Mechanistically, these errors are attributed to insufficient self-attention focus on symbolic tokens in mid-to-late transformer layers, producing representational instability. The structural nature of this weakness suggests that neither increased scale nor Gemma’s specific local/global attention or distillation strategies fully resolve symbolic confabulation.
Proposed mitigations include mechanistic-interpretability-guided analysis, explicit symbolic grounding during training, and architecture-level interventions (Lamba et al., 9 Sep 2025).
6. Interpretability and Mechanistic Analyses
In-Context Learning Circuits
Mechanistic investigation into Gemma-2 2B uncovers a two-stage in-context learning process: initial contextualization of few-shot examples in lower layers (information exchange among inputs/outputs), then aggregation of these representations in upper layers via function-vector heads (Bakalova et al., 31 Mar 2025). This contextualize-then-aggregate strategy varies in importance by task ambiguity, with robust mechanisms for resolving ambiguous demonstration sets.
Sparse Autoencoder Decomposition
Gemma Scope provides comprehensive JumpReLU sparse autoencoders (SAEs) for all layers and sublayers of Gemma 2 2B, exposing interpretable feature circuits (Lieberum et al., 2024). With more than 400 SAEs published, the resource facilitates large-scale causal and interpretability analyses. Attached SAEs are high-fidelity (Δ LM loss < 0.02 nats/token for attention and MLP SAEs with ~50 active latents at layer 12) and port to instruction-tuned variants with minimal loss. The large dictionary widths (>1M latents) reveal fine-grained "feature splitting," supporting fine mechanistic dissection.
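The core forward pass of such an SAE is compact; a minimal sketch (parameter names are ours, not Gemma Scope's released format):

```python
import torch

def jumprelu(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """JumpReLU: keep pre-activations above a learned per-latent threshold
    theta unchanged; zero everything at or below it."""
    return x * (x > theta)

def sae_forward(a, W_enc, b_enc, W_dec, b_dec, theta):
    """Encode residual-stream activations `a` into sparse features and
    reconstruct them. a: (batch, d_model); W_enc: (d_model, d_sae);
    W_dec: (d_sae, d_model); theta: (d_sae,)."""
    f = jumprelu(a @ W_enc + b_enc, theta)  # sparse feature activations
    a_hat = f @ W_dec + b_dec               # reconstruction of `a`
    return f, a_hat
```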
7. Extensions: Multimodality and Alternative Architectures
Multimodal: LLaVA-Gemma-2B
Within the LLaVA framework, Gemma-2B acts as a compact language backbone for image-text multimodal models, incorporating a connector MLP for patch-embedding alignment. Although fast and cost-effective, LLaVA-Gemma-2B does not exceed contemporaneous SOTA 2B multimodal models, with performance strongly contingent on connector pretraining and vision backbone co-tuning (Hinck et al., 2024).
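A minimal sketch of such a connector (dimensions are illustrative, not the released LLaVA-Gemma configuration):

```python
import torch.nn as nn

class VisionConnector(nn.Module):
    """Two-layer MLP projecting vision-encoder patch embeddings into the
    language model's token-embedding space, so image patches can be
    consumed as ordinary input tokens."""
    def __init__(self, d_vision: int = 1024, d_lm: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_lm),
            nn.GELU(),
            nn.Linear(d_lm, d_lm),
        )

    def forward(self, patch_embeds):        # (batch, n_patches, d_vision)
        return self.proj(patch_embeds)      # (batch, n_patches, d_lm)
```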
Non-Transformer: RecurrentGemma-2B
RecurrentGemma-2B leverages the alternative Griffin architecture, combining linear recurrences (RG-LRU) with local attention to yield a constant-size internal state and eliminate the transformer's unbounded KV cache (Botev et al., 2024). On downstream benchmarks it matches Gemma-2B despite being trained on fewer tokens (2T vs. Gemma-2B's 3T), and achieves 2–3× higher sampling throughput on long-context tasks.
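A toy version of the fixed-state recurrence underlying this property (heavily simplified relative to the actual RG-LRU, which adds learned input and recurrence gates):

```python
import torch

def gated_linear_recurrence(x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Diagonal linear recurrence h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * x_t.
    The state h is a single (d,)-vector whatever the sequence length,
    unlike a KV cache that grows with every generated token.
    x, a: (seq_len, d), with 0 < a < 1 elementwise."""
    h = torch.zeros(x.shape[-1], dtype=x.dtype)
    outputs = []
    for x_t, a_t in zip(x, a):
        h = a_t * h + torch.sqrt(1 - a_t ** 2) * x_t
        outputs.append(h)
    return torch.stack(outputs)
```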
References
- CodeGemma: (Team et al., 2024)
- Gemma 2: (Team et al., 2024)
- Symbolic Hallucination: (Lamba et al., 9 Sep 2025)
- In-Context Learning: (Bakalova et al., 31 Mar 2025)
- Encoder-Decoder Adaptation: (Zhang et al., 8 Apr 2025)
- Sparse Autoencoders: (Lieberum et al., 2024)
- Multimodal (LLaVA-Gemma): (Hinck et al., 2024)
- RecurrentGemma: (Botev et al., 2024)