Gemma-2-2B: 2.6B-Parameter Open LLM

Updated 3 July 2026

Gemma-2-2B is a 2.6B-parameter decoder-only transformer designed to balance memory footprint, speed, accuracy, and extensibility.
It combines knowledge distillation from a larger teacher with a refined transformer architecture to reach best-in-class results among open 2B-scale models on language and reasoning benchmarks.
Derivative models target code completion, text-to-SQL, and multimodal tasks, and the model has been studied in detail for its in-context learning mechanisms and safety behaviors.

Gemma-2-2B is the 2.6-billion-parameter variant in the Gemma 2 family of open LLMs, introduced and maintained primarily by DeepMind and Google Research. The model is engineered for performance density and accessibility, trading off memory, speed, accuracy, and architectural extensibility at the 2B parameter scale. It is widely used in foundational research and as the base for derivative models spanning code generation, text-to-SQL, safety auditing, and interpretability. Its defining features are an optimized transformer architecture (grouped-query attention, interleaved local and global attention), pretraining via knowledge distillation, and attention to practical deployment constraints such as memory and latency. Its behavior with respect to hallucination, in-context learning, code generation, multimodal extension, and safety has been examined through causal, mechanistic, and benchmark-driven analyses.

1. Architecture and Training Regimen

Gemma-2-2B is a decoder-only transformer with 26 layers, model dimension $d_\text{model} = 2304$, feed-forward width $d_{ff} = 18432$, 8 attention heads with grouped-query attention (GQA, 4 key-value heads), and a context length of up to 8192 tokens. RMSNorm and GeGLU gating are used throughout; rotary positional embeddings (RoPE) provide relative position encoding. Layers alternate between local sliding-window attention (window 4096) and global full-span attention, and logit soft-capping is applied for numerical stability. With GQA, each key-value head is shared by a group of two query heads, which reduces memory and compute relative to classic multi-head attention (Team et al., 2024).
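The soft-capping and alternating attention pattern described above can be illustrated with a minimal sketch. It is not the reference implementation: the cap value of 50, the choice that even-indexed layers are the local ones, and the convention that the window includes the query token itself are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

ATTN_SOFTCAP = 50.0  # assumed cap value; not stated in the text above
WINDOW = 4096        # sliding-window size from the text


def soft_cap(logit: float, cap: float = ATTN_SOFTCAP) -> float:
    """Squash a logit smoothly into [-cap, cap] using tanh."""
    return cap * math.tanh(logit / cap)


def is_local_layer(layer_idx: int) -> bool:
    """Assumption: even-indexed layers are local, odd-indexed layers are global."""
    return layer_idx % 2 == 0


def may_attend(query_pos: int, key_pos: int, layer_idx: int,
               window: int = WINDOW) -> bool:
    """Causal mask, further limited to the last `window` tokens in local layers."""
    if key_pos > query_pos:
        return False  # never attend to future tokens
    if is_local_layer(layer_idx):
        return query_pos - key_pos < window
    return True
```

Small logits pass through soft_cap almost unchanged, while very large ones saturate near the cap, which is what bounds the attention scores. In a local layer a token at position 5000 cannot see position 0, whereas in a global layer it can.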

A distinguishing feature of Gemma-2-2B is its reliance on knowledge distillation during pretraining. Rather than training on one-hot next-token targets, it minimizes the cross-entropy between a larger teacher's token distribution $P_T$ and the student's distribution $P_S$: $\mathcal{L}_{KD} = -\sum_{t=1}^{T} \sum_{x \in V} P_T(x \mid x_{<t}) \log P_S(x \mid x_{<t}; \theta),$ where the outer sum runs over the $T$ sequence positions and the inner sum over the vocabulary $V$. Training spans 2 trillion tokens, with filtering to exclude unsafe, private, or evaluation-set data, and uses a SentencePiece vocabulary of 256,128 tokens (Team et al., 2024).
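As a rough illustration of the distillation objective, the sketch below computes the cross-entropy between teacher and student distributions over a toy three-token vocabulary. The probability values are made up for the example, and the function name kd_loss is hypothetical; a real training loop would work on logits in a tensor library.

```python
import math
from typing import Sequence


def kd_loss(teacher_probs: Sequence[Sequence[float]],
            student_probs: Sequence[Sequence[float]]) -> float:
    """Sum over positions of the cross-entropy H(P_T, P_S) over the vocabulary."""
    loss = 0.0
    for p_teacher, p_student in zip(teacher_probs, student_probs):
        for pt, ps in zip(p_teacher, p_student):
            if pt > 0.0:  # terms with zero teacher mass contribute nothing
                loss -= pt * math.log(ps)  # assumes ps > 0 for this toy case
    return loss


# Toy example: 2 positions, vocabulary of 3 tokens (illustrative numbers only).
teacher = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
student_close = [[0.6, 0.3, 0.1], [0.2, 0.1, 0.7]]
student_far = [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]

print(kd_loss(teacher, student_close) < kd_loss(teacher, student_far))
```

The loss is smallest when the student reproduces the teacher exactly, in which case it equals the teacher's entropy, and it grows as the two distributions diverge.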

Key architectural summary:

Property	Value
Layers	26
Hidden Size	2304
FFN Inner Dim	18432
Attn Heads (Query/KV)	8 / 4 (GQA, 2 query heads per KV head)
Context Length	8192
Parameters (total)	2.61 B
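The total in the table can be cross-checked with a back-of-the-envelope count. This sketch rests on several assumptions not stated in the table: a head dimension of 256, tied input and output embeddings, a feed-forward width that counts the gate and up branches together (so each matrix is 2304 x 9216), and four RMSNorm scale vectors per layer plus a final one. The function names are hypothetical.

```python
VOCAB = 256_128   # vocabulary size from the text
D_MODEL = 2304
N_LAYERS = 26
N_Q_HEADS, N_KV_HEADS = 8, 4
D_FF = 18_432
HEAD_DIM = 256    # assumption: not listed in the table


def count_parameters() -> int:
    embed = VOCAB * D_MODEL                        # tied embedding (assumed)
    q_and_o = 2 * D_MODEL * N_Q_HEADS * HEAD_DIM   # query and output projections
    k_and_v = 2 * D_MODEL * N_KV_HEADS * HEAD_DIM  # key and value projections
    # Assumption: D_FF counts the gate and up branches together, so the
    # gate, up, and down matrices each have shape D_MODEL x (D_FF // 2).
    ffn = 3 * D_MODEL * (D_FF // 2)
    norms = 4 * D_MODEL                            # assumed four RMSNorm scales per layer
    per_layer = q_and_o + k_and_v + ffn + norms
    return embed + N_LAYERS * per_layer + D_MODEL  # plus one final RMSNorm


def kv_cache_bytes(seq_len: int, kv_heads: int, bytes_per_value: int = 2) -> int:
    """Keys plus values for every layer, ignoring sliding-window savings."""
    return 2 * N_LAYERS * kv_heads * HEAD_DIM * seq_len * bytes_per_value


print(count_parameters())                  # 2614636800
print(kv_cache_bytes(8192, N_KV_HEADS))    # 872415232
```

Under these assumptions the count comes to 2,614,636,800, matching the 2.61 B figure. The key-value cache at the full 8192-token context with 16-bit values is about 0.87 GB with 4 KV heads, half of what 8 KV heads would need.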

Training runs on TPU v5e clusters with ZeRO-3-style optimizer state sharding and JAX with GSPMD for scalable partitioning (Team et al., 2024).

2. Performance on Benchmarks and Downstream Tasks

Gemma-2-2B achieves best-in-class performance among open 2B-scale models and partially closes the gap to larger 7–8B models. On standard language and reasoning tasks, the pre-trained model reports:

MMLU (5-shot): 52.2%
GSM8K (5-shot): 24.3%
Winogrande (5-shot): 71.3%
ARC-Challenge (25-shot): 55.7%

Instruction tuning further boosts scores; e.g., MMLU rises to 56.1%, MBPP pass@1 to 36.6% (Team et al., 2024).

Applications include:

Code Completion: CodeGemma 2B, a specialized code model, achieves HumanEval infilling pass@1 of 79.3% (v1.1, single-line), MBPP 49.2%, and is competitive with DeepSeek Coder 2B and StarCoder2 2B, while maintaining sub-8GB VRAM requirements and fast generation (≈240 tokens/sec) (Team et al., 2024).
Text-to-SQL: GEMMA-SQL, fine-tuned from Gemma-2-2B, achieves 66.8% Test-Suite and 63.3% Exact Set Match accuracy on SPIDER, outperforming IRNet and RYANSQL and roughly matching CodeXDavinci on exact-match accuracy (Pandey et al., 5 Nov 2025).
Multimodal (VLM): With CLIP or DinoV2 vision backbones, LLaVA-Gemma-2B attains 71.4% on VQAv2 and matches or surpasses Phi-2B on GQA, a strong result for its parameter count. These results assume connector pretraining and, for reasoning-oriented benchmarks, the DinoV2 backbone (Hinck et al., 2024).

3. Mechanisms of In-Context Learning and Model Circuits

Causal analysis of Gemma-2-2B's few-shot learning circuits reveals a two-step neural pipeline:

Contextualization (Layers 1–8): Representations of each example in the prompt are enriched by input–input and output–output inter-example connectivity (e.g., $x_i \rightarrow x_{i+1}$, $y_i \rightarrow y_{i+1}$), encoding both positional and type information.
Aggregation (Layers 12–20): Task vectors are constructed at the query position by aggregating contextualized outputs from all examples, enabling function inference and generalization to ambiguous tasks.

Quantitative circuit ablations show that “contextualize-then-aggregate” recovers more than 90% of full-model accuracy versus parallel circuits, with the contextualization step critical for resolving semantic ambiguities (Bakalova et al., 31 Mar 2025).

4. Hallucination Dynamics and Vulnerabilities

Large-scale evaluation exposes a structural blind spot in Gemma-2-2B for prompts containing symbolic triggers such as modifiers, named entities, numbers, negation, and exceptions. On reformatted HaluEval and TruthfulQA datasets (n=600), the model yields an average hallucination rate of 79%; modifiers and named entities each trigger rates above 83%, and exceptions lead to rates of up to 100% (Lamba et al., 9 Sep 2025). The problem remains acute even in Gemma-2-27B, which retains a 63.9% average hallucination rate.

Analysis ties this to:

Diminished attention allocation: In mid-to-deep layers, symbolic tokens receive systematically lower attention in MCQ and odd-one-out formats versus open QA, aligning with elevated hallucination.
Activation instability: Hallucination rates peak in prompts of 10–30 tokens, reflecting unstable symbolic representations.

Proposed mitigation strategies include activation patching, prompt engineering that highlights symbolic tokens, and potentially new training objectives oriented to symbolic grounding (Lamba et al., 9 Sep 2025).

5. Interpretability and Sparse Feature Decomposition

Gemma Scope delivers comprehensive JumpReLU sparse autoencoders (SAEs) for Gemma-2-2B at all major network sites. Mid-network SAEs trained with strong L₀ penalties extract compact, human-interpretable feature sets:

Layer 12, width 131k: mean L₀ ≈ 20 with fraction-of-variance-unexplained (FVU) ≈ 0.14–0.18, ΔLM loss ≈ 0.018–0.048.
Human raters identified 70–80% of wide SAEs’ latents as interpretable (e.g., punctuation, function words, capitalization) (Lieberum et al., 2024).
SAE latents and LM-generated explanations correlate at Pearson $r \sim 0.6$.

SAEs are deployable for intervention, safety-focused feature steering, and causal mediation, facilitating model analysis far beyond standard post hoc methods (Lieberum et al., 2024).
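The quantities reported for these SAEs can be made concrete with a small sketch of the JumpReLU activation, the L₀ count, and FVU. This is an illustrative pure-Python version under simplifying assumptions (per-latent thresholds supplied directly, FVU computed over a small batch against the per-dimension batch mean); the function names are hypothetical and do not correspond to the Gemma Scope code.

```python
from typing import List, Sequence


def jump_relu(pre_acts: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """Pass a pre-activation through unchanged if it exceeds its threshold, else 0."""
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]


def l0(latents: Sequence[float]) -> int:
    """Number of active (non-zero) latents for one input."""
    return sum(1 for a in latents if a != 0.0)


def fvu(inputs: Sequence[Sequence[float]], recons: Sequence[Sequence[float]]) -> float:
    """Squared reconstruction error divided by the total variance of the inputs."""
    dim = len(inputs[0])
    mean = [sum(x[d] for x in inputs) / len(inputs) for d in range(dim)]
    resid = sum((x[d] - r[d]) ** 2 for x, r in zip(inputs, recons) for d in range(dim))
    total = sum((x[d] - mean[d]) ** 2 for x in inputs for d in range(dim))
    return resid / total


latents = jump_relu([0.5, 2.0, -1.0], [1.0, 1.0, 1.0])
print(latents, l0(latents))  # [0.0, 2.0, 0.0] 1
```

An FVU of 0 means perfect reconstruction, and an FVU of 1 is what a reconstruction that always outputs the batch mean would score, so the reported 0.14–0.18 range indicates that most of the activation variance is captured with roughly 20 active latents.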

6. Safety, Refusal Behavior, and Auditability

Biosecurity refusal audits using both external (surface label) and internal (SAE latent) metrics demonstrate that Gemma-2-2B-IT's refusal is highly format- and length-dependent. Under minimal prompts, the model universally hedges rather than refuses, regardless of biological hazard tier. Only with explicit prompts does refusal emerge (up to 67% on hazard, Table 2). Constraining generation to 80 tokens collapses refusal rate to 0%, indicating surface compliance is budget-fragile (DeLeeuw, 28 May 2026).

SAEs tuned to biology-specific features confirm that refusal–compliance separation is weak; “refusal” features often fire on benign or legally salient but non-hazardous queries (e.g., psilocybin instructions), suggesting that the refusal circuitry primarily tracks cultural/legal salience rather than intrinsic hazard. Mean D-score (internal–surface feature divergence) increases with hazard tier but fails to show robust comply–refuse separation in Gemma-2-2B, underscoring the brittleness of its safety behaviors (DeLeeuw, 28 May 2026).

7. Adaptations, Extensions, and Applications

Gemma-2-2B has served as the backbone for multiple architectural and application-level innovations:

Encoder-decoder adaptation: Splitting the pretrained 2B stack into a symmetric encoder–decoder configuration yields Gemma-2B-2B, with newly inserted cross-attention and modest adaptation. This variant delivers up to about 7% higher instruction-tuned performance and a 12.6-point higher SuperGLUE score, with comparable inference latency and memory (Zhang et al., 8 Apr 2025).
Code-generation specialization: CodeGemma 2B introduces fill-in-the-middle (FIM) sentinel tokens and code-only pretraining, achieving state-of-the-art code infilling and text-to-code translation for 2B-scale models (Team et al., 2024).
Text-to-SQL modeling: GEMMA-SQL leverages LoRA adapters, schema-aware prompting, and iterative re-prompting to achieve 66.8% test-suite accuracy, all with <16 GB VRAM requirements (Pandey et al., 5 Nov 2025).
Multimodal deployments: LLaVA-Gemma-2B fuses a pre-trained vision backbone (CLIP/DinoV2) into the text stack via a lightweight MLP connector, enabling VQA and multimodal instruction following at modest compute and memory footprints (Hinck et al., 2024).
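To make the fill-in-the-middle mechanism mentioned for CodeGemma more concrete, the sketch below assembles an infilling prompt in prefix-suffix-middle order. The sentinel strings and their ordering are assumptions for illustration and should be checked against the tokenizer of the model actually used; build_fim_prompt is a hypothetical helper, not part of any library.

```python
# Assumed sentinel strings; verify against the model's tokenizer before use.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the gap so the model generates the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

The model is expected to continue after the final sentinel with the missing span (here, something like an expression adding a and b), which the caller then splices between the prefix and suffix.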

Gemma-2-2B thus represents a widely extensible and interpretable foundation, balancing modern transformer efficiency with broad applicability to code, natural language, multimodal, and specialized safety settings.