
Gemma 2B Model Overview

Updated 16 January 2026
  • Gemma 2B is a lightweight, open large language model featuring 2–2.6B parameters and a decoder-only Transformer architecture optimized for research and practical deployment.
  • It leverages large-scale student–teacher knowledge distillation to boost performance on tasks such as code completion and text-to-SQL, and its in-context learning circuits have been characterized in detail.
  • Efficient parameterization, mechanistic interpretability via sparse autoencoders, and versatile fine-tuning pipelines enable its use in multimodal systems and circuit discovery projects.

Gemma 2B is a lightweight, high-efficiency open LLM that operationalizes Gemini research into a deployable Transformer backbone. At approximately 2–2.6 billion parameters depending on variant, Gemma 2B implements a decoder-only architecture with substantial engineering optimizations focused on practical deployment, mechanistic interpretability, and cross-task adaptability. The model is the foundational member of the Gemma 2 family and is frequently used as a research substrate for probing Transformer circuits, fine-tuning pipelines, and open-source multimodal systems (Team et al., 2024, Team et al., 2024, Ferrando et al., 2024). Gemma 2B and its immediate successors have been adopted in downstream code-specialized models (CodeGemma), encoder–decoder adaptations, text-to-SQL systems, multimodal assistants, and circuit discovery projects.

1. Architecture: Core Design and Parameterization

Gemma 2B comprises 18–26 Transformer decoder layers depending on release, with hidden dimension $d_{\mathrm{model}}$ in the 2048–2304 range and grouped multi-query self-attention (GQA or MQA) for memory efficiency and inference speed (Team et al., 2024, Team et al., 2024). Early releases specify 8–16 attention heads per layer, each with 256-dimensional head embeddings. The feed-forward MLP stack uses GeGLU activations, a large inner width (18432–32768 depending on release), and RMSNorm for stabilization. Embedding tables are exceptionally large ($V = 256{,}128$ in the pretraining release, up to 256k in multimodal variants), and the supported context length is 8192 tokens.

Later Gemma 2B variants (Gemma 2) interleave local sliding-window ($w = 4096$) and global full-context attention every other layer, exploiting $\mathcal{O}(\ell \cdot w)$ scaling. They apply logit soft-capping ($c = 50$ for self-attention, $c = 30$ for the LM head) and rotary position encodings (RoPE). The architectural formula for parameter count is:

$$
\begin{aligned}
P_{\mathrm{embed}} &= V \times D \\
P_{\mathrm{transformer}} &= L\left((H+2)\,D\,d_{\mathrm{head}} + D^2 + 2DF\right) \\
P_{\mathrm{total}} &= P_{\mathrm{embed}} + P_{\mathrm{transformer}}
\end{aligned}
$$

Typical total parameter counts are approximately 2.0–2.6 billion.
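As a rough illustration of the formula, the sketch below plugs in Gemma-2-2B-class configuration values (26 layers, $d_{\mathrm{model}} = 2304$, 8 heads of 256 dimensions, MLP inner width 9216, a 256,128-token vocabulary); these specific numbers are illustrative assumptions, not an exact reproduction of any released checkpoint.

```python
# Approximate parameter count from the formula above.
# Configuration values are illustrative Gemma-2B-class numbers.

def param_count(V, D, L, H, d_head, F):
    """Approximate parameters of a decoder-only Transformer."""
    p_embed = V * D                                      # token embedding table
    p_layer = (H + 2) * D * d_head + D**2 + 2 * D * F    # attention + MLP per layer
    return p_embed + L * p_layer

total = param_count(V=256_128, D=2304, L=26, H=8, d_head=256, F=9216)
print(f"~{total / 1e9:.2f}B parameters")  # lands near the low end of the 2.0-2.6B range
```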

2. Training Methodologies

Gemma 2B is pretrained on English web text, open-source code, and scientific corpora, using a SentencePiece tokenizer with byte-level fallback and digit splitting. Data curation filters out toxic or personal material and up-weights clean data late in training (Team et al., 2024, Team et al., 2024).

A major innovation of Gemma 2 is student–teacher knowledge distillation at scale: the 2B and 9B models learn from teacher soft logits rather than simple next-token prediction, minimizing:

$$
\mathcal{L}_{\mathrm{KD}} = \mathbb{E}_{x_c}\Big[-\sum_{x} P_T(x \mid x_c)\,\log P_S(x \mid x_c)\Big]
$$

with $P_T$ the predictive distribution of a larger teacher (typically 7B–27B parameters). Gemma 2B is exposed to 2T–3T tokens during training.
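The distillation objective can be sketched directly. The snippet below is a minimal PyTorch rendering of the soft-target cross-entropy above; the tensor shapes and the optional temperature are illustrative assumptions, not details taken from the Gemma 2 report.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft-target distillation: cross-entropy of the student's next-token
    distribution against the teacher's, averaged over all positions."""
    # Assumed shapes: (batch, seq_len, vocab_size)
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # -sum_x P_T(x | x_c) log P_S(x | x_c), averaged over contexts x_c
    return -(t_log_probs.exp() * s_log_probs).sum(dim=-1).mean()

# Toy usage with random logits over a vocabulary of 8 tokens
student = torch.randn(2, 5, 8)
teacher = torch.randn(2, 5, 8)
print(kd_loss(student, teacher))
```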

Instruction-tuning for Gemma 2B-IT involves supervised fine-tuning (SFT) on curated English prompt–response mixes and RLHF with Bradley–Terry rewards.
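As a sketch of the reward-modeling side of that pipeline, the pairwise Bradley–Terry loss can be written as below; the scalar rewards are toy values and the helper is hypothetical, not code from the Gemma training stack.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise Bradley-Terry objective: push the preferred response
    to receive the higher scalar reward."""
    # P(chosen preferred) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: rewards for a batch of four preference pairs
r_chosen = torch.tensor([1.2, 0.4, 0.9, 2.0])
r_rejected = torch.tensor([0.3, 0.8, -0.1, 1.5])
print(bradley_terry_loss(r_chosen, r_rejected))
```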

Significant practical engineering extends to TPUv5e hardware, ZeRO-3-style optimizer-state sharding, and carbon-neutral compute. All major checkpoints are released for downstream fine-tuning or probe development.

3. Mechanistic and Cross-Linguistic Circuitry

Gemma 2B is extensively studied for its low-dimensional circuit implementations. Subject–verb agreement (SVA) investigations reveal a highly universal mechanism: a single attention head (L13H7) writes a "subject number" direction (identified as the first principal component of residual-stream activations) that is read by individual MLP neurons (e.g., neuron 2069 in MLP layer 13). This direction is language-independent and can be steered or patched to causally flip model outputs in both English and Spanish (Ferrando et al., 2024).

The exact circuit:

  • Signal writing: $r_{13} = r_{12} + W_O\,\mathrm{AttnOut}^{13,7} + \ldots$, with $\mathrm{AttnOut}^{13,7}$ aligned to the subject-number direction $d$.
  • Signal reading: gated-MLP neuron $k$ in layer 13 computes $\mathrm{activation}_k = \mathrm{in}_k \circ \mathrm{GeLU}(\mathrm{gate}_k)$, with $d^\top W_{\mathrm{in}}[:, 2069]$ controlling verb plurality.

The same principle generalizes across Gemma 1/2 and 7B models, with >0.9 cosine similarity in PC1 directionality across languages and variants. Causal interventions (activation patching, steering) validate the circuit's necessity and sufficiency for SVA.
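A minimal sketch of the kind of residual-stream steering intervention described above is given below, assuming TransformerLens hook naming and access to the gated Gemma checkpoint; the random direction stands in for the precomputed subject-number direction, and the steering coefficient is a placeholder.

```python
import torch
from transformer_lens import HookedTransformer

# Load Gemma 2B through TransformerLens (requires access to the gated weights).
model = HookedTransformer.from_pretrained("gemma-2b")

# Placeholder for the precomputed unit-norm "subject number" direction
# (in the paper's setting, PC1 of residual-stream activations).
d = torch.randn(model.cfg.d_model)
d = d / d.norm()

def steer(resid, hook, direction=d, alpha=8.0):
    # Add the direction to the residual stream after layer 13 to nudge
    # the model's verb-number prediction.
    return resid + alpha * direction.to(resid.device, resid.dtype)

prompt = "The keys on the cabinet"
logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[("blocks.13.hook_resid_post", steer)],
)
print(model.to_string(logits[0, -1].argmax()))
```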

4. In-Context Learning and Task Circuitry

Detailed analysis of Gemma-2 2B for in-context learning reveals a two-stage "contextualize-then-aggregate" algorithm (Bakalova et al., 31 Mar 2025). In lower layers, heads cross-attend between few-shot example tokens, building up contextualized representations that encode input type, output type, and sometimes higher-order functional mappings. In higher layers, specialized "function-vector" heads aggregate across output examples to synthesize a task vector driving prediction.

Causal patching experiments identify which connections are essential for task generalization, showing that aggregation circuits alone are insufficient in ambiguous settings—contextualization across examples is critical.
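A compact sketch of activation patching in this spirit is shown below, again assuming TransformerLens conventions; the layer and head indices and the clean/corrupted prompt pair are hypothetical, chosen only to illustrate overwriting one head's output with its value from a clean run.

```python
from transformer_lens import HookedTransformer

# Model name as registered in recent TransformerLens releases (assumed).
model = HookedTransformer.from_pretrained("gemma-2-2b")

clean = "apple -> fruit\ncarrot -> vegetable\nrose ->"
corrupt = "apple -> pomme\ncarrot -> carotte\nrose ->"

# Cache the clean run, including per-head attention outputs ("z").
_, clean_cache = model.run_with_cache(clean)

LAYER, HEAD = 20, 3  # hypothetical "function-vector" head

def patch_head(z, hook):
    # z: (batch, seq, n_heads, d_head); overwrite one head's output at the
    # final position with its value from the clean run.
    z[:, -1, HEAD, :] = clean_cache[hook.name][:, -1, HEAD, :]
    return z

patched_logits = model.run_with_hooks(
    corrupt,
    fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", patch_head)],
)
print(model.to_string(patched_logits[0, -1].argmax()))
```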

5. Fine-Tuning, Adaptation, and Specialized Variants

Gemma 2B is further adapted into encoder–decoder architectures (2B-2B) via parameter sharing and introduction of cross-attention modules. The adaptation uses copied self-attention matrices and is trained via PrefixLM+KD and UL2 objectives. This yields substantial efficiency and quality improvements: +7 points on instruction-tuned benchmarks and +12.6 points on SuperGLUE versus decoder-only Gemma 2B, under equal inference budgets (Zhang et al., 8 Apr 2025).

Code-specialized variants, notably CodeGemma 2B, extend the base backbone with fill-in-the-middle control tokens, aggressive code-only pretraining (up to 1T tokens), and fast IDE-ready inference. CodeGemma 2B demonstrates state-of-the-art code completion and infilling among open 2B models at 2× the speed of competitors (Team et al., 2024).
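A minimal sketch of fill-in-the-middle prompting with CodeGemma's FIM control tokens via Hugging Face transformers is shown below; the checkpoint name follows the published release (access/licensing may apply), and the generation settings are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/codegemma-2b"  # published checkpoint; gated on Hugging Face
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

# Fill-in-the-middle: the model completes the region between prefix and
# suffix, delimited by CodeGemma's FIM control tokens.
prompt = (
    "<|fim_prefix|>def mean(xs):\n    return "
    "<|fim_suffix|>\n\nprint(mean([1, 2, 3]))\n"
    "<|fim_middle|>"
)

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```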

Downstream applications include text-to-SQL (GEMMA-SQL), multimodal systems (LLaVA-Gemma), and prompt-recovery pipelines (as in Gemma-2b-it + Phi2) (Pandey et al., 5 Nov 2025, Hinck et al., 2024, Chen et al., 2024).

6. Evaluation: Benchmarks, Robustness, and Safety

Gemma 2B achieves competitive performance for its scale:

  • MMLU (5-shot): 42.3% (Gemma 1 2B) to 52.2% (Gemma 2 2B)
  • ARC-C (25-shot): up to 55.7%
  • GSM8K (5-shot): up to 24.3%
  • Additional tracked benchmarks include HellaSwag, PIQA, SIQA, BBH, Winogrande, MBPP, HumanEval, and MATH (Team et al., 2024, Team et al., 2024)

In human-preference safety tests, Gemma 2B IT attains a 60.1% win-rate vs. Mistral 7B for safety and a 45.0% win-rate for instruction following, with confidence intervals detailed in original tables.

Symbolic vulnerability remains an unresolved challenge: modifiers and named entities trigger hallucinations at rates of 84.76–94.98% in QA-format tests on HaluEval and TruthfulQA; only modest scaling improvements are observed in larger Gemma (27B) models (Lamba et al., 9 Sep 2025). Attention-score analyses tie hallucination to local representational fragility and failure to robustly encode discrete symbolic operators.

Safety benchmarking shows Gemma 2B IT competitive with PaLM/PaLM 2 models, maintaining low memorization and favorable metrics on RealToxicity, CrowS-Pairs, BBQ, and TruthfulQA (Team et al., 2024).

7. Interpretability, Sparse Autoencoders, and Practical Deployment

Gemma Scope introduces over 400 JumpReLU sparse autoencoders (SAEs) trained on all layers, making model analysis approachable (Lieberum et al., 2024). SAEs decompose activations into tens of thousands of sparse, monosemantic features. Key metrics:

  • Sparsity of $L_0 \approx 50$ yields a $\Delta$ LM loss of roughly 0.02 nats/token.
  • The fraction of variance unexplained is roughly 0.15 at high sparsity.
  • 60–70% of sampled features judged "meaningful" in human annotation.

SAE latents reveal interpretable circuits for syntactic and semantic phenomena. These modules plug in directly as TransformerLens hooks, facilitating circuit-level interventions and modular analysis.
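A minimal sketch of a JumpReLU SAE forward pass in the spirit of Gemma Scope follows; the dimensions, initialization, and threshold values are placeholders standing in for a trained SAE rather than the released weights.

```python
import torch

class JumpReLUSAE(torch.nn.Module):
    """Minimal JumpReLU sparse autoencoder: a linear encoder with a learned
    per-latent threshold and a linear decoder back to the residual stream."""

    def __init__(self, d_model=2304, d_sae=16384):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.W_dec = torch.nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))
        self.threshold = torch.nn.Parameter(torch.full((d_sae,), 0.05))

    def encode(self, resid):
        pre = resid @ self.W_enc + self.b_enc
        # JumpReLU: keep pre-activations only where they exceed the learned
        # threshold; everything else is zeroed, which enforces sparsity.
        return torch.where(pre > self.threshold, pre, torch.zeros_like(pre))

    def forward(self, resid):
        feats = self.encode(resid)
        recon = feats @ self.W_dec + self.b_dec
        return recon, feats

# Toy usage on fake residual-stream activations
sae = JumpReLUSAE()
recon, feats = sae(torch.randn(4, 2304))
print("active latents per example:", (feats != 0).sum(-1))
```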

Gemma 2B is released with deployment scripts supporting single-GPU inference (<16 GB VRAM), adapter-based fine-tuning (LoRA), and integration into Keras/HuggingFace pipelines. Multimodal and code variants extend Gemma 2B into vision-language and software completion domains respectively, always leveraging its compact, efficient backbone.
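A minimal sketch of the adapter-based fine-tuning setup with Hugging Face transformers and PEFT is shown below; the target module names, rank, and dtype are illustrative assumptions rather than prescribed defaults.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "google/gemma-2b"  # base checkpoint (license acceptance required)
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach low-rank adapters to the attention projections; only these small
# adapter matrices are trained, keeping memory within a single GPU.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```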


