Gemma Encoder: Decoder-to-Encoder Adaptation
- Gemma Encoder is a model adaptation that converts decoder-only LLMs into bidirectional encoders by replacing causal masks with full attention and adding a task-specific pooling + MLP head.
- It employs various pooling techniques, such as mean and last-token pooling, to efficiently aggregate token representations for classification, regression, and ranking tasks.
- The adaptation demonstrates competitive performance on benchmarks like GLUE, SuperGLUE, and MS MARCO, and supports modular extensions for multimodal and dense embedding applications.
Gemma Encoder refers to a direct architectural adaptation of the Gemma decoder-based LLMs for encoder-dominated tasks in natural language processing, such as classification, regression, and ranking. Developed originally to leverage the pretrained capacity of Gemma’s decoder-only stack, Gemma Encoder unifies the efficiency of bidirectional transformers with transfer learning and remains parameter-compatible with the Gemma open model family. The technique is notable for its simplicity: unlocking bidirectional attention and attaching lightweight pooling + MLP heads enables state-of-the-art performance on key benchmarks, validating the use of large-scale decoder pretraining for non-generative tasks (Suganthan et al., 4 Mar 2025).
1. Architectural Adaptation: From Decoder to Encoder
The primary methodological innovation is a minimalist conversion of the original Gemma decoder stack (2B or 9B parameter scales) to a true encoder by three core modifications:
- Attention Masking: The causal mask, $M_{ij} = 1$ for $j \le i$ and $0$ otherwise, is replaced with a fully visible (bidirectional) mask, $M_{ij} = 1$ for all $i, j$. This enables every token to attend to every other token, as required for encoder tasks.
- Parameter Structure: All original attention (Q, K, V) and feed-forward (W₁, W₂) weights per transformer block are preserved identically; thus, the self-attention operation remains
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
but with bidirectional context.
- Task Head: The decoder’s autoregressive output head is dropped. In its place, a pooling module aggregates the last layer’s activations $H = (h_1, \dots, h_L)$, followed by a randomly initialized, task-specific MLP:
$$z = \mathrm{Pool}(H), \qquad \hat{y} = \mathrm{MLP}(z)$$
Here, $z$ is a fixed-dimensional vector representing the input sequence, after pooling.
No adapters, LoRA modules, or structural modifications beyond this workflow are utilized; the transfer is entirely at the masking and head level (Suganthan et al., 4 Mar 2025).
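The masking change described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the single-head simplification, and the use of plain Python lists are assumptions made for illustration. It shows that swapping the causal mask for a fully visible one changes only which positions each token can attend to, while the attention computation itself stays the same.

```python
import math

def causal_mask(n):
    """M[i][j] = 1 if token i may attend to token j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Fully visible mask: every token may attend to every token."""
    return [[1] * n for _ in range(n)]

def softmax(xs):
    """Numerically stable softmax; masked (-inf) entries get weight 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, mask):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(k[0])
    out = []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            s = sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k)
            # Disallowed positions are set to -inf before the softmax.
            scores.append(s if mask[i][j] else float("-inf"))
        weights = softmax(scores)
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

# Toy input: three tokens with two-dimensional states, used as Q, K, and V.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal_out = attention(x, x, x, causal_mask(3))
bidir_out = attention(x, x, x, bidirectional_mask(3))
```

With the causal mask, the first token can only see itself, so its output equals its own value vector; with the bidirectional mask, the same token also mixes in information from later positions. The last token sees the full sequence under both masks.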
2. Pooling Mechanisms and Downstream Representation
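Gemma Encoder systematically benchmarks five pooling schemes to aggregate the tokenwise hidden states $H = (h_1, \dots, h_L)$ from the encoder's final layer:
| Pooling Strategy | Description |
|---|---|
| First-$k$ pooling | Concatenate the first $k$ hidden states: $z = [h_1; \dots; h_k]$ |
| Last-$k$ pooling | Concatenate the last $k$ hidden states: $z = [h_{L-k+1}; \dots; h_L]$ |
| Mean pooling | Average over tokens: $z = \frac{1}{L}\sum_{i=1}^{L} h_i$ |
| KV-probe attention | A latent query pools over keys and values projected from the hidden states: $K = H W_K$, $V = H W_V$ |
| Query-probe attention | Perceiver-style attention pooling with the hidden states projected on the query side |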
Empirically, mean pooling or last-token pooling is as effective as, or more robust than, the more complex attention-based pooling, especially on GLUE (2B: 89.4 vs. 89.0–89.1; 9B: 90.9 vs. 90.6–90.7). This suggests that representation averaging, or deterministic token selection at the sequence end, suffices for most classification and regression use cases, and it challenges the need for auxiliary parametric pooling in these architectures (Suganthan et al., 4 Mar 2025).
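The three non-parametric pooling schemes can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the optional `valid` flag for skipping padding tokens in the mean is an assumption that the source does not describe.

```python
def first_k_pool(h, k):
    """Concatenate the first k hidden states into one vector (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [x for vec in h[:k] for x in vec]

def last_k_pool(h, k):
    """Concatenate the last k hidden states; k = 1 is last-token pooling."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [x for vec in h[-k:] for x in vec]

def mean_pool(h, valid=None):
    """Average hidden states over tokens.

    `valid` (assumed here, not from the source) marks non-padding tokens
    with 1 so that padding does not dilute the average.
    """
    if valid is None:
        valid = [1] * len(h)
    kept = [vec for vec, keep in zip(h, valid) if keep]
    return [sum(col) / len(kept) for col in zip(*kept)]

# Toy final-layer states: three tokens, hidden size two.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled_mean = mean_pool(hidden)          # average of all three tokens
pooled_last = last_k_pool(hidden, 1)     # state of the final token
```

Note that first-$k$ and last-$k$ pooling produce a vector of size $k$ times the hidden size, whereas mean pooling keeps the hidden size unchanged, so the input width of the MLP head depends on the pooling choice.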
3. Hyperparameter Ablations for Encoder Fine-Tuning
Fine-tuning Gemma Encoder on encoder-dominated tasks involves several hyperparameter investigations:
- Attention Mask: Bidirectional masking is mandatory for effective context use; causal masking degrades performance.
- Dropout Rate: Dropout is applied after the attention softmax and after the feed-forward outputs. Experiments over several dropout rates found the optimum at $0.1$ (10%), which mitigates overfitting on benchmarks with modest sample sizes.
- Padding Direction: Right-padding was selected (though left/right had no impact on results).
- Layer Normalization: RMSNorm, with its settings retained from the original Gemma pretraining, is used before the attention and feed-forward sublayers.
- Other Fixed Parameters: Hidden sizes, number of heads, and all weights outside the pooling module and task head are preserved from the decoder checkpoints to maintain the pretrained inductive bias.
The rationale is that bidirectional masking is necessary for context aggregation; 10% dropout combats overfitting on datasets such as GLUE, which remain small relative to model capacity (Suganthan et al., 4 Mar 2025).
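Two of the components named above, RMSNorm and dropout, can be sketched as follows. This is a generic textbook formulation, not Gemma's code: the `eps` default is an assumed placeholder, the plain multiplicative `gain` may differ from Gemma's actual parameterization, and the function names are hypothetical.

```python
import math
import random

def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root mean square.

    eps=1e-6 is an assumed placeholder, not a value taken from the source.
    """
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * g for v, g in zip(x, gain)]

def dropout(x, rate=0.1, rng=random, training=True):
    """Inverted dropout: zero each entry with probability `rate`,
    and rescale survivors by 1 / (1 - rate) so the expectation is unchanged.
    """
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

# Pre-norm usage: normalize before a sublayer, apply dropout to its output.
state = [3.0, 4.0]
normed = rms_norm(state, gain=[1.0, 1.0])
dropped = dropout(normed, rate=0.1, rng=random.Random(0))
```

At evaluation time, `training=False` turns dropout into the identity, which is why the rescaling is done during training rather than at inference.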
4. Implementation Specifics and Model Scaling
The Gemma Encoder directly inherits the depth and width of the original decoder stacks:
| Model | Layers | $d_{\text{model}}$ | $d_{\text{ff}}$ | Heads | Params |
|---|---|---|---|---|---|
| Gemma 2B | 24 | ≈2048 | ≈8192 | ≈32 | ~2B |
| Gemma 9B | 32 | ≈4096 | ≈16384 | ≈64 | ~9B |
The entire model is fine-tuned end-to-end, with a randomly initialized pooling + MLP head; no adapters or additional modules are introduced. For ranking tasks (e.g., MS MARCO), input of shape $[B, N, L]$ (batch size, candidates per query, sequence length) is flattened to $[B \cdot N, L]$, and scores for each candidate are computed and reassembled into $[B, N]$ for listwise loss computation. The optimizer and schedule follow T5/Adafactor conventions, with code release to match (Suganthan et al., 4 Mar 2025).
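The flatten, score, and reassemble flow for ranking can be sketched as below, writing B for the batch size and N for the number of candidates per query (names assumed here). The source only says a listwise loss is used; the softmax cross-entropy shown is one common choice and is an assumption, as are the function names and the toy scorer.

```python
import math

def flatten_candidates(batch):
    """Turn B queries with N candidate sequences each into B*N sequences."""
    return [seq for query in batch for seq in query]

def reshape_scores(flat_scores, num_candidates):
    """Regroup B*N flat scores into B rows of N scores."""
    return [flat_scores[i:i + num_candidates]
            for i in range(0, len(flat_scores), num_candidates)]

def listwise_softmax_loss(scores, labels):
    """Softmax cross-entropy over each candidate list, averaged over queries."""
    total = 0.0
    for s_row, y_row in zip(scores, labels):
        m = max(s_row)
        log_z = m + math.log(sum(math.exp(s - m) for s in s_row))
        total += -sum(y * (s - log_z) for s, y in zip(s_row, y_row))
    return total / len(scores)

def toy_score(seq):
    """Hypothetical stand-in for encoder + pooling + MLP head."""
    return float(len(seq))

# B = 2 queries, N = 2 candidates each; sequences are lists of token ids.
batch = [[[1, 2], [3]], [[4], [5, 6]]]
flat = flatten_candidates(batch)                  # 4 sequences
scores = reshape_scores([toy_score(s) for s in flat], num_candidates=2)
labels = [[1, 0], [0, 1]]                         # relevant candidate per query
loss = listwise_softmax_loss(scores, labels)
```

Flattening lets the encoder treat every query-candidate pair as an ordinary batch element, while the reshape restores the per-query grouping that a listwise loss needs.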
5. Empirical Performance: Benchmarks and Comparisons
Gemma Encoder demonstrates high competitiveness across prominent benchmarks:
- GLUE dev average: Gemma 2B achieves an average of 89.2, exceeding T5-large (88.2) and approaching T5-XL (90.1); Gemma 9B attains 90.9, slightly surpassing T5-XXL at a comparable scale.
- SuperGLUE dev average: Gemma 2B and 9B score 85.9 and 90.8, respectively; 9B closes the gap to T5-XXL (91.4).
- MS MARCO ranking: Gemma 2B and 9B obtain MRR@10 scores of 0.4456 and 0.4450, respectively, both exceeding RankT5-XL (0.4358), and NDCG@10 scores similarly exceed the baseline.
The models outperform prior T5-based approaches on both classification and ranking without additional pretraining. A plausible implication is that adaptation of decoder-only models with full bidirectional context is a viable paradigm for encoder-heavy benchmarks, even surpassing architecturally "native" encoder models when scaling is sufficient (Suganthan et al., 4 Mar 2025).
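For readers unfamiliar with the ranking metric quoted above, MRR@10 can be computed as in the following sketch. The function name and input format (one list of 0/1 relevance flags per query, ordered by model score) are assumptions for illustration and are not taken from the paper's evaluation code.

```python
def mrr_at_k(ranked_relevance, k=10):
    """Mean reciprocal rank of the first relevant item within the top k.

    ranked_relevance: one list per query of 0/1 relevance flags, already
    sorted by descending model score. Queries with no relevant item in
    the top k contribute 0.
    """
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

# Three queries: first hit at rank 2, at rank 1, and no hit at all.
example = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
score = mrr_at_k(example)
```

Because only the first relevant result counts and the reciprocal decays quickly with rank, small differences in MRR@10 mostly reflect how often the top one or two positions are correct.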
6. Practical Extensions and Broader Applications
The Gemma Encoder adaptation paradigm is foundational for a larger suite of Gemma model repurposing efforts, including:
- Encoder-Decoder Gemma Adaptation: Structured transfer of decoder stacks to encoder–decoder LLMs using the same masking, with cross-attention and joint pretraining/fine-tuning objectives, yielding gains of 5–7% absolute after instruction tuning and flexible parameter allocation across encoder/decoder (Zhang et al., 8 Apr 2025).
- Internal World Modularization: Integration of frozen Gemma 3 middle layers within modular MLP front- and back-ends for efficient representation transfer; ablation studies on wildfire prediction validate the utility of fixed "internal world" representations (Jadouli et al., 20 Apr 2025).
- Multimodal and Dense Embedding Integration: Injection of dense geospatial embeddings as soft tokens into the Gemma LLM backbone for multi-modal reasoning, enabled by lightweight projectors aligning external encoder outputs with Gemma's hidden states (Zhang et al., 8 Apr 2026).
These applications demonstrate that the encoder architecture adaptation is robust, extensible, and generalizes across task modalities and data types.
7. Significance for LLM Development
Gemma Encoder evidences that pretrained decoder-only stacks can be efficiently repurposed for non-generative, encoder-dominated tasks via minimal yet principled architectural modification. Its competitive results on classification, regression, and ranking tasks with direct transfer underscore the utility and scalability of such adaptation. This model family provides a backbone for further encoder–decoder conversion, modular integration, and direct feature reasoning with dense embeddings, contributing to the broader trend of unified, repurposable LLM architectures in downstream NLP (Suganthan et al., 4 Mar 2025, Zhang et al., 8 Apr 2025, Jadouli et al., 20 Apr 2025, Zhang et al., 8 Apr 2026).