
T5Gemma 2: Efficient Multilingual & Multimodal LLM

Updated 19 December 2025
  • T5Gemma 2 is an open encoder–decoder LLM known for its advanced multilingual, multimodal, and long-context processing capabilities.
  • It utilizes the UL2 adaptation strategy with tied word embeddings and merged attention, achieving parameter efficiency with minimal quality loss.
  • Empirical evaluations show that T5Gemma 2 outperforms previous models in long-context and multilingual tasks while maintaining robust multimodal performance.

T5Gemma 2 is an open encoder–decoder LLM designed to combine strong multilingual, multimodal, and long-context processing capabilities within a computationally efficient architecture. Developed as the next iteration of the T5Gemma family, T5Gemma 2 leverages the "UL2" adaptation strategy to convert decoder-only LLM checkpoints into encoder–decoder models, generalizing this approach to multimodal and long-context tasks. The model architecture incorporates parameter-efficient techniques, such as tied word embeddings and merged attention modules, while retaining and extending the empirical benefits of the encoder–decoder paradigm compared to decoder-only variants in open LLMs (Zhang et al., 16 Dec 2025).

1. Model Architecture and Adaptation Methods

T5Gemma 2 employs a structured adaptation of pretrained Gemma 3 decoder-only weights into an encoder–decoder Transformer via the UL2 (Unifying Language Learning Paradigms) methodology. This process entails duplicating the original Transformer's weights into parallel "towers" for the encoder and the decoder. The resulting architecture applies the UL2 denoising objective, introducing five text span corruption patterns (as in Tay et al. 2022) over each input–target pair. This pretraining both warms up encoder representations for bidirectional context and maintains autoregressive generation in the decoder.
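
The duplication step itself is simple to sketch (a minimal illustration only; the function name and checkpoint representation are hypothetical, and the real adaptation also rewires the attention modules as described in the list below):

```python
import copy

def adapt_decoder_only_checkpoint(decoder_only_layers):
    """Duplicate a pretrained decoder-only stack into two parallel towers.

    The encoder copy is subsequently trained with bidirectional (unmasked)
    attention under the UL2 denoising objective, while the decoder copy
    keeps causal masking for autoregressive generation.
    """
    encoder_tower = copy.deepcopy(decoder_only_layers)
    decoder_tower = copy.deepcopy(decoder_only_layers)
    return encoder_tower, decoder_tower
```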

Two efficiency-driven architectural innovations distinguish T5Gemma 2:

  • Tied Word Embedding: All embedding operations (encoder input, decoder input, and decoder softmax) share a single lookup matrix $E \in \mathbb{R}^{|V| \times d}$, eliminating more than 10% of embedding parameters with negligible loss in quality (Table 2).
  • Merged Attention Module: The decoder’s masked self-attention and encoder cross-attention are unified into a joint module. For decoder inputs $X \in \mathbb{R}^{m \times d}$ and encoder outputs $H \in \mathbb{R}^{n \times d}$, joint queries, keys, and values are constructed via

$$Q = X W_q, \quad K = [X; H] W_k, \quad V = [X; H] W_v,$$

and attention computed as

$$A = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_h}} \odot M\right), \qquad O = A V W_o,$$

where $M$ provides future (causal) masking for decoder tokens and full attention over the encoder outputs. This unification reduces the parameter count by approximately 6.5% at a cost of only 0.3 points in pretraining quality (Table 1), and brings the decoder closer to the Gemma 3 profile; minimal sketches of both techniques follow this list.
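
A minimal sketch of the tied-embedding scheme (PyTorch-style; class and method names are illustrative, not from the released implementation):

```python
import torch
import torch.nn as nn

class TiedEmbeddings(nn.Module):
    """Single lookup matrix E (|V| x d) shared by the encoder input,
    the decoder input, and the decoder softmax projection."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, d_model)

    def embed(self, token_ids):
        # Used for both encoder and decoder input embeddings.
        return self.table(token_ids)

    def logits(self, decoder_states):
        # The output (softmax) projection reuses the same weight matrix.
        return decoder_states @ self.table.weight.T

# Illustrative vocabulary size and width (assumed, roughly consistent with
# the 168M shared embedding count of the smallest configuration below).
emb = TiedEmbeddings(vocab_size=262_144, d_model=640)
```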
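
And a single-head sketch of the merged attention computation (projection matrices passed as plain tensors; grouped-query attention, QK-norm, and RoPE are omitted; the mask is applied with -inf before the softmax, the usual numerically stable equivalent of the elementwise form above):

```python
import math
import torch
import torch.nn.functional as F

def merged_attention(X, H, Wq, Wk, Wv, Wo, d_h):
    """Joint masked self-attention over X and cross-attention over H.

    X: decoder states, shape (m, d); H: encoder outputs, shape (n, d).
    Keys and values are built over the concatenation [X; H]; the mask is
    causal over the m decoder positions and fully open over the n encoder
    positions.
    """
    m, n = X.shape[0], H.shape[0]
    Q = X @ Wq                          # (m, d_h)
    KV_in = torch.cat([X, H], dim=0)    # (m + n, d)
    K, V = KV_in @ Wk, KV_in @ Wv       # (m + n, d_h)

    causal = torch.tril(torch.ones(m, m)).bool()   # decoder part: future masked
    open_enc = torch.ones(m, n).bool()             # encoder part: fully visible
    mask = torch.cat([causal, open_enc], dim=1)    # (m, m + n)

    scores = (Q @ K.T) / math.sqrt(d_h)
    scores = scores.masked_fill(~mask, float("-inf"))
    A = F.softmax(scores, dim=-1)
    return (A @ V) @ Wo                 # (m, d)

# Toy shapes for a quick check.
d = d_h = 16
X, H = torch.randn(5, d), torch.randn(7, d)
Wq, Wk, Wv, Wo = (torch.randn(d, d_h) for _ in range(4))
out = merged_attention(X, H, Wq, Wk, Wv, Wo, d_h)   # -> shape (5, 16)
```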

Base Transformer components include grouped-query attention with QK-norm, pre- and post-layer RMSNorm, Rotary Position Embeddings (RoPE), and a hybrid schedule of local and global attention.

2. Multimodal Integration and Long-Context Handling

T5Gemma 2 extends its encoder–decoder backbone into a multimodal regime following Gemma 3, incorporating a frozen SigLIP vision encoder (400M parameters). Visual inputs are encoded as 256 tokens and prepended to text tokens presented to the encoder. Bidirectional encoder attention permits complete text–vision token interaction.

To improve long-context modeling, RoPE base frequencies are set to $10\,\mathrm{k}$ for local attention layers and $1\,\mathrm{M}$ for global layers. When evaluating on contexts up to 128k tokens, positional interpolation (as per Chen et al. 2023) is used to mitigate context-length degradation. The encoder–decoder organization and the presence of global layers contribute to a substantial advantage over decoder-only models for long-context understanding.
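
The effect of the base frequency and of positional interpolation can be sketched as follows (a minimal example; the head dimension of 128 and the 32k training length are assumptions for illustration, not values stated above):

```python
import torch

def rope_angles(positions, dim, base=1_000_000.0, interpolation_scale=1.0):
    """Rotary position embedding angles for the given integer positions.

    `base` is the RoPE base frequency (10_000 for local layers, 1_000_000
    for global layers, per the description above). Positional interpolation
    (Chen et al. 2023) rescales positions by train_len / eval_len < 1 so a
    long evaluation context maps back into the trained position range.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    scaled_positions = positions.float() * interpolation_scale
    return torch.outer(scaled_positions, inv_freq)   # (num_positions, dim // 2)

# Global-attention layer evaluated at 128k tokens, interpolated back to an
# assumed 32k training length.
angles = rope_angles(torch.arange(131_072), dim=128,
                     base=1_000_000.0,
                     interpolation_scale=32_768 / 131_072)
```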

3. Pretraining Corpus and Post-training Techniques

Pretraining is conducted on a corpus of approximately two trillion tokens, comprising multilingual web text, source code, mathematical content, and matched image–text pairs. Text pretraining employs the UL2 denoising mix with varying corruption spans (lengths 3, 12, 32, and $3/4L$, with masking ratios from 15% to 75%). For vision–language tasks, a prefix-LM scheme is used: the encoder sees all tokens up to and including the image, and the decoder predicts subsequent text.
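
As a toy illustration of the span-corruption side of this mix (heavily simplified: fixed span length, string sentinels, overlapping spans simply skipped; not the actual UL2 denoiser implementation):

```python
import random

def span_corrupt(tokens, span_len=3, corruption_rate=0.15, seed=0):
    """Replace randomly chosen spans with sentinel markers and return the
    (corrupted input, target) pair, in the spirit of UL2-style denoising."""
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    n_spans = max(1, n_to_mask // span_len)
    starts = sorted(rng.sample(range(len(tokens) - span_len), n_spans))

    inputs, targets, cursor = [], [], 0
    for i, start in enumerate(starts):
        if start < cursor:
            continue  # skip spans that overlap an earlier one
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:start + span_len])
        cursor = start + span_len
    inputs.extend(tokens[cursor:])
    return inputs, targets

corrupted, target = span_corrupt("the quick brown fox jumps over the lazy dog".split())
# corrupted contains a "<extra_id_0>" sentinel; target holds the masked span.
```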

Training hyperparameters include a global batch size of 4.2M tokens, cosine learning-rate decay (with 100 warmup steps), gradient clipping at 1.0, weight decay, and averaging over the final five checkpoints.
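
A sketch of the learning-rate schedule described above (peak and minimum learning rates and the total step count are placeholders; only the 100 warmup steps come from the text):

```python
import math

def learning_rate(step, total_steps, peak_lr, warmup_steps=100, min_lr=0.0):
    """Linear warmup for `warmup_steps`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```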

Post-training consists of lightweight instruction tuning via distillation only, using substantially less compute than Gemma 3’s full post-training pipeline and omitting reinforcement learning. That such a light recipe suffices suggests that encoder–decoder architectures are inherently well suited to instruction-following tasks.
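
Distillation-only instruction tuning of this kind is typically driven by a KL objective between teacher and student next-token distributions; a generic sketch follows (the actual recipe, teacher model, and temperature are not specified above):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence from the teacher's softened next-token distribution to
    the student's, averaged over the batch (generic knowledge-distillation loss)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```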

4. Empirical Evaluation and Task Performance

Pretraining and post-training metrics demonstrate the efficacy of the T5Gemma 2 modifications. At the 270M parameter scale, T5Gemma 2 outperforms Gemma 3 by 1.0 points on average for reasoning/factuality, 2.0 points on STEM/code, 5.8 points on multilingual tasks, and provides nontrivial scores for multimodal tasks (unavailable in Gemma 3’s text-only baseline). The most pronounced advantage is in long-context modeling, with a 19.7-point improvement (long-context avg. 31.7 vs. 12.0).

Instruction tuning further extends these gains: with only modest distillation, T5Gemma 2 at 270M parameters surpasses Gemma 3 of the same size by +18 points in long-context tasks and by double-digit margins in reasoning and code generation. At the 4B scale, T5Gemma 2 matches or exceeds Gemma 3 on nearly all pre- and post-training evaluations.

5. Model Scaling and Efficiency Trade-offs

The released configurations are summarized as follows:

Configuration   Embeddings   Encoder   Decoder   Vision Encoder
270M–270M       168M         100M      100M      417M
1B–1B           302M         698M      698M      417M
4B–4B           675M         3.2B      3.2B      417M

Tied embeddings eliminate more than 10% of the embedding parameters, and the merged attention module reduces the total parameter count by roughly 6.5%. These adaptations enable the encoder–decoder dual-tower structure without an appreciable increase in total parameter count compared to Gemma 3.

A plausible implication is that the parameter-neutral cost of encoder–decoder adaptation, combined with enhanced long-context and multimodal performance, makes such architectures highly competitive for resource-constrained or instruction-following LLM deployments.

6. Significance and Research Directions

T5Gemma 2 demonstrates that encoder–decoder LLMs can be efficiently synthesized from decoder-only checkpoints by employing parameter sharing and architectural unification, while extending to vision and long-context tasks with minimal overhead (Zhang et al., 16 Dec 2025). Its empirical advantages on pretraining and post-training metrics, especially in long-context and multimodal benchmarks, reinforce the enduring strengths of the encoder–decoder paradigm within the current landscape of large open LLMs.

Possible further research includes exploring more efficient unified attention structures, deeper multimodal pathways, and specialized architectures for extreme context lengths or retrieval augmentation. The model suite’s availability across parameter scales enables future comparative study of such techniques in open settings.
