GRR-CoCa: Enhanced Multimodal Vision-Language Model
- GRR-CoCa is a multimodal vision-language model that integrates GEGLU activations, RMSNorm, and rotary positional embeddings to enhance both contrastive and generative tasks.
- It extends the foundational CoCa architecture by incorporating modern LLM-inspired mechanisms into both its vision transformer encoder and text decoders, improving convergence and stability.
- Empirical evaluations show significant gains, with up to a 7% reduction in CoCa loss and measurable improvements in perplexity and contrastive loss on multilingual image-text benchmarks.
GRR-CoCa is a multimodal vision-language model architecture that systematically incorporates contemporary LLM-inspired mechanisms into both its visual and textual processing stacks. Developed as an improved variant of the Contrastive Captioner (CoCa) foundation model, GRR-CoCa targets better performance and generalization in contrastive and generative tasks by adopting architectural changes that have proven effective in state-of-the-art LLMs. By modifying both the vision transformer (ViT) encoder and the text decoders with GEGLU activation, RMS normalization, and rotary positional embeddings, GRR-CoCa achieves significant improvements on large-scale multilingual image-text datasets and downstream fine-tuning tasks (Patock et al., 24 Jul 2025).
1. Architectural Innovations of GRR-CoCa
GRR-CoCa preserves CoCa’s core encoder-decoder vision-language paradigm while introducing three pivotal modifications across all model submodules:
- Gaussian Error Gated Linear Units (GEGLU): All feed-forward layers in both the ViT encoder and text decoders use GEGLU activation, in which the gate is controlled by a Gaussian error linear unit (GELU). This selective gating enhances model expressiveness and regularizes information flow, empirically reducing perplexity and accelerating convergence compared to conventional GLU or GELU-only activations.
- Root Mean Squared Normalization (RMSNorm): RMSNorm replaces LayerNorm across the architecture. This normalization scheme omits mean subtraction, scaling activations solely by their root mean square, resulting in fewer parameters and reduced computation while preserving or improving training stability in deep pre-norm stacks.
- Rotary Positional Embedding (RoPE): RoPE is applied to inject relative positional information into all transformer blocks, supplanting learned absolute position embeddings. RoPE allows each attention head to incorporate relative rotations at each position, enabling preservation of positional signals throughout deep contexts, a property particularly beneficial in ViT encoders.
Baseline CoCa applies these LLM-inspired upgrades only in its text decoders; GRR-CoCa extends them to the visual (ViT) encoder as well.
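The GEGLU and RMSNorm modifications described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the bias-free projections, and the toy dimensions are all hypothetical choices made for readability.

```python
import math
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))


def geglu_ffn(x, w_gate, w_value, w_out):
    """GEGLU feed-forward: (GELU(x W_gate) * (x W_value)) W_out.

    Biases are omitted here (an assumption of this sketch).
    """
    return (gelu(x @ w_gate) * (x @ w_value)) @ w_out


def rms_norm(x, gain, eps=1e-6):
    """Divide by the root mean square over the last axis; no mean subtraction."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain


# Toy sizes for illustration only (not the model's real dimensions).
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16
x = rng.normal(size=(4, d_model))

h = rms_norm(x, gain=np.ones(d_model))  # pre-norm step
y = geglu_ffn(
    h,
    rng.normal(size=(d_model, d_hidden)),
    rng.normal(size=(d_model, d_hidden)),
    rng.normal(size=(d_hidden, d_model)),
)
```

Compared with LayerNorm, `rms_norm` has a gain vector but no bias and no mean computation, which is where the parameter and compute savings come from.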
2. Model Structure and Mathematical Formulations
GRR-CoCa’s architecture consists of:
- ViT Encoder: Images are split into fixed-size patches, linearly projected, and prepended with a class token. Multiple transformer blocks with self-attention and MLPs (using GEGLU and RMSNorm) process the resultant embedding sequence. RoPE is applied at every attention step.
- Textual Decoders: There are two branches: a unimodal autoregressive decoder (for caption generation) and a multimodal decoder that leverages cross-attention over image embeddings for joint modeling.
- Pooling Heads: Two forms of self-attention-based pooling project embeddings either to a single vector (for contrastive loss) or to a sequence for generative decoding.
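The rotary embedding applied at each attention step can be illustrated as follows. This is a sketch of the common RoPE formulation, assuming a one-dimensional position index and a frequency base of 10000; the source does not state how patch positions are indexed in the ViT encoder, so both choices are assumptions, and the function names are hypothetical.

```python
import numpy as np


def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive feature pairs of x by position-dependent angles.

    x: (seq_len, d) with even d; positions: (seq_len,).
    """
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # theta_i, shape (d/2,)
    angles = positions[:, None] * inv_freq[None, :]   # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))  # one query vector
k = rng.normal(size=(1, 8))  # one key vector


def score(m: int, n: int) -> float:
    """Attention logit between a query at position m and a key at position n."""
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T).item()


# The logit depends only on the offset m - n, not on absolute positions.
same_offset = np.isclose(score(3, 1), score(10, 8))
```

Because the rotation is re-applied to queries and keys in every block, the relative-position signal does not have to survive from the input layer, which matches the motivation given for using RoPE in deep ViT stacks.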
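Key mathematical elements, written in their standard forms:
- GEGLU Feed-forward: $\mathrm{FFN}_{\mathrm{GEGLU}}(x) = \left(\mathrm{GELU}(xW_g) \odot xW_v\right) W_o$, where $W_g$ and $W_v$ are the gate and value projections and $W_o$ is the output projection (biases omitted).
- RMSNorm: $\mathrm{RMSNorm}(x) = \dfrac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma$, with a learned gain $\gamma$ and no mean subtraction.
- Rotary Positional Embeddings: each feature pair $(x_{2i}, x_{2i+1})$ at position $m$ is rotated by the angle $m\theta_i$, with $\theta_i = 10000^{-2i/d}$, so that the attention score between positions $m$ and $n$ depends only on $m - n$.
- Losses:
  - Contrastive: $\mathcal{L}_{\mathrm{Con}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\exp(v_i^\top u_i/\tau)}{\sum_{j=1}^{N}\exp(v_i^\top u_j/\tau)} + \log\frac{\exp(u_i^\top v_i/\tau)}{\sum_{j=1}^{N}\exp(u_i^\top v_j/\tau)}\right]$, where $v_i$ and $u_i$ are the normalized image and text embeddings of pair $i$ and $\tau$ is the temperature.
  - Captioning: $\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T}\log P_\theta(y_t \mid y_{<t}, I)$, for caption tokens $y_1, \dots, y_T$ and image $I$.
  - Combined: $\mathcal{L}_{\mathrm{CoCa}} = \lambda_{\mathrm{Con}}\,\mathcal{L}_{\mathrm{Con}} + \lambda_{\mathrm{Cap}}\,\mathcal{L}_{\mathrm{Cap}}$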
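The three losses listed above can be sketched in NumPy as follows. This is an illustrative version of the usual CoCa-style objective, not code from the paper: the temperature value, the per-token averaging in the captioning loss, and the weight arguments `w_con` and `w_cap` are assumptions of this sketch.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(img: np.ndarray, txt: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of paired embeddings, shape (N, d) each."""
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature                    # (N, N) similarity matrix
    i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))  # image -> text
    t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))  # text -> image
    return float(i2t + t2i)


def captioning_loss(token_logits: np.ndarray, targets: np.ndarray) -> float:
    """Next-token cross-entropy, averaged over tokens (an assumption; some
    formulations sum instead). token_logits: (T, vocab); targets: (T,)."""
    logp = log_softmax(token_logits, axis=-1)
    return float(-np.mean(logp[np.arange(len(targets)), targets]))


def coca_loss(l_con: float, l_cap: float, w_con: float, w_cap: float) -> float:
    """Weighted sum of the two objectives; the weights are placeholders."""
    return w_con * l_con + w_cap * l_cap


# Correctly paired embeddings score lower than mismatched ones.
emb = np.eye(4)
matched = contrastive_loss(emb, emb)
mismatched = contrastive_loss(emb, emb[::-1])
```

Raising `w_con` relative to `w_cap` pushes training toward image-text alignment, while the reverse favors caption generation.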
3. Training Protocols and Evaluation Benchmarks
Pretraining is conducted on the Conceptual Captions 12M (CC12M) dataset (10.45M train pairs, 0.55M validation), with hyperparameters fixed across models for rigorous ablation:
- 12 transformer layers per component, 768-dimensional embedding, 12 heads.
- Batch size 768 with gradient accumulation, AdamW optimizer, and a learning rate schedule featuring linear warm-up followed by cosine annealing with warm restarts.
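A schedule of the kind described above, linear warm-up followed by cosine annealing with warm restarts, might look like the sketch below. The fixed cycle length, the step counts, and the learning-rate values are hypothetical; the source does not give them.

```python
import math


def lr_at(step: int, base_lr: float, warmup_steps: int,
          cycle_steps: int, min_lr: float = 0.0) -> float:
    """Learning rate at a given optimizer step.

    Phase 1: linear warm-up from base_lr / warmup_steps up to base_lr.
    Phase 2: cosine decay from base_lr to min_lr, restarting every
             cycle_steps steps (fixed-length cycles are an assumption).
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) % cycle_steps  # position within the current cycle
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cycle_steps))
    return min_lr + (base_lr - min_lr) * cosine


# Hypothetical values for illustration only.
schedule = [lr_at(s, base_lr=1e-3, warmup_steps=10, cycle_steps=100) for s in range(250)]
```

In this sketch the rate returns to `base_lr` at steps 10, 110, and 210, which is the "warm restart" behavior.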
The pretraining loss weights prioritize contrastive alignment in early stages.
Fine-tuning is performed on MSCOCO, ROCO v2, and Flickr30K (diverse image-text tasks). Hyperparameters are minimally adapted for each dataset, with the loss weights swapped to favor captioning, and regularization and early stopping are used for sample efficiency.
Evaluation metrics:
- Contrastive loss (lower is better) reflects image-text alignment.
- Perplexity quantifies generative model quality.
- CoCa loss combines both objectives, sensitive to dual-task optimization.
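For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The helper below is a small sketch of that standard definition; the function name is hypothetical, and the source does not spell out the exact averaging it uses.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """exp(mean negative log-likelihood), with NLLs given in nats per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every target token has perplexity 4.
uniform_over_four = perplexity([math.log(4.0)] * 3)
```

Lower perplexity therefore means the decoder assigns higher probability to the reference caption tokens.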
4. Empirical Performance and Analysis
GRR-CoCa achieves substantial improvements over Baseline CoCa, as summarized below (Patock et al., 24 Jul 2025):
| Metric (Pretraining, CC12M Val) | Baseline CoCa | GRR-CoCa | Relative Improvement |
|---|---|---|---|
| CoCa Loss | 3.2864 | 3.0516 | –7.15% |
| Perplexity | 12.9976 | 12.5151 | –3.71% |
| Contrastive Loss | 0.3610 | 0.2626 | –27.25% |
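The relative-improvement column can be recomputed from the two value columns as (GRR-CoCa − Baseline) / Baseline. The short check below uses only the figures in the table; because those figures are already rounded, a recomputed percentage may differ from the reported one in the last digit.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline` (negative = reduction)."""
    return 100.0 * (new - baseline) / baseline


# (Baseline CoCa, GRR-CoCa) values copied from the table above.
reported = {
    "CoCa Loss": (3.2864, 3.0516),
    "Perplexity": (12.9976, 12.5151),
    "Contrastive Loss": (0.3610, 0.2626),
}

for name, (base, grr) in reported.items():
    print(f"{name}: {relative_change(base, grr):+.2f}%")
```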
During fine-tuning, the average relative improvements across MSCOCO, ROCO v2, and Flickr30K are –13.66% (contrastive loss), –5.18% (perplexity), and –5.55% (CoCa loss), demonstrating generalizability across distinct domains.
Component-level analysis indicates:
- GEGLU enables finer gating of expressive features, rapidly reducing generative perplexity and improving cross-modal alignment.
- RMSNorm yields increased stability in deep transformer stacks with marginal parameter savings.
- RoPE maintains spatial and sequential ordering through all ViT layers, significantly improving image embedding quality and downstream captioning.
GRR-CoCa's parameter count is only negligibly larger (∼0.17%) than Baseline CoCa's, indicating that the performance gains are attributable to architectural enhancements rather than mere scaling.
5. Context, Impact, and Architectural Parity
GRR-CoCa responds to observations that, despite functional similarities, multimodal foundation models had not universally