GRR-CoCa: Enhanced Multimodal Vision-Language Model

Updated 3 July 2026

GRR-CoCa is a multimodal vision-language model that integrates GEGLU activations, RMSNorm, and rotary positional embeddings to enhance both contrastive and generative tasks.
It extends the foundational CoCa architecture by incorporating modern LLM-inspired mechanisms into both its vision transformer encoder and text decoders, improving convergence and stability.
Empirical evaluations show significant gains, with up to a 7% reduction in CoCa loss and measurable improvements in perplexity and contrastive loss on multilingual image-text benchmarks.

GRR-CoCa is a multimodal vision-language architecture that systematically incorporates contemporary LLM-inspired mechanisms into both its visual and textual processing stacks. Developed as an improved variant of the Contrastive Captioner (CoCa) foundation model, it targets better performance and generalization on contrastive and generative tasks by adopting architectural components that have proven effective in state-of-the-art LLMs. By modifying both the vision transformer (ViT) encoder and the text decoders with GEGLU activation, RMS normalization, and rotary positional embeddings, GRR-CoCa achieves significant improvements on large-scale multilingual image-text datasets and downstream fine-tuning tasks (Patock et al., 24 Jul 2025).

1. Architectural Innovations of GRR-CoCa

GRR-CoCa preserves CoCa’s core encoder-decoder vision-language paradigm while introducing three pivotal modifications across all model submodules:

Gaussian Error Gated Linear Units (GEGLU): All feed-forward layers in both the ViT encoder and text decoders use GEGLU activation, in which the gate is controlled by a Gaussian error linear unit (GELU). This selective gating enhances model expressiveness and regularizes information flow, empirically reducing perplexity and accelerating convergence compared to conventional GLU or GELU-only activations.
Root Mean Squared Normalization (RMSNorm): RMSNorm replaces LayerNorm across the architecture. This normalization scheme omits mean subtraction, scaling activations solely by their root mean square, resulting in fewer parameters and reduced computation while preserving or improving training stability in deep pre-norm stacks.
Rotary Positional Embedding (RoPE): RoPE is applied to inject relative positional information into all transformer blocks, supplanting learned absolute position embeddings. RoPE allows each attention head to incorporate relative rotations at each position, enabling preservation of positional signals throughout deep contexts, a property particularly beneficial in ViT encoders.

The Baseline CoCa model applies these LLM-inspired upgrades only in its text decoders; GRR-CoCa extends them to the visual (ViT) encoder as well.

2. Model Structure and Mathematical Formulations

GRR-CoCa’s architecture consists of:

ViT Encoder: Images are split into fixed-size patches, linearly projected, and prepended with a class token. Multiple transformer blocks with self-attention and MLPs (using GEGLU and RMSNorm) process the resultant embedding sequence. RoPE is applied at every attention step.
Textual Decoders: There are two branches: a unimodal autoregressive decoder (for caption generation) and a multimodal decoder that leverages cross-attention over image embeddings for joint modeling.
Pooling Heads: Two forms of self-attention-based pooling project embeddings either to a single vector (for contrastive loss) or to a sequence for generative decoding.
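The encoder input pipeline and the two pooling heads described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 32x32 image, the patch size of 16, the 8-dimensional embedding, and the single-head, projection-free pooling are all hypothetical simplifications, and the transformer blocks between embedding and pooling are omitted.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

def embed(image, W_proj, cls_token, patch):
    """Linearly project patches and prepend a class token."""
    tokens = patchify(image, patch) @ W_proj    # (num_patches, d)
    return np.concatenate([cls_token[None, :], tokens], axis=0)

def attention_pool(tokens, queries):
    """Pool a token sequence with n_q query vectors (single head, no projections)."""
    d = tokens.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)    # (n_q, n_tokens)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens                     # (n_q, d)

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))            # illustrative size only
patches = patchify(image, 16)                   # 4 patches of 16*16*3 values
W_proj = rng.normal(size=(16 * 16 * 3, 8)) * 0.02
cls_token = np.zeros(8)
seq = embed(image, W_proj, cls_token, 16)       # class token + 4 patch tokens

# One query gives a single vector (contrastive head);
# several queries give a sequence (input to the multimodal decoder).
pooled_one = attention_pool(seq, rng.normal(size=(1, 8)))
pooled_seq = attention_pool(seq, rng.normal(size=(3, 8)))
```

Using one query versus several is what distinguishes the two pooling forms in this sketch; the number of queries used in the actual model is not stated here.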

Key mathematical elements:

GEGLU Feed-forward:

$\mathrm{GEGLU}(x) = (x W_a + b_a)\odot \mathrm{GELU}(x W_b + b_b),\;\;\; \mathrm{FFN}(x) = (\mathrm{GEGLU}(x)) W_o + b_o$
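As a minimal sketch of the formula above, the GEGLU feed-forward block can be written in NumPy as follows. The layer sizes and random weights are illustrative assumptions, and the exact (erf-based) form of GELU is assumed since the text does not say whether an approximation is used.

```python
import math
import numpy as np

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def geglu_ffn(x, W_a, b_a, W_b, b_b, W_o, b_o):
    """FFN(x) = ((x W_a + b_a) * GELU(x W_b + b_b)) W_o + b_o."""
    value = x @ W_a + b_a            # linear "value" branch
    gate = gelu(x @ W_b + b_b)       # GELU-activated gate branch
    return (value * gate) @ W_o + b_o

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16            # illustrative sizes, not the model's
x = rng.normal(size=(4, d_model))    # 4 tokens
W_a = rng.normal(size=(d_model, d_hidden))
W_b = rng.normal(size=(d_model, d_hidden))
W_o = rng.normal(size=(d_hidden, d_model))
b_a, b_b, b_o = np.zeros(d_hidden), np.zeros(d_hidden), np.zeros(d_model)

out = geglu_ffn(x, W_a, b_a, W_b, b_b, W_o, b_o)
```

Note that the block needs two input projections ($W_a$ and $W_b$) where a plain GELU feed-forward layer needs one.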

RMSNorm:

$\mathrm{RMSNorm}(x) = \gamma \frac{x}{\sqrt{\tfrac{1}{d}\sum_i x_i^2 + \epsilon}}$
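A short NumPy sketch of this formula, normalizing over the last (feature) axis. The default `eps` value is an assumption; the text does not specify it.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Scale x by the reciprocal of its root mean square; no mean subtraction."""
    mean_square = np.mean(x * x, axis=-1, keepdims=True)
    return gamma * x / np.sqrt(mean_square + eps)

x = np.array([[3.0, 4.0, 0.0, 0.0]])   # mean square = 25 / 4, so RMS = 2.5
gamma = np.ones(4)                      # learned gain, initialised to 1 here
y = rms_norm(x, gamma)                  # approximately [[1.2, 1.6, 0.0, 0.0]]
```

Unlike LayerNorm, there is no learned bias and no centering step, so the only parameter per feature is the gain $\gamma$.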

Rotary Positional Embeddings:

$\left[q_{2i}, q_{2i+1}\right] \mapsto \begin{bmatrix} \cos(p\theta_i) & -\sin(p\theta_i) \\ \sin(p\theta_i) & \cos(p\theta_i) \end{bmatrix} \begin{bmatrix} q_{2i} \\ q_{2i+1} \end{bmatrix}$
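The rotation can be sketched in NumPy as below. Two things are assumptions not stated in the text: the frequency schedule $\theta_i = 10000^{-2i/d}$ (the common choice from the original RoPE formulation) and the use of a one-dimensional position index $p$, including for image patches.

```python
import numpy as np

def rope(q, positions, base=10000.0):
    """Rotate each pair (q[2i], q[2i+1]) by the angle p * theta_i.

    q: (seq, d) array with even d; positions: (seq,) array of positions p.
    """
    q = np.asarray(q, dtype=float)
    positions = np.asarray(positions, dtype=float)
    d = q.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # (d/2,), assumed schedule
    angles = positions[:, None] * theta[None, :]     # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    q_even, q_odd = q[:, 0::2], q[:, 1::2]
    out = np.empty_like(q)
    out[:, 0::2] = q_even * cos - q_odd * sin
    out[:, 1::2] = q_even * sin + q_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Attention scores depend only on the position offset (here 2 in both cases).
score_a = (rope(q, [5]) @ rope(k, [3]).T).item()
score_b = (rope(q, [12]) @ rope(k, [10]).T).item()
```

Because each pair is rotated rather than shifted, vector norms are preserved and the query-key dot product depends only on the relative offset between positions, which is the sense in which RoPE injects relative positional information.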

Losses:

- Contrastive:

$\mathcal{L}_{\rm Con} = -\frac{1}{N}\sum_{i=1}^N \left[\log\frac{\exp(x_i^\top y_i/\sigma)}{\sum_{j=1}^N\exp(x_i^\top y_j/\sigma)} + \log\frac{\exp(y_i^\top x_i/\sigma)}{\sum_{j=1}^N\exp(y_i^\top x_j/\sigma)}\right]$

- Captioning:

$\mathcal{L}_{\rm Cap} = -\sum_{t=1}^T y_t\log\mathrm{softmax}(\hat y_t)\,\mathbf{1}\{y_t\neq\text{ignore}\}$

- Combined:

$\mathcal{L}_{\rm CoCa} = \lambda_{\rm Con}\,\mathcal{L}_{\rm Con} + \lambda_{\rm Cap}\,\mathcal{L}_{\rm Cap}$
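The three losses can be sketched in NumPy as follows. The function names, the `ignore_index` value of -100 (borrowed from a common framework convention), and the toy inputs are assumptions for illustration; the captioning loss is summed over tokens exactly as written in the formula, without averaging.

```python
import numpy as np

def log_softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract max for stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def contrastive_loss(x, y, sigma):
    """Symmetric image-text loss; x, y are (N, d) embeddings, sigma a temperature."""
    logits = x @ y.T / sigma                     # entry (i, j) = x_i . y_j / sigma
    image_to_text = np.diag(log_softmax(logits, axis=1))
    text_to_image = np.diag(log_softmax(logits, axis=0))
    return -np.mean(image_to_text + text_to_image)

def captioning_loss(logits, targets, ignore_index=-100):
    """Summed token cross-entropy; logits (T, V), targets (T,) integer ids."""
    keep = targets != ignore_index               # mask out ignored positions
    logp = log_softmax(logits[keep], axis=-1)
    return -logp[np.arange(int(keep.sum())), targets[keep]].sum()

def coca_loss(l_con, l_cap, lam_con=2.0, lam_cap=1.0):
    """Weighted sum of the two objectives (defaults are the pretraining weights)."""
    return lam_con * l_con + lam_cap * l_cap

# Perfectly aligned, orthogonal pairs with a small temperature: loss near 0.
aligned = contrastive_loss(np.eye(3), np.eye(3), sigma=0.01)
# Uninformative embeddings: both directions are uniform, loss = 2 * log(N).
uniform = contrastive_loss(np.zeros((4, 2)), np.zeros((4, 2)), sigma=1.0)
# Uniform logits over 5 tokens, one of three positions ignored: 2 * log(5).
cap = captioning_loss(np.zeros((3, 5)), np.array([1, -100, 2]))
total = coca_loss(uniform, cap)
```

The two limiting cases give a quick sanity check on any implementation: the contrastive term ranges from about 0 for perfectly separated pairs to $2\log N$ for uninformative embeddings.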

3. Training Protocols and Evaluation Benchmarks

Pretraining is conducted on the Conceptual Captions 12M (CC12M) dataset (10.45M train pairs, 0.55M validation), with hyperparameters fixed across models for rigorous ablation:

12 transformer layers per component, 768-dimensional embedding, 12 heads.
Batch size 768 with gradient accumulation, AdamW optimizer, and a learning rate schedule featuring linear warm-up followed by cosine annealing with warm restarts.
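One way such a schedule could look is sketched below: linear warm-up to a peak rate, then cosine annealing that restarts at the peak after each cycle. Every numeric value here (peak rate, warm-up length, cycle length, minimum rate) is a hypothetical placeholder, since the text gives only the shape of the schedule, and fixed-length cycles are assumed.

```python
import math

def lr_at(step, base_lr=1e-4, min_lr=0.0, warmup_steps=1000, cycle_steps=10000):
    """Learning rate at a given optimizer step (all defaults are placeholders)."""
    if step < warmup_steps:
        # Linear warm-up: reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Position inside the current cosine cycle; wraps to 0 at each restart.
    t = (step - warmup_steps) % cycle_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cycle_steps))
    return min_lr + (base_lr - min_lr) * cosine

schedule = [lr_at(s) for s in (0, 999, 1000, 6000, 10999, 11000)]
```

With gradient accumulation, `step` would count optimizer updates rather than micro-batches, so the schedule is unaffected by how the batch of 768 is split.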

Loss weights in pretraining are $\lambda_{\rm Con}=2,\; \lambda_{\rm Cap}=1$, prioritizing contrastive alignment in early stages.

Fine-tuning is performed on MSCOCO, ROCO v2, and Flickr30K, which together cover diverse image-text tasks. Hyperparameters are minimally adapted for each dataset; the loss weights are swapped to $\lambda_{\rm Con}=1,\; \lambda_{\rm Cap}=2$ to favor captioning, and regularization and early stopping are used to maintain sample efficiency.

Evaluation metrics:

Contrastive loss (lower is better) reflects image-text alignment.
Perplexity quantifies generative model quality.
CoCa loss combines both objectives, sensitive to dual-task optimization.
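For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below assumes that standard definition; the text does not state how the reported values were averaged, and the function name is illustrative.

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns probability 1/4 to every correct token has perplexity 4.
ppl_uniform4 = perplexity([math.log(4.0)] * 3)
# A model that is always certain and correct has perplexity 1.
ppl_perfect = perplexity([0.0, 0.0])
```

Under this definition, a perplexity of $k$ means the model is on average as uncertain as a uniform choice among $k$ tokens.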

4. Empirical Performance and Analysis

GRR-CoCa achieves substantial gains over Baseline CoCa, as summarized below (Patock et al., 24 Jul 2025):

Metric (Pretraining, CC12M Val)	Baseline CoCa	GRR-CoCa	Relative Improvement
CoCa Loss	3.2864	3.0516	–7.15%
Perplexity	12.9976	12.5151	–3.71%
Contrastive Loss	0.3610	0.2626	–27.25%
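The last column is the relative change with respect to the baseline, $100 \cdot (\text{GRR} - \text{Baseline}) / \text{Baseline}$. The short check below recomputes it from the rounded table entries; the results agree with the reported column to within about 0.01 percentage points, a residual that presumably comes from rounding in the published values.

```python
def relative_change_pct(baseline, new):
    """Signed percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline

# (baseline, GRR-CoCa, reported relative improvement in %), from the table.
rows = {
    "CoCa Loss": (3.2864, 3.0516, -7.15),
    "Perplexity": (12.9976, 12.5151, -3.71),
    "Contrastive Loss": (0.3610, 0.2626, -27.25),
}

recomputed = {name: relative_change_pct(b, g) for name, (b, g, _) in rows.items()}
max_gap = max(abs(recomputed[name] - rows[name][2]) for name in rows)
```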

During fine-tuning, average relative improvements across MSCOCO, ROCO v2, and Flickr30K are –13.66% (contrastive loss), –5.18% (perplexity), and –5.55% (CoCa loss), demonstrating generalizability across distinct domains.

Component-level analysis indicates:

GEGLU enables finer gating of expressive features, rapidly reducing generative perplexity and improving cross-modal alignment.
RMSNorm yields increased stability in deep transformer stacks with marginal parameter savings.
RoPE maintains spatial and sequential ordering through all ViT layers, significantly improving image embedding quality and downstream captioning.

The parameter increase of GRR-CoCa over Baseline CoCa is negligible (∼0.17%), indicating that the performance gains are attributable to architectural enhancements rather than mere scaling.

5. Context, Impact, and Architectural Parity

GRR-CoCa responds to observations that, despite functional similarities, multimodal foundation models had not universally

References

Patock et al. (24 Jul 2025). GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures.
