
Contrastive Captioners (CoCa): Unified Multimodal Models

Updated 21 December 2025
  • The paper introduces a unified Transformer-based model that combines CLIP-style contrastive learning with autoregressive captioning to achieve state-of-the-art multimodal performance.
  • It employs a dual objective loss that integrates the InfoNCE contrastive loss with an autoregressive captioning loss, efficiently computed in a single forward pass.
  • The model's design enables rapid adaptation to diverse tasks such as visual recognition, retrieval, and medical imaging with minimal computational overhead.

Contrastive Captioners (CoCa) are a class of large-scale multimodal foundation models that unify global contrastive representation learning and autoregressive image-to-text generation in a single Transformer-based framework. By integrating the strengths of CLIP-style contrastive learning with encoder-decoder captioning, CoCa-based models establish powerful and transferable representations across vision and language modalities (Yu et al., 2022). They have demonstrated strong performance on a broad range of benchmarks spanning visual recognition, retrieval, captioning, and multimodal reasoning, as well as medical and video domains.

1. Model Architecture and Representation Learning

CoCa adopts a minimal design that unifies the contrastive and generative objectives within a single Transformer backbone:

  • Image Encoder: A Vision Transformer (ViT) splits the input image into patches, projects each patch to a $d$-dimensional embedding, adds positional encodings, and processes the sequence with $L_\text{img}$ self-attention layers to yield $P$ output tokens (Yu et al., 2022).
  • Text Encoder (“Unimodal Decoder”): CoCa’s decoder is split: the first $L_\text{uni}$ layers process the text with causal (unimodal) self-attention, producing both per-token hidden states and a [CLS] token for the global text embedding; these layers use no cross-attention.
  • Multimodal Decoder: The remaining $L_\text{multi}$ layers employ both causal self-attention and cross-attention over the image encoder outputs, fusing the two modalities. The multimodal decoder generates captions autoregressively.
  • Attentional Poolers: For contrastive alignment, a 1-query attentional pooler operates over image tokens; for captioning, a 256-query pooler extracts fine-grained visual tokens as cross-attention keys for the decoder.
  • Computation Sharing: By splitting the decoder, CoCa efficiently derives both unimodal (contrastive) and multimodal (captioning) representations from a single forward pass.

These architectural choices allow CoCa to project images and text into a shared embedding space for global alignment, and also to produce fused representations for autoregressive generation and downstream multimodal tasks (Yu et al., 2022).
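The split-decoder data flow can be made concrete with a short code sketch. The following is a minimal, illustrative PyTorch rendering of the forward pass, using standard nn.Transformer building blocks in place of the paper's actual layers; the class names, layer counts, and dimensions here are placeholders, not the published implementation.

```python
# Minimal sketch of the CoCa split-decoder forward pass (illustrative only).
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """A set of learned queries attends over the image tokens."""
    def __init__(self, dim, n_queries, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, image_tokens):                         # (B, P, dim)
        q = self.queries.expand(image_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, image_tokens, image_tokens)
        return pooled                                         # (B, n_queries, dim)

class CoCaSketch(nn.Module):
    def __init__(self, dim=512, n_heads=8, l_uni=2, l_multi=2, vocab_size=64_000):
        super().__init__()
        # Unimodal text decoder: causal self-attention only, no cross-attention.
        self.uni_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True), l_uni)
        # Multimodal decoder: causal self-attention plus cross-attention to image keys.
        self.multi_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, n_heads, batch_first=True), l_multi)
        self.contrastive_pool = AttentionalPooler(dim, n_queries=1)
        self.caption_pool = AttentionalPooler(dim, n_queries=256)
        self.to_vocab = nn.Linear(dim, vocab_size)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, P, dim) ViT outputs; text_tokens: (B, T, dim) embeddings,
        # with a learnable [CLS] assumed to be appended as the final text position.
        causal = nn.Transformer.generate_square_subsequent_mask(text_tokens.size(1))

        uni = self.uni_decoder(text_tokens, mask=causal)      # unimodal text states
        text_embed = uni[:, -1]                               # [CLS] -> contrastive text side
        img_embed = self.contrastive_pool(image_tokens).squeeze(1)  # 1-query pooling

        visual_keys = self.caption_pool(image_tokens)         # 256 tokens for cross-attention
        fused = self.multi_decoder(uni, visual_keys, tgt_mask=causal)
        caption_logits = self.to_vocab(fused)                 # next-token predictions

        return img_embed, text_embed, caption_logits
```

Splitting the decoder in this way is what allows a single forward pass to feed both objectives: the unimodal text states supply the contrastive text embedding, and the same states, fused with pooled image tokens, supply the captioning logits.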

2. Training Objectives and Loss Functions

CoCa employs two principal losses, computed jointly for each image–text pair in a batch:

  • Global Image–Text Contrastive Loss (InfoNCE):

$$\mathcal{L}_\text{Con} = -\frac{1}{N} \sum_{i=1}^N \left[ \log \frac{\exp(x_i \cdot y_i / \tau)}{\sum_j \exp(x_i \cdot y_j / \tau)} + \log \frac{\exp(y_i \cdot x_i / \tau)}{\sum_j \exp(y_i \cdot x_j / \tau)} \right]$$

where $x_i$ and $y_i$ are the normalized image and text embeddings, and $\tau$ is a learnable temperature.

  • Autoregressive Captioning Loss:

$$\mathcal{L}_\text{Cap} = -\sum_{t=1}^{T} \log P(y_t | y_{<t}, I)$$

maximizing the likelihood of the ground-truth token sequence given the image.

  • Joint Objective:

$$\mathcal{L}_\text{CoCa} = \lambda_\text{Con} \cdot \mathcal{L}_\text{Con} + \lambda_\text{Cap} \cdot \mathcal{L}_\text{Cap}$$

with standard settings $\lambda_\text{Con} = 1$, $\lambda_\text{Cap} = 2$ during pretraining (Yu et al., 2022).

This joint optimization enforces both semantic alignment of image–text pairs and fluency/faithfulness in caption generation. Both losses are efficiently computed in a shared graph.
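As a concrete illustration, both terms and their weighted sum can be written in a few lines of PyTorch. This is a minimal sketch assuming pooled, batch-aligned embeddings and teacher-forced decoder logits; the fixed temperature stands in for the learnable $\tau$, and all tensor and function names are ours.

```python
# Minimal PyTorch sketch of the joint CoCa objective (illustrative names).
import torch
import torch.nn.functional as F

def contrastive_loss(img, txt, temperature=0.07):
    # img, txt: (N, d) pooled embeddings of N aligned image-text pairs.
    img = F.normalize(img, dim=-1)                        # x_i
    txt = F.normalize(txt, dim=-1)                        # y_i
    logits = img @ txt.t() / temperature                  # (N, N) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Image-to-text plus text-to-image cross-entropy (the two log terms above).
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)

def captioning_loss(caption_logits, token_ids, pad_id=0):
    # caption_logits: (B, T, V) decoder outputs; token_ids: (B, T) ground-truth caption.
    pred = caption_logits[:, :-1].reshape(-1, caption_logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)                 # predict token t from tokens < t
    return F.cross_entropy(pred, target, ignore_index=pad_id)

def coca_loss(img_embed, text_embed, caption_logits, token_ids,
              lambda_con=1.0, lambda_cap=2.0):
    # Weighted sum of the two terms, computed from a single forward pass.
    return (lambda_con * contrastive_loss(img_embed, text_embed) +
            lambda_cap * captioning_loss(caption_logits, token_ids))
```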

3. Pretraining Protocol and Foundation Model Transfer

CoCa is pretrained from scratch on billions of image–text pairs from large-scale datasets such as JFT-3B (3B labeled images converted to textual prompts) and ALIGN (~1.8B noisy web alt-text pairs). Each batch mixes samples from both distributions, and the captioning vocabulary is built with SentencePiece (64K tokens) to accommodate open-vocabulary settings (Yu et al., 2022).
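For illustration, a 64K-token SentencePiece vocabulary of the kind described above can be built with the sentencepiece package; the corpus path and every option other than the vocabulary size below are placeholders rather than the paper's configuration.

```python
# Hypothetical SentencePiece setup with a 64K vocabulary (paths are placeholders).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="captions.txt",           # one caption per line (placeholder corpus)
    model_prefix="coca_tokenizer",  # writes coca_tokenizer.model / .vocab
    vocab_size=64_000,
)

sp = spm.SentencePieceProcessor(model_file="coca_tokenizer.model")
token_ids = sp.encode("two dogs playing in the snow", out_type=int)
```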

  • Optimizer: Adafactor with decoupled weight decay (0.01), $\beta_1 = 0.9$, $\beta_2 = 0.999$.
  • LR Schedule: Warmup over the first 2% of 500k steps to a peak of $8 \times 10^{-4}$, followed by linear decay (sketched in code after this list).
  • Batch Size: 65,536 samples.
  • Hardware: 2,048 TPU-v4 chips, 5 days for pretraining (for the giant variant).
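The warmup-then-linear-decay schedule listed above can be sketched as a plain function of the training step; this is a generic reconstruction of the stated settings, not the authors' implementation.

```python
# Illustrative learning-rate schedule: linear warmup over 2% of 500k steps to a
# peak of 8e-4, then linear decay to zero (a reconstruction, not the paper's code).
def learning_rate(step, total_steps=500_000, peak_lr=8e-4, warmup_frac=0.02):
    warmup_steps = int(total_steps * warmup_frac)        # 10,000 steps
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)     # linear warmup
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / remaining)  # linear decay
```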

Once pretrained, CoCa’s representations can be adapted to downstream tasks either zero-shot, via frozen-feature evaluation with a learned attentional head, or with full (or parameter-efficient) finetuning (Yu et al., 2022, Narasinghe et al., 14 Dec 2025).

4. Extensions, Variants, and Practical Adaptations

Several extensions and variants have been proposed to adapt CoCa for specialized tasks, architectural improvements, and transfer to domains with data constraints:

  • SyCoCa: Augments CoCa with Text-Guided Masked Image Modeling (TG-MIM), adding a bidirectional local image–text pretext task. Attentive masking selects the patches most and least related to the caption for TG-MIM and captioning, respectively. The full SyCoCa objective is:

$$\mathcal{L}_\text{total} = \mathcal{L}_\text{ITC} + \lambda_{IC}\mathcal{L}_\text{IC} + \lambda_{TM}\mathcal{L}_\text{TM}$$

TG-MIM enhances fine-grained alignment, yielding improvements of +5–9% on downstream tasks relative to CoCa (Ma et al., 4 Jan 2024).

  • GRR-CoCa: Incorporates LLM-inspired improvements—Gaussian Error Gated Linear Units (GEGLU), Root Mean Squared Normalization (RMSNorm), and Rotary Positional Embedding (RoPE)—into the ViT encoder and text decoders. GRR-CoCa achieves up to 27.25% reduction in contrastive loss, 7.15% in CoCa loss, and 5%+ gains on downstream fine-tuning relative to baseline CoCa (Patock et al., 24 Jul 2025).
  • Few-Shot Adaptation: CoCa supports hybrid prototype classifiers (visual / text / fusion) and LoRA-based PEFT for few-shot image classification. Empirically, low-rank adaptation combined with strong augmentation and a hybrid cross-entropy + supervised contrastive (SupCon) loss yields state-of-the-art robustness in data-scarce regimes. Augmentation must be tuned carefully: strong augmentation degrades the linear probe but is required to stabilize LoRA (Narasinghe et al., 14 Dec 2025); a minimal LoRA sketch follows this list.
  • Domain Specialization (CoCa-CXR): In medical imaging, CoCa-CXR extends the architecture with explicit temporal pipelines and a regional cross-attention module to align differences in longitudinal chest X-rays and their corresponding comparative reports, outperforming prior state-of-the-art by up to 4.8% on progression detection and generating medically accurate reports (Chen et al., 27 Feb 2025).
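For reference, the kind of LoRA-style adapter used for parameter-efficient few-shot adaptation can be sketched as a thin wrapper around a frozen linear projection; the rank, scaling, and choice of which projections to wrap are illustrative assumptions here, not the configuration reported in the cited work.

```python
# Minimal LoRA-style adapter around a frozen nn.Linear (illustrative defaults).
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained projection plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # keep pretrained weights frozen
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)          # adapter starts as an exact no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen projection plus the scaled low-rank correction.
        return self.base(x) + self.scale * self.up(self.down(x))
```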

5. Empirical Results and Benchmarks

CoCa and its variants have established strong performance across major benchmarks:

| Task/Protocol | CoCa Baseline Performance | Best Variant / Advance | Reference |
|---|---|---|---|
| ImageNet zero-shot (top-1) | 86.3% | 86.3% (CoCa-giant) | (Yu et al., 2022) |
| Frozen-feature ImageNet (top-1) | 90.6% | 90.6% | (Yu et al., 2022) |
| Finetuned ImageNet (top-1) | 91.0% (SOTA) | 91.0% | (Yu et al., 2022) |
| MSCOCO Captioning (CIDEr) | 143.6 | 149.9 (+6.3%; SyCoCa) | (Yu et al., 2022; Ma et al., 4 Jan 2024) |
| Flickr30K Zero-shot Retrieval (R@1, img↔txt) | 37.5 / 28.7 | 42.6 / 32.4 (+5.1 / +3.7%; SyCoCa) | (Ma et al., 4 Jan 2024) |
| VQA v2 Accuracy | 82.3% | 91.4% (+9.1%; SyCoCa) | (Yu et al., 2022; Ma et al., 4 Jan 2024) |
| CXR Progression Detection | — | 65.0% (CoCa-CXR; +4.8% over previous SOTA) | (Chen et al., 27 Feb 2025) |
| Pretraining CoCa loss (CC12M) | 3.2864 | 3.0516 (–7.15%; GRR-CoCa) | (Patock et al., 24 Jul 2025) |

  • CoCa’s joint objective yields state-of-the-art or near-SOTA across visual recognition, cross-modal retrieval, captioning, and multimodal reasoning.
  • SyCoCa’s attentive masking and TG-MIM module improve local grounding and fine-grained retrieval/classification, with gains as high as +12.5% in retrieval when correctly paired.
  • GRR-CoCa demonstrates that modern LLM mechanisms reliably yield substantial improvements with no parameter overhead.
  • In few-shot learning, hybrid prototyping is optimal under extreme scarcity, while LoRA with hybrid loss and strong augmentation is preferred for moderately-sized support sets (Narasinghe et al., 14 Dec 2025).

6. Design Insights, Limitations, and Future Directions

The CoCa framework demonstrates that a single transformer network with multimodal decoders, joint contrastive + autoregressive objective, and task-specific poolers can subsume the functionalities of dual-encoder CLIP, encoder-decoder captioners, and vision-language understanding models. This consolidation brings several advantages:

  • Unified training: Both contrastive and generative skills emerge from a single training run and model graph, with minimal computational overhead.
  • Multi-protocol adaptation: CoCa supports zero-shot transfer, frozen-feature evaluation, and feature-finetuning with or without PEFT.
  • Lightweight adaptation: Task-specific poolers and parameter-efficient adaptation (e.g., LoRA) allow rapid deployment in specialized or low-resource settings.

However, certain limitations persist:

  • Pretraining Data Scale: CoCa depends on extremely large and diverse datasets, raising scaling, distribution, and bias concerns.
  • Computational Cost: Pretraining is resource-intensive, requiring days of training on clusters of thousands of TPU or GPU accelerators for SOTA variants.
  • Real-world Robustness: Distribution shift and robustness to domain corruptions require deeper exploration.

Future directions include scaling to additional modalities (audio, video), leveraging domain annotation pipelines (as in CoCa-CXR for medical tasks), integrating more advanced LLM mechanisms, and exploring data-efficient or distillation-based scaling strategies (Yu et al., 2022, Ma et al., 4 Jan 2024, Patock et al., 24 Jul 2025, Narasinghe et al., 14 Dec 2025, Chen et al., 27 Feb 2025).
