- The paper identifies that compositional failures in VLMs arise from the global cosine similarity inference paradigm rather than from representational limitations.
- It introduces a structure-guided inference protocol that leverages fine-grained region–token alignments to significantly boost performance on compositional benchmarks.
- The study demonstrates that applying local alignment via a lightweight transformer on frozen embeddings improves robustness under distribution shifts without full model retraining.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Introduction
This paper analyzes compositional reasoning failures in dual-encoder vision-language models (VLMs), challenging the assumption that these limitations arise solely from insufficient representational capacity. It provides experimental evidence that the standard inference paradigm, global pooling followed by cosine similarity, is a critical bottleneck for compositional generalization, even in strong pretrained systems. The authors conduct controlled diagnostic studies and introduce a lightweight architecture that learns structured alignment at inference, thereby disentangling representational power from alignment mechanisms within the dual-encoder framework.
Background and Motivation
Dual-encoder VLMs, such as CLIP and its successors, independently encode images and texts into global embeddings via vision and text transformers. At inference, image–text similarity is estimated via global cosine similarity of pooled representations. While highly effective for image–text retrieval and standard zero-shot tasks, this approach disregards local region–token interactions, which are essential for compositional reasoning—distinguishing, e.g., "a black dog and a white cat" from "a black cat and a white dog." Benchmarking on compositionality evaluation datasets routinely exposes systematic failures, with dual-encoder VLMs behaving like bag-of-words models under minimal compositional perturbations. Notably, the paper identifies a structural misalignment between the localized nature of compositional binding (object–attribute–relation tuples) and the VLMs' global matching protocol.
Diagnostic Evaluation of Inference Protocols
The authors design BiSCoR-Ctrl, an out-of-distribution, fully controlled compositional reasoning benchmark derived from CLEVR scenes, focusing exclusively on minimal "Swap" contrasts (object–attribute bindings only)—eliminating language priors and dataset artifacts. This enables a precise evaluation of vision-language compositional binding independent of confounds.
They then introduce a "Structure-Guided Inference" (SGI) protocol over frozen dual-encoder backbones: images are decomposed into region crops, texts into phrases (attributes/objects/relations), and each text segment is matched to the most similar image crop (cosine similarity), aggregating over the maximal correspondences per segment.
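The matching step described above can be sketched on toy data. This is a minimal illustration, not the authors' implementation: the function names, the hand-built 4-dimensional embeddings, and the choice to average the per-segment maxima are all assumptions (the text only says the maximal correspondences are aggregated). Sum pooling stands in for a real encoder's pooled output purely to show how a global score can lose binding information.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sgi_score(segment_embs, crop_embs):
    """For each text segment take its best-matching crop, then average.
    Averaging is an assumption; a sum would rank captions the same way here."""
    best = [max(cosine(s, c) for c in crop_embs) for s in segment_embs]
    return sum(best) / len(best)

def pooled(vectors):
    """Toy global embedding: element-wise sum of the local vectors."""
    return [sum(col) for col in zip(*vectors)]

# Hypothetical feature axes: [black, white, dog, cat]
crops = [[1, 0, 1, 0], [0, 1, 0, 1]]      # image: a black dog, a white cat
caption_a = [[1, 0, 1, 0], [0, 1, 0, 1]]  # "a black dog" / "a white cat"
caption_b = [[1, 0, 0, 1], [0, 1, 1, 0]]  # "a black cat" / "a white dog"

global_a = cosine(pooled(caption_a), pooled(crops))
global_b = cosine(pooled(caption_b), pooled(crops))
local_a = sgi_score(caption_a, crops)
local_b = sgi_score(caption_b, crops)

print(f"global: A={global_a:.2f} B={global_b:.2f}")
print(f"SGI:    A={local_a:.2f} B={local_b:.2f}")
```

In this toy setup the two captions receive the same global score because pooling discards which attribute goes with which object, while the per-segment maximum separates them. Real embeddings are contextualized, so the effect there is a matter of degree rather than an exact tie.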
Figure 2: Example showing that fine-grained text segmenting yields more segments than coarse-grained segmenting, supporting detailed region–token alignment.
Experiments across CLIP, SigLIP 2, and Perception Encoder show that SGI delivers large gains on compositional benchmarks without any modification of the pretrained weights. On BiSCoR-Ctrl, group scores for state-of-the-art VLMs rise from below 10 under global cosine similarity to above 50 under SGI, especially in categories such as color and material. This indicates that failures observed under global pooling reflect deficiencies of the inference protocol rather than a lack of compositional knowledge in the representations themselves.
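The summary does not define the group score. The sketch below assumes the common Winoground-style definition for paired examples: a 2x2 example counts as correct only if each caption prefers its own image and each image prefers its own caption. The function names and the sample similarity values are hypothetical.

```python
def group_correct(s):
    """s[c][i] is the similarity of caption c with image i in a 2x2 example."""
    text_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]   # right caption per image
    image_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]  # right image per caption
    return text_ok and image_ok

def group_score(examples):
    """Percentage of examples that pass both the text and the image check."""
    return 100.0 * sum(group_correct(s) for s in examples) / len(examples)

examples = [
    [[0.9, 0.2], [0.3, 0.8]],    # both directions correct
    [[0.5, 0.5], [0.5, 0.5]],    # ties count as failures
    [[0.9, 0.95], [0.3, 0.99]],  # text check passes, image check fails
    [[0.7, 0.1], [0.2, 0.6]],    # both directions correct
]
print(group_score(examples))
```

Because all four comparisons must hold with strict inequality, a scorer that assigns swapped captions nearly identical similarities scores close to zero under this definition.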
Learning Localized Alignment over Frozen Embeddings
The study further investigates whether alignment, rather than explicit region–segment matching, can be learned directly from frozen, pretrained patch and token representations. The authors introduce a lightweight transformer ("Alignment Transformer") operating on concatenated patch and token sequences, leveraging cross-modal self-attention but keeping VLM encoders fixed.
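One way such a module could look in PyTorch is sketched below. The summary only states that a lightweight transformer applies cross-modal self-attention to the concatenated patch and token sequences while the encoders stay fixed. Everything else here is an assumption made for illustration: the class name, the linear projections to a shared width, the modality embeddings, the learned score token, the scalar output head, and all hyperparameters.

```python
import torch
import torch.nn as nn

class AlignmentTransformer(nn.Module):
    """Hypothetical scorer over frozen patch and token embeddings."""

    def __init__(self, patch_dim, token_dim, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)  # image patches -> shared width
        self.token_proj = nn.Linear(token_dim, d_model)  # text tokens -> shared width
        self.modality = nn.Embedding(2, d_model)         # 0 = image, 1 = text
        self.score_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, patches, tokens):
        # patches: (B, P, patch_dim), tokens: (B, T, token_dim), both produced
        # by frozen encoders (e.g. computed under torch.no_grad()).
        batch = patches.size(0)
        p = self.patch_proj(patches) + self.modality.weight[0]
        t = self.token_proj(tokens) + self.modality.weight[1]
        x = torch.cat([self.score_token.expand(batch, -1, -1), p, t], dim=1)
        h = self.encoder(x)  # self-attention spans both modalities
        return self.head(h[:, 0]).squeeze(-1)  # one matching score per pair

model = AlignmentTransformer(patch_dim=768, token_dim=512).eval()
with torch.no_grad():
    scores = model(torch.randn(2, 49, 768), torch.randn(2, 12, 512))
print(scores.shape)
```

Padding masks for variable-length text are omitted for brevity. Only the parameters of this module would receive gradients during training, which is what keeps the approach cheap relative to end-to-end fine-tuning.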
Comparative ablations include:
- Transformer over global pooled embeddings only (matching increased capacity but lacking locality).
- Full fine-tuning of VLMs end-to-end on retrieval datasets.
- Various compositional training methods (fine-grained alignment supervision, hard negative mining).
The alignment transformer is trained with a contrastive retrieval objective (matching image–text pairs vs. mismatches) on COCO and a hard-negative-augmented dataset (TROHN-Img). Evaluation is performed in-domain (SugarCrepe, BiVLC) and out-of-domain (BiSCoR-Ctrl).
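A contrastive retrieval objective of this kind is often written as a symmetric InfoNCE loss over a batch score matrix. The sketch below assumes that form. The summary does not specify the exact loss, the temperature, or how hard negatives enter the batch, so the function names and the default temperature of 0.07 are illustrative only.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(scores, temperature=0.07):
    """scores[i][j] is the score of image i with text j; matches lie on the diagonal."""
    n = len(scores)
    scaled = [[x / temperature for x in row] for row in scores]
    # Image-to-text: each image should rank its own caption first.
    i2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-image: each caption should rank its own image first.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

uninformative = contrastive_loss([[0.0, 0.0], [0.0, 0.0]])
well_aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
print(uninformative, well_aligned)
```

With uniform scores the loss equals the log of the batch size, and it approaches zero as matched pairs dominate their rows and columns. Under this formulation, hard negative captions would appear as additional columns that are difficult to tell apart from the matched caption.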
Results and Analysis
In-domain retrieval accuracy improves under both full fine-tuning and transformers operating on frozen local (patch/token) embeddings, with the latter matching or surpassing full fine-tuning. Under distribution shift (BiSCoR-Ctrl), however, full fine-tuning and global-transformer variants fail to improve compositional generalization, reproducing the failures of naively pooled models.
In stark contrast, the alignment transformer trained on frozen local features achieves up to a fourfold increase in group score on BiSCoR-Ctrl (e.g., from 8.5 to 30.0, roughly 3.5 times, for the Perception Encoder backbone), while global-transformer variants and compositional training methods yield negligible improvements. Notably, the hardest compositional subset ("Swap") shows amplified gains, and results are robust to automatic text segmentation (via spaCy), indicating that the improvements stem from exploiting fine-grained structure already captured by the frozen VLM backbones.
Additionally, a comparison of training datasets indicates that standard COCO, despite lacking hard negatives, often yields stronger out-of-domain compositionality than more aggressively augmented datasets, suggesting that noise and shortcut induction can degrade alignment robustness.
Implications for Compositionality and Inference in VLMs
These findings indicate that compositional failures in modern dual-encoder VLMs are fundamentally attributable to the global similarity matching paradigm at inference, not to insufficiently compositional representations. In fact, substantial localized alignment information is already present in patch and token embeddings, but only becomes useful for compositional reasoning via principled alignment mechanisms at inference.
These results call into question evaluations and compositionality metrics that rely on pooled similarity and underscore the necessity of modeling region–segment or token-level interactions for accurate vision–language reasoning. The systematic superiority of inference-time alignment over massive backbone fine-tuning has critical implications for model interpretability, efficiency, and deployment in distribution-shifted regimes.
From a practical standpoint, the approach facilitates highly compositional behavior in current VLMs without expensive retraining. Theoretically, it motivates new architectures and training regimes that explicitly encode and exploit fine-grained alignment for robust vision–language compositionality.
Directions for Future Work
The work points out several opportunities for further advancement:
- Extending evaluation to real-world compositional reasoning benchmarks beyond BiSCoR-Ctrl, addressing natural images and unconstrained language.
- Investigating cross-modal alignment modules that benefit from, but do not exclusively depend on, fixed pretrained representations—potentially combining alignment learning with more localized supervision.
- Applying similar protocols to generative modeling tasks, e.g., text-to-image or image captioning, where compositional faithfulness and interpretability remain unsolved challenges.
- Exploring scalable alignment architectures and efficient inference mechanisms to enable deployment in large-scale or real-time multimodal systems.
Conclusion
This paper establishes, through controlled experiments and model variants, that dual-encoder VLMs' compositional reasoning failures primarily reflect global pooling and similarity mechanisms at inference rather than deficiencies in representational structure. Explicitly enforcing or learning region–segment alignment over fixed patch and token embeddings produces robust improvements, including under challenging distribution shifts, where existing architectural and training innovations offer little benefit. This underscores the need to revisit inference design in multimodal architectures for both reliable compositionality and generalization (2604.11496).