Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

Published 31 Mar 2026 in cs.CV and cs.AI | (2603.29258v1)

Abstract: Vision-language models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs, such as CLIP, perform poorly in understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP's understanding of two types of negation, namely presence-based negation and absence-based negation, which correspond to negated expressions of objects that are actually present in an image and those that may plausibly exist in an image but are in fact absent, respectively, by modifying CLIP's original InfoNCE contrastive loss. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both original and absence-based negated caption embeddings while maintaining a semantic distinction between the two text embeddings. Based on our observation that the front transformer layers of the CLIP text encoder have stronger learning ability for negated text than the later layers, we fine-tune the front transformer layers of the CLIP text encoder at each training step using the combined contrastive objective. Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based negation and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general capability in image-text retrieval and even improving it by up to 19.62%. Compared with prior works, Omni-NegCLIP demonstrates a more comprehensive ability to understand multiple types of negation tasks.


Authors (1)

Jingqi Xu

Summary

The paper introduces front-layer fine-tuning of CLIP using specialized contrastive losses for both presence-based and absence-based negation, yielding significant performance improvements.
The paper achieves up to 52.65% accuracy improvement on presence-based negation tasks while maintaining strong retrieval performance in general vision-language benchmarks.
The paper demonstrates that optimizing only the initial transformer layers enhances fine-grained syntactic modeling and negation understanding without compromising overall semantic alignment.

Omni-NegCLIP: Front-Layer Contrastive Fine-Tuning for Comprehensive Negation in Vision-Language Models

Motivation and Background

Vision-language models (VLMs), particularly CLIP, have become foundational in multimodal AI due to their robust image-text alignment via contrastive learning. Despite these strengths, recent evidence shows that CLIP handles linguistic negation poorly, particularly in sentences containing "no," "not," or "without" [singh2024learn]. Negation is structurally critical for natural language semantics, and CLIP's insensitivity to it undermines commonsense reasoning, robustness, and downstream utility in tasks such as retrieval and text-to-image (T2I) generation. Previous approaches, including CoN-CLIP and NegationCLIP, each target a single type of negation and perform poorly when applied to the complementary form.

Formalization of Negation Tasks

Omni-NegCLIP addresses the lack of comprehensive negation understanding by formally defining and tackling two critical forms:

Presence-Based Negation: Negation of entities actually present in the image (e.g., an image of a cat, caption: "a cat", negated: "not a cat").
Absence-Based Negation: Negation involving plausible but absent entities (e.g., an image without a dog, caption: "a cat", negated: "no dog").

Illustrative examples from CC-Neg and NegRefCOCOg clarify the semantic distinction.

Figure 1: Example instances of presence-based and absence-based negation from CC-Neg and NegRefCOCOg benchmarks.
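
The distinction between the two negation forms can be made concrete with a small data structure. The sketch below is illustrative only: the class name `NegationTriplet`, its fields, and the `image_id` value are hypothetical and do not come from the paper or its datasets. It encodes the one property that separates the two forms: an absence-based negated caption is still a correct description of the image, while a presence-based one contradicts it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NegationTriplet:
    """One (image, caption, negated caption) example. Hypothetical schema."""
    image_id: str          # placeholder identifier, not a real dataset key
    caption: str           # original caption, true of the image
    negated_caption: str   # negated variant of the caption
    negation_type: str     # "presence" or "absence"

    @property
    def negated_caption_is_true(self) -> bool:
        # Absence-based: the negated object really is absent, so the negated
        # caption still describes the image. Presence-based: the negated
        # object is in the image, so the negated caption is false.
        return self.negation_type == "absence"


# The two examples used in the task definitions above.
presence = NegationTriplet("cat_photo", "a cat", "not a cat", "presence")
absence = NegationTriplet("cat_photo", "a cat", "no dog", "absence")
```

This is why the two objectives treat the negated caption differently: it acts as a hard negative in the presence-based case and as a second positive in the absence-based case.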

Contrastive Objective Design

Omni-NegCLIP extends the canonical InfoNCE objective used in CLIP by constructing custom contrastive losses for each negation form:

Presence-Based Contrastive Objective: Aligns image embeddings with true caption embeddings and penalizes similarity to their negated caption embeddings. This is achieved via a combination of image-to-caption, caption-to-image, and explicit negation discrimination losses, resulting in substantial semantic separation between image and negated caption.
Absence-Based Contrastive Objective: Aligns image embeddings equally with original and absence-based negated captions but introduces a margin-based penalty to enforce explicit semantic distinction in the text embedding space between original and negated captions.

Both objectives are computed over mini-batches of (image, caption, negated caption) triplets, each with dedicated loss components that target fine-grained negation discrimination and robustness.
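
The two objectives described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the paper's exact formulation: the function names, the equal weighting of terms, the two-way softmax used for negation discrimination, the hinge form of the margin penalty (penalizing caption pairs whose cosine similarity exceeds `margin`), and the temperature value are all illustrative choices that this summary does not specify.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def info_nce(a, b, temperature):
    """Mean cross-entropy of matching row i of `a` to row i of `b`."""
    logits = a @ b.T / temperature
    return -np.mean(np.diag(log_softmax(logits)))


def presence_loss(img, cap, neg, temperature=0.07):
    """Assumed form: symmetric InfoNCE plus a term that treats each image's
    own negated caption as a hard negative."""
    img, cap, neg = map(l2_normalize, (img, cap, neg))
    symmetric = 0.5 * (info_nce(img, cap, temperature)
                       + info_nce(cap, img, temperature))
    # Two-way choice per image: column 0 = true caption, column 1 = negated.
    pos = np.sum(img * cap, axis=1)
    hard = np.sum(img * neg, axis=1)
    pair_logits = np.stack([pos, hard], axis=1) / temperature
    discrimination = -np.mean(log_softmax(pair_logits)[:, 0])
    return symmetric + discrimination


def absence_loss(img, cap, neg, margin=0.9, temperature=0.07):
    """Assumed form: align the image with both captions, and penalize caption
    pairs whose cosine similarity exceeds `margin`."""
    img, cap, neg = map(l2_normalize, (img, cap, neg))
    align = 0.5 * (info_nce(img, cap, temperature)
                   + info_nce(img, neg, temperature))
    text_sim = np.sum(cap * neg, axis=1)
    separation = np.mean(np.maximum(0.0, text_sim - margin))
    return align + separation
```

Each function takes three arrays of shape (batch, dim) holding image, caption, and negated-caption embeddings. How the two losses are weighted when combined is not stated in this summary, so the sketch leaves that step out.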

Figure 2: Schematic of presence-based and absence-based contrastive objectives and the Omni-NegCLIP fine-tuning pipeline.

Layer-wise Negation Learning: Front-Layer Dominance

A layer-wise fine-tuning analysis reveals that fine-tuning the front transformer layers of CLIP's text encoder yields markedly higher accuracy on both negation types than fine-tuning the later layers. Fine-tuning only the first $K$ layers, as opposed to the later layers or all layers, is empirically more effective. This is hypothesized to stem from the need to model fine-grained syntactic structures and negation scope, which are predominantly encoded in the early transformer layers [dumpala2024seeing].

Figure 3: Cross-layer accuracy profiles for presence-based and absence-based negation tasks, showing superior negation modeling in front transformer layers.

Fine-Tuning Pipeline

In practice, Omni-NegCLIP fine-tunes only the front $K$ transformer layers and the projection layer of the CLIP text encoder at each training step, using a combined loss from the presence- and absence-based objectives. This strategy maximizes negation learning while maintaining general semantic alignment. Optimization uses AdamW, the absence-based objective uses a margin parameter to separate the two text embeddings, and batch triplets are sampled from the CC-Neg and OAN datasets.
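
The layer selection described above can be sketched as a filter over parameter names. This is an illustration rather than the authors' code: the name patterns assume a Hugging Face `CLIPModel`-style naming scheme (`text_model.encoder.layers.<i>.…` and `text_projection.…`), and the helper names `is_trainable` and `split_parameters` are hypothetical. Other CLIP implementations name their parameters differently, so the patterns would need adjusting.

```python
import re

# Assumed naming scheme (Hugging Face CLIPModel-style); adjust for other
# implementations.
LAYER_PATTERN = re.compile(r"^text_model\.encoder\.layers\.(\d+)\.")
PROJECTION_PREFIX = "text_projection."


def is_trainable(param_name, k):
    """True if the parameter belongs to one of the first `k` text-encoder
    transformer layers or to the text projection layer."""
    if param_name.startswith(PROJECTION_PREFIX):
        return True
    match = LAYER_PATTERN.match(param_name)
    return match is not None and int(match.group(1)) < k


def split_parameters(param_names, k):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable = [n for n in param_names if is_trainable(n, k)]
    frozen = [n for n in param_names if not is_trainable(n, k)]
    return trainable, frozen
```

With a PyTorch model, one would typically loop over `model.named_parameters()`, set `param.requires_grad = is_trainable(name, k)`, and pass only the trainable parameters to `torch.optim.AdamW`. Everything else, including the image encoder and the later text layers, stays frozen.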

Experimental Results

Omni-NegCLIP is evaluated on three architectures: ViT-B/32, ViT-B/16, and ViT-L/14. The method is compared against pretrained CLIP, CoN-CLIP, and NegationCLIP on CC-Neg (presence-based), NegRefCOCOg (absence-based), and COCO retrieval (general capability).

Presence-Based Negation: Omni-NegCLIP improves accuracy by 51.60% to 52.65% over CLIP and by up to 45.26% over NegationCLIP.
Absence-Based Negation: Improvements reach 11.81% to 12.50% over CLIP, and up to 17.07% over CoN-CLIP.
General Retrieval: In COCO retrieval, Omni-NegCLIP outperforms CLIP by up to 19.62% and maintains high accuracy compared to NegationCLIP.

Notably, Omni-NegCLIP demonstrates balanced performance: unlike previous methods, it does not overfit to one negation form and performs well on both negation benchmarks as well as on general retrieval.
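
For readers unfamiliar with how such numbers are typically computed, the sketch below shows two common metrics. Both are assumptions about the evaluation protocol, which this summary does not spell out: a pairwise-choice accuracy (does the image score higher with the true caption than with the negated one?) and image-to-text Recall@k under the simplification that each image has exactly one matching text. The function names are hypothetical.

```python
import numpy as np


def pairwise_choice_accuracy(sim_true, sim_negated):
    """Fraction of images scored higher with the true caption than with the
    negated caption."""
    sim_true = np.asarray(sim_true, dtype=float)
    sim_negated = np.asarray(sim_negated, dtype=float)
    return float(np.mean(sim_true > sim_negated))


def recall_at_k(similarity, k):
    """Image-to-text Recall@k for a square similarity matrix, assuming text i
    is the single correct match for image i."""
    similarity = np.asarray(similarity, dtype=float)
    # Rank of the correct text = number of texts scoring strictly higher.
    correct = np.diag(similarity)[:, None]
    ranks = np.sum(similarity > correct, axis=1)
    return float(np.mean(ranks < k))
```

Standard COCO retrieval pairs each image with several captions, so a faithful evaluation would need a many-to-one mapping in place of the diagonal used here.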

Ablation and Analysis

Systematic ablation studies confirm the necessity of the explicit negation discrimination losses and margin penalties. Fine-tuning more than six front layers, or using inappropriate loss configurations, reduces accuracy, emphasizing that front-layer adaptation and carefully designed objectives are critical. Proper margin settings (e.g., $m=0.9$) are essential to enforce text embedding separation without semantic ambiguity.

Theoretical and Practical Implications

Omni-NegCLIP's architecture and objectives represent a new paradigm for enhancing multimodal models' interpretability and linguistic robustness. By successfully disentangling semantic negation across both presence and absence scenarios, it closes critical gaps in VLM reasoning. The methodology highlights the importance of front-layer transformer adaptation for modeling syntactic and semantic subtleties, suggesting future directions in hierarchical or syntactic-aware fine-tuning protocols for VLMs.

Practically, improved negation comprehension in CLIP unlocks advancement in search, question answering, and generative tasks, enabling models to better handle compositional linguistic phenomena. The results suggest robust transferability to other foundational vision-language architectures.

Conclusion

Omni-NegCLIP introduces front-layer contrastive fine-tuning combined with bespoke objectives for both presence-based and absence-based negation, demonstrably improving CLIP's negation understanding without sacrificing general capability. The findings establish front-layer transformer adaptability as crucial for fine-grained semantic modeling, paving the way for future VLM improvements in linguistic robustness and compositional reasoning.
