
NegBench Benchmark Evaluation

Updated 22 December 2025
  • NegBench is an evaluation benchmark that appraises the ability of retrieval models to process explicit negation in natural language queries using paired positive and negative captions.
  • It constructs diverse test cases by generating negated caption variants from datasets like COCO and MSR-VTT, and evaluates models with zero-shot protocols using metrics such as R@5 and R-Neg@5.
  • Results demonstrate that advanced methods like TARA and negation-specific models improve logical discrimination, setting a new standard for handling negation in multimodal retrieval tasks.

NegBench is an evaluation benchmark constructed to assess the capability of vision-language retrieval models to handle explicit negation in natural language queries. Developed in the context of image and video retrieval, its principal use is to evaluate whether retrieval models avoid selecting content that directly conflicts with a query's negation clause. NegBench was introduced in conjunction with the TARA (Time Aware Retrieval Adaptation) framework, but is designed for general use with both standard and negation-specific models (Bagad et al., 15 Dec 2025).

1. Benchmark Construction

NegBench consists of positive and explicitly negated (hard negative) captions paired with standard image and video datasets. Specifically, it uses the COCO 2017 validation set (5,000 images) and the standard MSR-VTT test split (1,000 videos). For each original human-written caption C, multiple corresponding negated variants are generated by inserting negation clauses referencing "distractor" objects X, i.e., objects not present in the associated image or video. Each pair thus consists of a positive caption (unaltered human annotation) and one or more negative captions such as:

  • “There is no dog in the image. C.”
  • “C. There is no dog in the image.”

To diversify the linguistic forms of negation, paraphrases of both positive and negative captions are generated using LLaMA 3.1, thereby mitigating issues related to overfitting on specific templates and introducing greater lexical and syntactic variety. Example constructions include:

  • “A cat sits on a sofa.” (original)
  • “There is no cat on the sofa. A cat sits on a sofa.” (negated)
  • “A man is chopping vegetables.” (original)
  • “There is no man chopping vegetables in this clip.” (negated) (Bagad et al., 15 Dec 2025).
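
The construction above can be sketched in a few lines. This is a minimal illustration of the two negation templates quoted above, not the paper's generation pipeline; the helper name and the fixed "in the image" wording are our assumptions (the paper also paraphrases variants with LLaMA 3.1, which is not shown here).

```python
# Illustrative sketch (not the paper's code) of NegBench-style
# negated-caption construction from a caption and a distractor object.

def make_negated_variants(caption: str, distractor: str) -> list[str]:
    """Build negation-first and negation-last variants of a caption.

    `distractor` is an object known to be absent from the paired image/video.
    """
    negation = f"There is no {distractor} in the image."
    return [
        f"{negation} {caption}",  # negation-first variant
        f"{caption} {negation}",  # negation-last variant
    ]

variants = make_negated_variants("A cat sits on a sofa.", "dog")
# variants[0] == "There is no dog in the image. A cat sits on a sofa."
```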

2. Evaluation Protocol

NegBench evaluates retrieval models using a zero-shot protocol. For each test case, all gallery images (COCO) or videos (MSR-VTT) are considered, with no separate splits for training or validation within NegBench itself. The two primary retrieval tasks are:

  • Text → Image retrieval (COCO val, 5,000-way)
  • Text → Video retrieval (MSR-VTT, 1,000-way)

Queries consist of both the positive (true) and negated (false for the ground-truth sample) captions, testing retrieval models’ sensitivity to logical mismatches, particularly negation.
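
Operationally, the zero-shot protocol reduces to ranking the entire gallery by query–item similarity for each caption. A minimal pure-Python sketch of this step, assuming precomputed embeddings (the helper is ours, not from the paper; real evaluations would use the model's own similarity scores):

```python
import math

# Illustrative helper (an assumption, not the paper's code): given a query
# embedding and a gallery of item embeddings, return the 1-indexed rank of
# the ground-truth item under cosine similarity.
def rank_of_ground_truth(query_emb, gallery_embs, gt_index):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    sims = [cos(query_emb, g) for g in gallery_embs]
    # Rank = 1 + number of gallery items scored strictly higher than the ground truth.
    return 1 + sum(s > sims[gt_index] for s in sims)

# Toy 2-D gallery: item 0 matches the query direction exactly.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
print(rank_of_ground_truth([1.0, 0.0], gallery, 0))  # 1
```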

3. Metrics and Scoring

Performance is measured via Recall@5 (R@5), representing the fraction of queries for which the ground-truth item is ranked within the top 5 results:

R@K = \frac{1}{Q} \sum_{i=1}^{Q} \mathbb{1}\{\mathrm{rank}_i \leq K\}

where Q is the number of queries and \mathrm{rank}_i is the 1-indexed position of the correct item for query i.

A negation-sensitive variant, R-Neg@5, is computed identically to R@5 but restricted to the subset of queries containing explicit negations. This metric directly measures a model’s ability to reject matched images or videos when the textual query includes a negation about a non-existent object/action.
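
Both metrics follow directly from the per-query ranks. A minimal sketch, assuming 1-indexed ranks and a per-query negation flag (helper names are ours):

```python
# Minimal sketch of R@K and R-Neg@K scoring from precomputed ranks.
# `ranks[i]` is the 1-indexed position of the correct item for query i.

def recall_at_k(ranks, k=5):
    """R@K: fraction of queries whose ground-truth item ranks in the top K."""
    return sum(r <= k for r in ranks) / len(ranks)

def recall_neg_at_k(ranks, is_negated, k=5):
    """R-Neg@K: R@K restricted to queries containing explicit negation."""
    neg_ranks = [r for r, neg in zip(ranks, is_negated) if neg]
    return recall_at_k(neg_ranks, k)

# Toy example: 4 queries, the middle two are negated.
ranks = [1, 7, 3, 12]
is_negated = [False, True, True, False]
print(recall_at_k(ranks))                   # 0.5
print(recall_neg_at_k(ranks, is_negated))   # 0.5
```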

4. Baseline Models and Fine-Tuning Variants

Several representative baseline models are included in NegBench evaluations:

| Model | Fine-tune Data | Purpose/Type |
|---|---|---|
| CLIP (ViT-B/32) | None / CC12M / NegCap / etc. | Foundation model, with and without negation exposure |
| NegCLIP | Same variants | Explicitly negation-sensitive model |
| Tarsier-7B | None | TARA base MLLM |
| Tarsier-7B + TARA | Ours (no video) | TARA text-only adaptation |

“NegCLIP” refers to a model explicitly designed to address negation [Alhamoud et al. 2025]; CLIP and Tarsier-7B provide foundation model baselines. Fine-tuning regimes include large-scale datasets (CC12M), as well as negation-augmented variants: CC12M+NegCap (caption-level negation) and CC12M+NegFull (fully synthetic negations).

5. Quantitative Results

Comprehensive results are presented for both COCO and MSR-VTT. Table A below (R@5, R-Neg@5) summarizes core findings.

| Model | Fine-tune data | COCO R@5 | COCO R-Neg@5 | MSR-VTT R@5 | MSR-VTT R-Neg@5 |
|---|---|---|---|---|---|
| CLIP | None | 54.8 | 48.0 | 50.6 | 45.8 |
| CLIP | CC12M | 58.8 | 54.5 | 53.7 | 49.9 |
| CLIP | CC12M + NegCap | 58.5 | 57.8 | 54.1 | 53.5 |
| CLIP | CC12M + NegFull | 54.2 | 51.9 | 46.9 | 43.9 |
| NegCLIP | None | 68.7 | 64.4 | 53.7 | 51.0 |
| NegCLIP | CC12M | 70.2 | 66.0 | 56.4 | 52.6 |
| NegCLIP | CC12M + NegCap | 68.6 | 67.5 | 56.5 | 54.6 |
| NegCLIP | CC12M + NegFull | 69.0 | 67.0 | 54.0 | 51.5 |
| Tarsier-7B | None | 57.4 | 45.6 | 55.7 | 49.7 |
| Tarsier-7B + TARA | Ours (no video) | 72.6 | 68.7 | 69.0 | 68.7 |

TARA (zero-shot) achieves large improvements on both general (R@5) and negation-sensitive (R-Neg@5) scores relative to all other models. For example, on COCO validation, R@5 improves from 57.4 to 72.6 (+15.2 points) and R-Neg@5 from 45.6 to 68.7 (+23.1 points) compared to the Tarsier-7B base model. Negation-specific models like NegCLIP improve over CLIP by up to roughly 16 points on COCO R-Neg@5 but still trail TARA, by about 1 point on COCO and 14 points on MSR-VTT. No run-to-run significance intervals are reported, but the reported deltas exceed typical variance (±0.5 points in other tasks).
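
The headline deltas can be recomputed directly from Table A as a quick sanity check (values below are copied from the COCO columns of the table):

```python
# Recompute the COCO deltas quoted above from Table A.
base = {"R@5": 57.4, "R-Neg@5": 45.6}  # Tarsier-7B (base)
tara = {"R@5": 72.6, "R-Neg@5": 68.7}  # Tarsier-7B + TARA

deltas = {metric: round(tara[metric] - base[metric], 1) for metric in base}
print(deltas)  # {'R@5': 15.2, 'R-Neg@5': 23.1}
```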

6. Qualitative Behavior and Interpretations

Qualitative analysis highlights that, for negated queries (e.g., “There is no dog in this image”), standard models often include forbidden items (e.g., images containing dogs) in the top 5 retrieved results. In contrast, TARA robustly demotes mismatched items below this threshold, aligning with its gains in R-Neg@5. Notably, despite TARA using only text-triplet contrastive adaptation (with no vision or explicit negation data) during fine-tuning, it learns to respect negation—a qualitative property not observed in baseline representations. The authors argue that training on text triplets with a contrastive objective sharpens sensitivity to any logical query–gallery mismatch, including but not limited to negation. However, no formal ablation is reported on NegBench itself, and no per-query error analyses or qualitative retrieval maps are shown (Bagad et al., 15 Dec 2025).

7. Significance, Limitations, and Applications

NegBench fills a crucial gap by providing an explicit testbed for negation in retrieval. By augmenting standard datasets with systematic negation and paraphrase, NegBench isolates the ability of models to process logical exclusion—a common failure mode in vision-language systems. The most significant finding is that TARA, without access to vision data or explicit negations during adaptation, surpasses strong vision-language and negation-specific baselines. This suggests that text-only contrastive adaptation is capable of imparting logical discrimination, including negation sensitivity, to multimodal representations.

A plausible implication is that further improvements may be achieved by enriching text triplets with other forms of logical structure, though this remains to be demonstrated. Limitations include the absence of ablations on NegBench and no statistical testing of result stability. NegBench’s utility is as a pure evaluation suite; it does not supply new training data, and its methodology is dependent on the integrity of “distractor” object mining and effective paraphrase generation.

NegBench is a robust and extensible evaluation resource for the field of multimodal retrieval, directly enabling the benchmark and comparison of negation-aware capabilities in both generalist and specialized architectures (Bagad et al., 15 Dec 2025).
