
RePOPE: Vision–Language Benchmark Analysis

Updated 5 February 2026
  • RePOPE is a rigorously re-annotated vision–language benchmark designed to evaluate object hallucination in MLLMs using consensus-based re-annotation.
  • Its systematic re-annotation removes ambiguous pairs and corrects errors, leading to notable shifts in model rankings and evaluation metrics.
  • RePOPE extends to audio-augmented testing, demonstrating that spoken queries and acoustic noise further impact MLLM performance.

RePOPE is a rigorously re-annotated and error-corrected vision–language benchmark designed to evaluate object hallucination in multimodal LLMs (MLLMs). Originally derived as a remediation of the POPE benchmark, RePOPE systematically addresses annotation inaccuracies inherent in large-scale dataset reuse by establishing consensus-based ground truth, quantifying ambiguities, and measuring the impact of annotation errors on MLLM performance and model ranking. It has catalyzed both methodological best practices for benchmark construction and the development of audio-augmented variants probing robustness under spoken queries.

1. Formal Foundations and Evaluation Metrics

Each RePOPE entry is an image–question pair requiring a binary (“Yes”/“No”) response. Let the ground-truth label space be $\{\text{Yes}, \text{No}\}$. For a model’s predictions and gold labels, the evaluation counts are:

  • True Positives (TP): Model says “Yes”, label is “Yes”
  • False Positives (FP): Model says “Yes”, label is “No”
  • True Negatives (TN): Model says “No”, label is “No”
  • False Negatives (FN): Model says “No”, label is “Yes”

The principal metrics are:

  • Precision: $P = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$
  • Recall: $R = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$
  • $F_1$ score: $F_1 = \frac{2PR}{P+R}$
  • Accuracy: $A = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}$
  • Mean $F_1$ across negative-sampling modes (random, popular, adversarial): $F_1^{\text{POPE}} = \frac{1}{3}\left(F_1^{\text{rand}} + F_1^{\text{pop}} + F_1^{\text{adv}}\right)$
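The metrics above can be sketched directly from a model's binary answers. The helper names below are illustrative, not from the RePOPE release:

```python
def confusion(preds, golds):
    """Count TP/FP/TN/FN for binary "Yes"/"No" answers."""
    tp = sum(p == "Yes" and g == "Yes" for p, g in zip(preds, golds))
    fp = sum(p == "Yes" and g == "No" for p, g in zip(preds, golds))
    tn = sum(p == "No" and g == "No" for p, g in zip(preds, golds))
    fn = sum(p == "No" and g == "Yes" for p, g in zip(preds, golds))
    return tp, fp, tn, fn

def metrics(preds, golds):
    """Precision, recall, F1, and accuracy from prediction/gold lists."""
    tp, fp, tn, fn = confusion(preds, golds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"P": precision, "R": recall, "F1": f1, "A": accuracy}

def mean_f1(per_split_f1):
    """POPE-style mean F1 over the random/popular/adversarial splits."""
    return sum(per_split_f1[s] for s in ("rand", "pop", "adv")) / 3
```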

RePOPE also introduces an annotation error rate, $E = \frac{\#\,\text{incorrect labels}}{\#\,\text{total labels}}$. In the original POPE labels, errors manifest asymmetrically: among positives, $E_\text{yes} = 9.3\%$ are incorrect and $13.8\%$ ambiguous; among negatives, $E_\text{no} = 1.7\%$ are incorrect and $4.3\%$ ambiguous (Neuhaus et al., 22 Apr 2025).

2. Re-annotation Methodology and Data Curation

RePOPE’s construction follows a stringent multi-stage process:

  • Image and Question Selection: 500 images (from the MSCOCO val split, each with $\geq 3$ annotated objects) are paired with six yes/no prompts per image: three affirming (objects annotated as present) and three negating (objects absent, sampled via random, popular, or adversarial selection).
  • Independent Expert Annotation: Two expert annotators relabel all 3,000 image–question pairs as “Yes”, “No”, or “Ambiguous”, guided by consensus rules and precise object definitions. Annotators assign “Ambiguous” where MSCOCO’s class labels are underspecified or conflicting (e.g., “teddy bear” vs. “bear”).
  • Consensus and Disambiguation: Annotators jointly resolve labeling disagreements. All ambiguous pairs are excluded from the final benchmark. Pairs where the consensus label contradicts POPE’s original annotation are flipped.
  • Resulting Dataset: After removing ambiguous pairs (13.8% of positives, 4.3% of negatives), about 2,545 pairs remain. The flipping process corrects 9.3% of positives and 1.7% of negatives. Error rates among negatives vary by sampling type: random (0.3%), popular (2.6%), adversarial (2.2%), while positive errors are uniform.
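The consensus and filtering steps above can be sketched as follows. The field names (per-annotator labels, original POPE label) are hypothetical placeholders, not the release's actual schema:

```python
def curate(pairs):
    """Keep only pairs with an agreed label; drop ambiguous pairs;
    flip labels that contradict the original POPE annotation."""
    kept = []
    for pair in pairs:
        a, b = pair["annotator_a"], pair["annotator_b"]
        # Joint resolution: agreement wins; disagreements use a consensus label.
        label = a if a == b else pair.get("consensus")
        if label in (None, "Ambiguous"):
            continue  # ambiguous pairs are excluded from RePOPE
        kept.append({**pair, "label": label,
                     "flipped": label != pair["pope_label"]})
    return kept
```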

3. Dataset Statistics and Structure

The table below summarizes key RePOPE dataset statistics:

| Category | Original Pairs | Ambiguous Removed | Labels Flipped | Error Rate |
|---|---|---|---|---|
| Positives (“Yes”) | 2 × 500 × 3 | 414 (13.8%) | 278 (9.3%) | $E_\text{yes} = 9.3\%$ |
| Negatives (“No”) | 2 × 500 × 3 | 65 (4.3%) | 26 (1.7%) | $E_\text{no} = 1.7\%$ |

Total after post-processing: ~2,545 pairs.

Ambiguities and annotation errors are not uniformly distributed. A notable asymmetry is observed: positive splits contain significantly more ambiguities and mislabels, which previously inflated model TP counts and distorted F1 rankings.

4. Experimental Protocol and Model Evaluation

The evaluation protocol applies the same streamlined pipeline across state-of-the-art open-weight MLLMs, including InternVL2.5 (multiple parameter scales), Ovis2 (1–8B), LLaVA–NeXT (various base models), LLaVA–OneVision, and PaliGemma–3B/10B (Neuhaus et al., 22 Apr 2025).

For each model:

  • All image–question pairs are presented under all three negative-sampling schemes.
  • Models output binary judgments, from which all relevant metrics (TP, FP, TN, FN, precision, recall, F1, accuracy, “yes-ratio”) are derived.
  • Metrics and model rankings are reported both for the original and re-annotated (RePOPE) versions, allowing direct measurement of annotation error impact.
  • No significance tests are reported for metric differences.
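The per-model protocol above can be sketched as a small evaluation loop; `model` stands in for any MLLM wrapper returning "Yes"/"No", and the split names are assumptions:

```python
def evaluate(model, splits):
    """splits: {"rand"/"pop"/"adv": list of (image, question, gold) tuples}.
    Returns per-split F1 and yes-ratio, plus the mean F1 across splits."""
    report = {}
    for name, pairs in splits.items():
        preds = [model(img, q) for img, q, _ in pairs]
        golds = [g for _, _, g in pairs]
        tp = sum(p == g == "Yes" for p, g in zip(preds, golds))
        fp = sum(p == "Yes" and g == "No" for p, g in zip(preds, golds))
        fn = sum(p == "No" and g == "Yes" for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        report[name] = {
            "f1": 2 * prec * rec / (prec + rec) if prec + rec else 0.0,
            "yes_ratio": sum(p == "Yes" for p in preds) / len(preds),
        }
    report["mean_f1"] = sum(r["f1"] for r in report.values()) / len(splits)
    return report
```

Running the same loop against the original POPE labels and the RePOPE labels gives the paired metric shifts reported below.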

5. Quantitative Results and Shifts in Model Rankings

The correction and removal of ambiguous or incorrect annotations induce substantial measurement shifts:

  • True Positives (TP): Drop by 10–15% across all models on RePOPE relative to POPE.
  • False Positives (FP): Nearly double on the random negative split; minimal changes on popular, slight decrease on adversarial.
  • Precision: Uniformly decreases for all models.
  • Recall: Increases as a consequence of correcting previously missing positives.
  • F1 Score and Ranking: The top-3 mean-F1 models on POPE (InternVL2.5–26B at 90.1%, Ovis2–8B at 90.0%, Ovis2–4B at 89.9%) are overtaken on RePOPE by Ovis2–4B (94.2%), InternVL2.5–78B (94.1%), and Ovis2–8B (94.1%). InternVL2.5’s smaller variants drop markedly in rank.
  • Hallucination Measurement: On the random split, nearly half the false positives in POPE were due to omitted ground-truth objects; thus, RePOPE reveals the original random-negative subset as an unreliable hallucination testbed.

Strong correlation is observed between the per-model decrease in yes-ratio and gain in F1, indicating that models with more cautious affirmative rates are comparatively robust post-correction.
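That correlation check amounts to a Pearson coefficient over per-model deltas; the delta values below are illustrative, not the paper's numbers:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-model shifts from POPE to RePOPE:
delta_yes_ratio = [-0.08, -0.05, -0.02, -0.01]
delta_f1 = [0.05, 0.04, 0.02, 0.01]
# Larger drops in yes-ratio pair with larger F1 gains: strongly negative r.
r = pearson(delta_yes_ratio, delta_f1)
```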

6. Analysis, Implications, and Recommendations

The imbalance in annotation error rates (~9.3% for positives vs. ~1.7% for negatives) produces first-order effects in benchmark results. Corrections cause systematic ranking shifts: models with high recall but moderate precision lose their advantage, while those with balanced metrics rise.

Key recommendations distilled from RePOPE’s methodology include:

  • Consensus-based Re-annotation: Ground-truth for all test examples should be established by consensus rather than inherited from older datasets.
  • Explicit Handling of Ambiguity: Labeling ambiguous or underspecified cases distinctly and removing them from final evaluation avoids artificial compression or inflation in metrics.
  • Benchmark Transparency: Public release of relabeled ground-truth and annotation reasoning is essential for future reproducibility.
  • Statistical Rigor: Statistical testing (e.g., bootstrapped confidence intervals, paired $t$-tests) should accompany result reporting, though none is present in the original RePOPE release.
  • Benchmark Evolution: Even after correction, RePOPE saturates at high TPR/TNR; more challenging negatives (e.g., DASH-B) and replicated annotation are needed to keep the benchmark discriminative.
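The statistical-rigor recommendation can be sketched as a percentile bootstrap for F1 over (prediction, gold) pairs; this is a generic illustration, not part of the RePOPE release:

```python
import random

def f1(pairs):
    """F1 from a list of (prediction, gold) "Yes"/"No" pairs."""
    tp = sum(p == g == "Yes" for p, g in pairs)
    fp = sum(p == "Yes" and g == "No" for p, g in pairs)
    fn = sum(p == "No" and g == "Yes" for p, g in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for F1."""
    rng = random.Random(seed)
    stats = sorted(
        f1([rng.choice(pairs) for _ in pairs]) for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Reporting such intervals alongside point estimates would make ranking shifts like those in Section 5 directly testable.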

7. Extensions: RePOPE-Spk and Robustness to Spoken Queries

RePOPE has been extended to RePOPE-Spk, an audio-augmented benchmark that assesses hallucination rates when queries are delivered as speech under varied acoustic conditions (Park et al., 19 Sep 2025). Findings reveal:

  • Conversion from text to clean speech increases error rates by 2–5 percentage points; environmental noise (SNR = 0 dB) can drive errors up by 10–15 points relative to text.
  • Some models display marked input order sensitivity: the arrival order of image and speech impacts F1 (Gemma-3n: ΔF1 = 13.4pp).
  • Prompt engineering (many-shot, chain-of-thought) provides partial mitigation but does not restore text-level reliability.
  • Recommendations include evaluation across realistic acoustic regimes, explicit speech summarization before reasoning, and careful consideration of input-modality sequencing.
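The noisy-speech condition (e.g., SNR = 0 dB above) amounts to scaling a noise signal against the speech power before mixing. A minimal sketch, with plain lists standing in for waveforms (the actual RePOPE-Spk pipeline details are not specified here):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db,
    then add it to `speech` sample-by-sample."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

At `snr_db=0` the scaled noise carries exactly the same average power as the speech, the harshest condition reported above.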

This suggests that even state-of-the-art MLLMs exhibit substantial unreliability under audio-based queries, underscoring the ongoing need for robust negative construction, consensus annotation, and multi-modal evaluation protocols.

