- The paper introduces RePOPE, a re-annotation of POPE that corrects MSCOCO labeling errors to better assess object hallucination in vision-language models.
- It uses a consensus-based labeling scheme (Yes, No, Ambiguous); the corrected labels reveal substantial changes in F1 scores and model rankings.
- The results show that label quality markedly affects measured model performance, underscoring the need for careful dataset curation.
Overview of the Paper "RePOPE: Impact of Annotation Errors on the POPE Benchmark" (2504.15707)
The paper introduces RePOPE, a revised label set for the widely used object hallucination benchmark POPE, and evaluates how annotation errors inherited from the MSCOCO dataset affect model rankings. By re-annotating the benchmark images, the authors find significant shifts in model rankings under RePOPE, underscoring the importance of label quality in performance evaluation.
Introduction to POPE and RePOPE
The POPE benchmark is widely used to evaluate object hallucination in vision-language models (VLMs). It poses a binary classification task: a model is asked whether a specified object is present in an image. The benchmark is primarily based on 500 images from the MSCOCO dataset, which is valued for its exhaustive annotation of 80 object classes but is known to contain annotation errors. The paper introduces RePOPE, which re-annotates these images to correct label errors and assesses the impact of the corrections on model performance. A minimal sketch of such a query is shown below.
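As a rough, hedged illustration of the query format (the prompt wording and answer parsing below are assumptions for illustration, not the paper's exact implementation):

```python
# Illustrative POPE-style binary query; prompt wording and parsing are assumptions.
def build_pope_question(object_name: str) -> str:
    # POPE asks whether a specific object is present in the image.
    return f"Is there a {object_name} in the image? Please answer yes or no."

def parse_binary_answer(model_output: str) -> bool:
    # Map the free-form VLM answer to a binary "object present" prediction.
    return model_output.strip().lower().startswith("yes")

# Example: a hypothetical model answer scored against a ground-truth label.
prediction = parse_binary_answer("Yes, there is a dog on the couch.")
ground_truth = True  # label taken from POPE (or RePOPE after correction)
print(prediction == ground_truth)  # True
```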


Figure 1: RePOPE annotation examples demonstrating errors and inconsistencies in the original POPE labels.
Methodology
Construction of RePOPE
RePOPE is constructed by re-annotating all images used in POPE, assigning each query a label of "Yes", "No", or "Ambiguous" based on the consensus of two human annotators. Ambiguous labels cover cases where object presence is subjective or inconsistently annotated in MSCOCO; the corrections result in substantial revisions to the original label set.
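A minimal sketch of the consensus step, assuming two independent annotators whose agreement yields a definite label and whose disagreement (or explicit uncertainty) yields "Ambiguous"; the paper's exact adjudication protocol may differ:

```python
# Illustrative consensus merge of two annotators' judgments into RePOPE-style labels.
def merge_annotations(label_a: str, label_b: str) -> str:
    """Return "Yes" or "No" when both annotators agree, otherwise "Ambiguous"."""
    if label_a == label_b and label_a in ("Yes", "No"):
        return label_a
    return "Ambiguous"

print(merge_annotations("Yes", "Yes"))       # Yes
print(merge_annotations("Yes", "No"))        # Ambiguous
print(merge_annotations("No", "Ambiguous"))  # Ambiguous
```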
Evaluation Approach
Models are evaluated on both the original POPE labels and the revised RePOPE labels, enabling a direct comparison of how annotation errors affect the results. The evaluation covers the three POPE subsets, random, popular, and adversarial, which differ in how the queried absent (negative) objects are sampled. A sketch of this dual evaluation follows.
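A hedged sketch of scoring identical model predictions against the two label sets, per subset; the data layout, the exclusion of "Ambiguous" items, and the helper names are assumptions for illustration:

```python
# Sketch: score the same predictions against POPE and RePOPE labels, per subset.
from collections import defaultdict

def f1_score(preds, labels):
    # preds/labels: parallel lists of booleans (True = "object is present").
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def evaluate(examples, label_key):
    # examples: dicts with "subset", "prediction", "pope_label", "repope_label".
    by_subset = defaultdict(lambda: ([], []))
    for ex in examples:
        label = ex[label_key]
        if label == "Ambiguous":  # drop ambiguous items from scoring (assumed convention)
            continue
        preds, labels = by_subset[ex["subset"]]
        preds.append(ex["prediction"] == "Yes")
        labels.append(label == "Yes")
    return {subset: f1_score(p, l) for subset, (p, l) in by_subset.items()}

# Usage: compare per-subset F1 under the original and corrected labels.
# f1_pope   = evaluate(examples, "pope_label")
# f1_repope = evaluate(examples, "repope_label")
```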
Experimental Results
The findings reveal notable changes in performance metrics, in particular in F1 scores and in the ranking of models, when evaluating on RePOPE instead of POPE:
- True Positives (TP) and False Positives (FP): Re-labeling substantially decreased TPs across all models, while its effect on FPs varied; in the adversarial subset, FPs decreased slightly because errors in which the queried object is actually present are more prevalent there (Figure 2; an illustrative recount is sketched below the figure).

Figure 2: POPE vs. RePOPE, showing substantial reductions in TP and varying FP patterns across the subsets.
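One way to see these shifts is to recount the confusion matrix for a fixed set of predictions under each label set; the numbers below are hypothetical and only illustrate the mechanism by which a corrected label turns a TP into an FP:

```python
# Hypothetical illustration: fixed predictions, recounted under corrected labels.
def confusion_counts(preds, labels):
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    tn = sum((not p) and (not l) for p, l in zip(preds, labels))
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

preds         = [True, True, False, False]  # the model's answers stay fixed
pope_labels   = [True, True, True,  False]  # original labels (second "Yes" is wrong)
repope_labels = [True, False, True, False]  # corrected: that object is not present

print(confusion_counts(preds, pope_labels))    # {'TP': 2, 'FP': 0, 'FN': 1, 'TN': 1}
print(confusion_counts(preds, repope_labels))  # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 1}
```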
Implications of the Research
The paper highlights the critical role of accurate data labeling in benchmarking VLMs and detecting object hallucinations. RePOPE offers a more reliable means of assessing model vulnerabilities and shows that relying on the original POPE labels can produce misleading evaluations due to annotation errors.
The corrected labels also change how saturated the benchmark appears, and the authors suggest complementing it with additional benchmarks such as DASH-B for more comprehensive assessments.
Conclusion
RePOPE corrects annotation errors within POPE and substantially changes the F1-based rankings of the evaluated models. This exposes how strongly VLM performance evaluations depend on the quality of dataset annotations and motivates more careful dataset curation. Subsequent studies are encouraged to build on the corrected labels to obtain more robust model evaluations.
Figure 4: POPE vs. RePOPE, highlighting that accuracy must be interpreted with care due to the class imbalance in RePOPE.
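As a hedged, worked illustration of why class imbalance makes accuracy hard to interpret (the numbers are hypothetical, not taken from the paper): a degenerate model that always answers "no" can look good on accuracy while its F1 collapses.

```python
# Hypothetical example: accuracy vs. F1 under an imbalanced yes/no label set.
def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def f1(preds, labels):
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

labels = [True] * 20 + [False] * 80  # assume only 20 of 100 questions are truly "yes"
always_no = [False] * 100            # a model that never says "yes"

print(accuracy(always_no, labels))   # 0.8 -- looks respectable
print(f1(always_no, labels))         # 0.0 -- reveals the degenerate behavior
```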