
RePOPE: Impact of Annotation Errors on the POPE Benchmark (2504.15707v1)

Published 22 Apr 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE .

Summary

  • The paper introduces RePOPE, a re-annotation of POPE that corrects MSCOCO errors to better assess object hallucination in vision-language models.
  • It employs a consensus-based labeling method (Yes, No, Ambiguous), revealing significant changes in F1 scores and model rankings.
  • Results emphasize that improved label quality markedly affects model performance, underscoring the need for meticulous dataset curation.

Overview of the Paper "RePOPE: Impact of Annotation Errors on the POPE Benchmark" (2504.15707)

The paper introduces RePOPE, a revised label set for the frequently used object hallucination benchmark POPE, and evaluates how annotation errors inherited from the MSCOCO dataset affect model evaluation. Re-annotating the benchmark images reveals significant shifts in model rankings under RePOPE, underscoring the importance of label quality in performance evaluation.

Introduction to POPE and RePOPE

The POPE benchmark is widely used to evaluate object hallucination in Vision-Language Models (VLMs). It poses a binary classification task in which models are asked whether a specified object is present in an image. The benchmark is built on 500 images from the MSCOCO dataset, which provides exhaustive annotations for 80 object classes but is known to contain annotation errors. The paper introduces RePOPE, which re-annotates these images to correct label errors and assesses the impact of these corrections on model performance.

Figure 1: RePOPE annotation examples demonstrating errors and inconsistencies in the original POPE labels.
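
To make the task format concrete, the sketch below illustrates a POPE-style binary query. This is not the authors' implementation; the `query_vlm` call and the exact prompt wording are hypothetical placeholders for whatever VLM interface is being evaluated.

```python
def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for a vision-language model call; plug in real inference here."""
    raise NotImplementedError

def pope_query(image_path: str, object_name: str) -> bool:
    """Ask whether `object_name` appears in the image and parse the yes/no answer."""
    prompt = f"Is there a {object_name} in the image? Please answer yes or no."
    answer = query_vlm(image_path, prompt)
    # True means the model claims the object is present.
    return answer.strip().lower().startswith("yes")
```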

Methodology

Construction of RePOPE

RePOPE is constructed by re-annotating all images from POPE, assigning each queried object a label of "Yes", "No", or "Ambiguous" based on the consensus of two human annotators. The "Ambiguous" label covers cases where object presence is subjective or inconsistently annotated in MSCOCO; overall, the re-annotation leads to substantial revisions of the original labels.
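
A minimal sketch of this consensus rule is given below; the exact handling of disagreements is inferred from the paper's description, not taken from the released annotation code.

```python
def merge_annotations(label_a: str, label_b: str) -> str:
    """Combine two annotators' judgments ("yes" / "no" / "ambiguous") into one RePOPE-style label."""
    if label_a == label_b:
        return label_a
    # No consensus, e.g. subjective or inconsistently annotated object presence.
    return "ambiguous"

print([merge_annotations(a, b) for a, b in [("yes", "yes"), ("yes", "no"), ("no", "no")]])
# ['yes', 'ambiguous', 'no']
```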

Evaluation Approach

Models are evaluated on both the original POPE labels and the revised RePOPE labels, enabling a direct comparison of the effect of annotation errors. The evaluation covers POPE's three subsets (random, popular, and adversarial), which differ in how the queried negative (non-annotated) objects are selected.
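
A rough sketch of this comparison, assuming model answers and both label sets are available as parallel binary lists per subset; the data layout and toy numbers are purely illustrative, not the released file format.

```python
from sklearn.metrics import f1_score

# Hypothetical layout: one binary entry per POPE question (1 = "yes"), grouped by subset.
per_subset = {
    "random":      {"answers": [1, 0, 1, 0], "pope": [1, 0, 0, 0], "repope": [1, 0, 1, 0]},
    "popular":     {"answers": [1, 1, 0, 0], "pope": [1, 0, 0, 0], "repope": [1, 1, 0, 0]},
    "adversarial": {"answers": [1, 1, 1, 0], "pope": [1, 0, 1, 0], "repope": [1, 1, 1, 1]},
}

for subset, d in per_subset.items():
    # Same model answers, scored against the original and the revised labels.
    print(f"{subset:12s} POPE F1 = {f1_score(d['pope'], d['answers']):.2f}  "
          f"RePOPE F1 = {f1_score(d['repope'], d['answers']):.2f}")
```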

Experimental Results

The findings reveal notable changes in performance metrics, particularly in F1 scores and model rankings, when models are evaluated on RePOPE instead of POPE:

  • True Positives (TP) and False Positives (FP): Re-labeling substantially decreased TPs across all models, while the effect on FPs varied; in the adversarial subset, FPs decreased slightly because a larger fraction of the queried negative objects are actually present in the image (Figure 2).

Figure 2: POPE vs. RePOPE displaying significant reductions in TP and variable patterns in FP across subsets.

  • Precision, Recall, and Accuracy: Precision generally decreased on RePOPE while recall improved, and the size of these shifts varies enough across models to change the rankings under an F1 criterion (standard metric definitions are recalled after this list). Accuracy is less informative on RePOPE because re-labeling leaves the positive and negative samples unbalanced (Figure 3).

Figure 3: POPE vs. RePOPE showing alterations in precision and recall that affect model rankings.
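
For reference, the metrics discussed above follow the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN):

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$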

Implications of the Research

The paper highlights the critical role of accurate data labeling in benchmarking VLMs and detecting object hallucinations. The introduction of RePOPE offers a more reliable means to assess model vulnerabilities and suggests that reliance on datasets like POPE might lead to misleading evaluations due to annotation errors.

The re-annotation also changes how close models appear to benchmark saturation, and the authors suggest complementing POPE with additional benchmarks such as DASH-B for more comprehensive assessment.

Conclusion

RePOPE corrects annotation errors in POPE and substantially changes the F1 rankings of the evaluated models. This highlights how strongly VLM performance evaluation depends on annotation quality and motivates more careful dataset curation. Subsequent studies are encouraged to build on the revised labels to ensure more robust model evaluations.

Figure 4: POPE vs. RePOPE highlighting how accuracy measurements need careful interpretation due to the imbalance in RePOPE.
