CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
The advent of large language models (LLMs) has significantly advanced the capabilities of vision-language models (VLMs), notably in generating detailed image captions. This paper addresses critical gaps in evaluating VLMs for detailed image captioning, a long-standing challenge at the intersection of computer vision and natural language processing. The authors introduce CapArena, a framework designed to scrutinize how well VLMs generate nuanced, comprehensive image descriptions and to benchmark these systems against human performance.
Key Insights and Contributions
This research centers on two pivotal questions: first, how well current VLMs actually perform on detailed image captioning compared with human benchmarks, and second, whether automated metrics can reliably assess detailed caption quality.
- Performance Benchmarking with CapArena:
- CapArena is a platform featuring over 6,000 pairwise caption comparisons with high-quality human preference annotations. This design enables a robust ranking of prominent models and provides evidence that leading models such as GPT-4o match, and sometimes surpass, human-level performance in detailed image captioning (see the ranking sketch after this list).
- The results reveal a marked gap between advanced commercial models and most open-source alternatives, which, despite their prevalence, fall short of commercial models on detailed captioning tasks.
- Evaluation of Captioning Metrics:
- Using human annotations as a benchmark, the paper comprehensively evaluates various traditional and novel captioning metrics, including METEOR and VLM-as-a-Judge.
- The analysis reveals systematic biases in several traditional metrics that misalign with human judgment. In contrast, VLM-as-a-Judge shows stronger agreement with human evaluations at both the caption and model levels (see the agreement sketch after this list).
- The authors propose CapArena-Auto, an economical and efficient automated benchmark that achieves a 94.3% correlation with human rankings at a fraction of the cost of manual evaluation.
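This summary does not specify how CapArena converts pairwise preferences into a leaderboard, so the following is only a minimal sketch: an Elo-style rating update applied to battle outcomes. The model names, K-factor, and tie handling below are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def elo_rankings(battles, k=32.0, base=1000.0):
    """Derive a leaderboard from pairwise caption battles.

    battles: iterable of (model_a, model_b, winner) tuples, with winner
    in {"a", "b", "tie"}. The Elo-style update here is an illustrative
    assumption; CapArena's actual rating scheme may differ.
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the standard Elo logistic curve.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical battle outcomes (entries for illustration only).
battles = [
    ("GPT-4o", "Open-Model-X", "a"),
    ("GPT-4o", "Human", "tie"),
    ("Open-Model-X", "Human", "b"),
]
for model, rating in elo_rankings(battles):
    print(f"{model}: {rating:.1f}")
```

A Bradley-Terry fit over the same outcomes is a common alternative to online Elo updates; the choice mainly affects rating stability rather than the qualitative ordering.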
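Likewise, the caption-level agreement used to judge metrics against human preferences, and the model-level correlation reported for CapArena-Auto, can be computed along the following lines. This is a generic sketch assuming a higher-is-better score per caption and a Spearman correlation over model ranks; the paper's exact protocol (e.g., how ties are handled) may differ.

```python
from scipy.stats import spearmanr

def caption_level_agreement(human_prefs, metric_scores):
    """Fraction of pairwise comparisons where the metric picks the
    same winner as the human annotator.

    human_prefs: list of (caption_id_a, caption_id_b, winner) tuples,
        with winner in {"a", "b"}; tie handling is omitted for brevity.
    metric_scores: dict mapping caption_id -> score (higher = better).
    """
    agree = sum(
        ("a" if metric_scores[a] > metric_scores[b] else "b") == winner
        for a, b, winner in human_prefs
    )
    return agree / len(human_prefs)

# Model-level check: how well an automated leaderboard tracks the human one.
# Both lists hold the same models' rank positions under each evaluation
# (hypothetical values for illustration only).
human_ranks = [1, 2, 3, 4, 5]
auto_ranks = [1, 3, 2, 4, 5]
rho, _ = spearmanr(human_ranks, auto_ranks)
print(f"Spearman correlation: {rho:.3f}")
```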
Implications and Future Directions
This work marks a significant advance in evaluation methodology for detailed image captioning. By rigorously examining how VLM-generated captions measure up against human-written ones, it provides a foundation for understanding and improving captioning systems. The implications extend beyond benchmarking: the analysis surfaces the strengths and limitations of existing models, guiding further development efforts in the AI community.
The paper suggests that future work could narrow the gap between open-source and commercial models by strengthening the visual perception components within VLMs, as exemplified by the performance of InternVL2-26B. Refining how VLMs handle diverse and intricate visual scenes will also be crucial for further progress.
In conclusion, CapArena provides a comprehensive framework for evaluating detailed image captioning, establishing a new standard for assessing VLM capabilities and paving the way for future improvements in vision-language integration.