CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness (2502.14914v3)
Abstract: Visual captioning benchmarks have become outdated with the emergence of modern multimodal LLMs (MLLMs), as their brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete visual element coverage. In this paper, we introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions. CAPability reliably assesses both the correctness and thoroughness of captions with *precision* and *hit* metrics. By converting annotations to QA pairs, we further introduce a heuristic metric, *know but cannot tell* ($K\bar{T}$), which reveals a significant gap between models' QA and captioning capabilities. Our work provides a holistic analysis of MLLMs' captioning abilities, identifying their strengths and weaknesses across dimensions and guiding future research to enhance specific aspects of their capabilities.
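The abstract does not spell out how *precision* and *hit* are computed. Below is a minimal sketch of one plausible reading, assuming each sample carries a set of annotated visual elements for a given dimension and a judge that decides which elements a caption mentions and whether each mention is correct; all names and numbers are hypothetical illustrations, not the paper's code or data.

```python
# Hypothetical sketch of per-dimension precision and hit for caption evaluation.
# Assumes each annotated sample provides, for one dimension, a count of elements
# the caption mentions, how many of those are correct, and how many elements were
# annotated in total. This is not the CAPability implementation.

from dataclasses import dataclass

@dataclass
class CaptionJudgment:
    mentioned: int          # elements of this dimension the caption mentions
    correct: int            # mentioned elements that match the annotation
    total_annotated: int    # elements annotated for this dimension

def precision(judgments: list[CaptionJudgment]) -> float:
    """Of the elements the captions do mention, the fraction that are correct."""
    mentioned = sum(j.mentioned for j in judgments)
    correct = sum(j.correct for j in judgments)
    return correct / mentioned if mentioned else 0.0

def hit(judgments: list[CaptionJudgment]) -> float:
    """Of the annotated elements, the fraction the captions correctly cover."""
    annotated = sum(j.total_annotated for j in judgments)
    correct = sum(j.correct for j in judgments)
    return correct / annotated if annotated else 0.0

if __name__ == "__main__":
    # Two toy samples for a single dimension (e.g., object color).
    sample_judgments = [
        CaptionJudgment(mentioned=3, correct=2, total_annotated=4),
        CaptionJudgment(mentioned=1, correct=1, total_annotated=2),
    ]
    print(f"precision = {precision(sample_judgments):.2f}")  # 0.75
    print(f"hit       = {hit(sample_judgments):.2f}")        # 0.50
```

Under this reading, $K\bar{T}$ would contrast QA accuracy on the same annotations with the caption *hit* rate; its exact formula is not given in the abstract, so it is omitted from the sketch.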
- Zhihang Liu
- Chen-Wei Xie
- Bin Wen
- Feiwu Yu
- Jixuan Chen
- Pandeng Li
- Boqiang Zhang
- Nianzu Yang
- Yinglu Li
- Zuan Gao
- Yun Zheng
- Hongtao Xie