VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation (2505.23484v1)
Abstract: Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing the spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark, comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that have been empirically shown to be critical for text-to-video generation. We further introduce three metrics, Accuracy Rate (AR), Inconsistency Rate (IR), and Coverage Rate (CR), along with an automated evaluation pipeline that leverages an LLM to verify caption quality via contrastive QA-pair analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and code are available at https://github.com/GXYM/VCapsBench.
- Shi-Xue Zhang
- Hongfa Wang
- Duojun Huang
- Xin Li
- Xiaobin Zhu
- Xu-Cheng Yin
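
As a concrete illustration of how the three metrics relate, the sketch below computes AR, IR, and CR from per-QA-pair verdicts produced by an LLM verifier. This is a minimal sketch under stated assumptions: the verdict label set, the function name `caption_quality_metrics`, and the exact denominators (e.g., IR measured over covered pairs only) are hypothetical conventions for illustration, not the paper's released implementation.

```python
from collections import Counter

# Hypothetical verdict labels an LLM verifier might assign to each QA pair
# when checked against a candidate caption (assumed label set, not from the paper).
CONSISTENT, INCONSISTENT, NOT_COVERED = "consistent", "inconsistent", "not_covered"

def caption_quality_metrics(verdicts: list[str]) -> dict[str, float]:
    """Compute AR, IR, and CR from per-QA-pair verifier verdicts.

    Assumed definitions for this sketch:
      AR (Accuracy Rate): share of all QA pairs the caption answers correctly.
      IR (Inconsistency Rate): share of covered QA pairs the caption contradicts.
      CR (Coverage Rate): share of QA pairs the caption addresses at all.
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    covered = counts[CONSISTENT] + counts[INCONSISTENT]
    return {
        "AR": counts[CONSISTENT] / total if total else 0.0,
        "IR": counts[INCONSISTENT] / covered if covered else 0.0,
        "CR": covered / total if total else 0.0,
    }

# Example: verdicts for one caption over five annotated QA pairs.
print(caption_quality_metrics(
    [CONSISTENT, CONSISTENT, INCONSISTENT, NOT_COVERED, CONSISTENT]
))  # -> {'AR': 0.6, 'IR': 0.25, 'CR': 0.8}
```

Under these assumed definitions, a caption that is both complete and faithful pushes AR and CR toward 1 while driving IR toward 0, which matches the abstract's framing of the metrics as complementary signals for caption optimization.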