VDC-Agent: Evolving Video Captioners
- VDC-Agent is a closed-loop framework that uses agentic self-reflection and principle-guided evaluation to autonomously enhance video caption quality.
- It constructs the VDC-Agent-19K dataset from unlabeled videos through pairwise preference extraction, enabling curriculum-based direct preference optimization.
- Quantitative results show that VDC-Agent-7B outperforms baseline models with 49.08% accuracy and a 2.50 mean score on the VDC benchmark.
VDC-Agent refers to a closed-loop, self-evolving framework for Video Detailed Captioning (VDC) that leverages agentic self-reflection, curriculum-based preference optimization, and principle-guided feedback to autonomously refine multimodal LLMs for high-fidelity video caption generation. The VDC-Agent paradigm is instantiated in the system introduced by "VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection" (Wang et al., 24 Nov 2025), which establishes new methodologies for self-improvement and benchmarking in video captioning at scale, without requiring human annotations or teacher models.
1. Agentic Iterative Refinement Framework
The core innovation of VDC-Agent is its agentic closed-loop process that enables a captioner to iteratively improve itself without external supervision. The system cycles through the following sequence for each unlabeled video:
- Caption Generation: At time step $t$, the agent uses an underlying MLLM (Qwen2.5-VL-7B-Instruct) with prompt $p_t$ to generate a candidate caption $c_t$.
- Principle-Guided Evaluation: The system evaluates $c_t$ using a predefined set of textual principles (e.g., referencing camera motion, spatial arrangement, object interactions, temporal structure), producing a quality score $s_t$ and a suggestion for improvement.
- Prompt Update: If $s_t$ improves on the previous score or meets the target threshold $\tau$, the prompt is updated using the suggestion. If refinement leads to a regression ($s_t < s_{t-1}$), the system triggers a self-reflection module that inspects the prior chain-of-thought to amend the prompt and avoid cyclic degradations.
- Termination: The loop continues for up to $T$ rounds or until the score threshold $\tau$ is reached.
This process yields a trajectory of caption/score pairs $\{(c_t, s_t)\}$ per video, containing at most $T$ refinement rounds.
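The following minimal Python sketch illustrates this closed loop. The callables `generate`, `score_with_principles`, and `reflect`, as well as the round limit and score threshold values, are illustrative assumptions standing in for the captioner, the principle-guided scorer, and the self-reflection module; they are not the paper's actual interfaces or hyperparameters.

```python
def refine_caption(video, prompt, generate, score_with_principles, reflect,
                   max_rounds=5, target_score=4.5):
    """Run the agentic refinement loop for one video (placeholder defaults)."""
    trajectory = []
    prev_score = float("-inf")
    for t in range(max_rounds):
        caption = generate(video, prompt)                         # caption generation
        score, suggestion = score_with_principles(video, caption) # principle-guided evaluation
        trajectory.append((caption, score))
        if score >= target_score:                                 # termination on threshold
            break
        if score < prev_score:                                    # regression: self-reflection path
            prompt = reflect(prompt, caption, suggestion)
        else:                                                     # improvement: fold in the suggestion
            prompt = prompt + "\n" + suggestion
        prev_score = score
    return trajectory
```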
2. Autonomous Dataset Construction: VDC-Agent-19K
VDC-Agent constructs its own training data from unlabeled sources without human annotations. The data generation pipeline is as follows:
- Input Pool: 4,008 high-resolution, unlabeled videos from the Cockatiel-4K corpus.
- Task Dimensions: For each video, the agentic loop is run under five evaluation dimensions: camera, short, background, main-object, and detailed, yielding one trajectory per dimension.
- Trajectory Filtering: Trajectories with only a single caption (no refinement necessary) or with JSON parsing errors are removed.
- Pairwise Preferences: For each valid trajectory, the best and worst captions (by score) are extracted as a preference pair $(c^+, c^-)$, with the score gap $\Delta = s^+ - s^-$ serving as a difficulty measure (see the sketch below). This yields the VDC-Agent-19K dataset of 18,886 preference pairs, providing graded supervisory signals for subsequent model optimization.
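A minimal sketch of the pairwise-preference extraction, assuming each trajectory is stored as a list of (caption, score) tuples as in the loop sketch above; the function name and filtering details are illustrative.

```python
def extract_preference_pair(trajectory):
    """Return (chosen, rejected, delta) or None if the trajectory is unusable."""
    if len(trajectory) < 2:              # single-caption trajectories are filtered out
        return None
    best = max(trajectory, key=lambda x: x[1])
    worst = min(trajectory, key=lambda x: x[1])
    delta = best[1] - worst[1]           # score gap as a difficulty measure
    if delta <= 0:                       # no usable contrast between captions
        return None
    return best[0], worst[0], delta
```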
3. Curriculum-Based Direct Preference Optimization (DPO)
VDC-Agent advances from agentic data to model update via curriculum-guided DPO. The training methodology is:
- Loss Function: For each preference tuple $(x, c^+, c^-)$, DPO optimizes the policy $\pi_\theta$ to maximize preference for $c^+$ over $c^-$ relative to a fixed reference policy $\pi_{\text{ref}}$. The DPO loss is:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, c^+,\, c^-)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(c^+ \mid x)}{\pi_{\text{ref}}(c^+ \mid x)} - \beta \log \frac{\pi_\theta(c^- \mid x)}{\pi_{\text{ref}}(c^- \mid x)}\right)\right],$$

where $\sigma$ is the logistic sigmoid and $\beta$ controls the KL regularization strength.
- Curriculum Ordering: Preference examples are sorted by decreasing $\Delta$ (easy-to-hard), so large-contrast examples dominate early mini-batches, facilitating rapid convergence and decreasing gradient variance. Difficult cases refine discriminative capacity in later phases (see the sketch after this list).
- Tuning Regime: The backbone model (Qwen2.5-VL-7B-Instruct) is fine-tuned with LoRA adapters (rank=16, alpha=32, dropout=0.1). Optimization uses AdamW and a cosine-decaying learning rate.
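A minimal PyTorch-style sketch of the DPO objective and the curriculum ordering, assuming precomputed sequence log-probabilities of each caption under the policy and the frozen reference model; the function names, dictionary fields, and the default $\beta$ are illustrative assumptions, not the paper's implementation.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on summed token log-likelihoods of chosen/rejected captions."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid of the reward margin, averaged over the batch
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def curriculum_order(pairs):
    """Sort preference pairs by decreasing score gap (easy-to-hard curriculum)."""
    return sorted(pairs, key=lambda p: p["delta"], reverse=True)
```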
4. Principle-Guided Scoring and Prompt Self-Reflection
VDC-Agent's autonomous evaluation relies on an internal scorer, parameterized by a compact set of caption quality principles. These principles encode requirements such as reference to camera motion, scene layout, salience of main objects, background details, and temporal transitions.
- Prompt Refinement: After every scoring iteration, the system combines the scorer's suggestion with fixed instruction templates to update the prompting strategy. If performance regresses, the system's self-reflection path activates: the prompt-update mechanism receives both the degraded output and the prior chain-of-thought, so prompt inputs are conditioned on failure explanations. This generative chain-of-thought self-diagnosis distinguishes VDC-Agent from static or purely externally supervised schemes (an illustrative template sketch follows).
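A hypothetical illustration of how the principle-guided scoring and self-reflection prompts might be templated. The principle wording and prompt text below are assumptions for exposition, not the paper's actual prompts or principle set.

```python
# Assumed principle list covering the qualities described above.
PRINCIPLES = [
    "Describe camera motion (pans, zooms, cuts).",
    "Describe spatial arrangement and scene layout.",
    "Describe the main objects and their interactions.",
    "Describe background details.",
    "Describe temporal structure and transitions.",
]

# Scoring prompt: rate the caption against each principle, return a score and a suggestion.
SCORING_PROMPT = (
    "Rate the caption from 1 to 5 against each principle below and give one "
    "concrete suggestion for improvement.\n"
    + "\n".join(f"- {p}" for p in PRINCIPLES)
    + "\nCaption: {caption}"
)

# Self-reflection prompt: conditioned on the degraded output and the prior chain-of-thought.
REFLECTION_PROMPT = (
    "The revised caption scored lower than the previous one. Review the prior "
    "reasoning below, explain the likely cause of the regression, and rewrite "
    "the captioning prompt to avoid it.\n"
    "Prior chain-of-thought: {chain_of_thought}\n"
    "Degraded caption: {caption}"
)
```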
5. Evaluation Methodology and Quantitative Results
The final model (VDC-Agent-7B) is evaluated on the public VDC benchmark, which contains 1,027 test videos labeled for five granular dimensions via human-aligned annotation.
- Caption Quality Measurement: For each dimension, captions are scored by both accuracy and a mean rating (1–5 scale) from a held-out scorer.
- Aggregate Metrics: Primary metrics are the average accuracy and average score across the five dimensions (an aggregation sketch follows the table below).
- Comparative Performance: VDC-Agent-7B achieves 49.08% average accuracy and a 2.50 mean score, exceeding both its base model (Qwen2.5-VL-7B-Instruct: 43.95% / 2.23) and specialized captioners such as AVC-DPO-7B (47.70% / 2.47) and OwlCap-7B (46.90% / 2.40).
- Inference Efficiency: Model inference is a single forward pass per video on a single A800 GPU; by contrast, test-time agentic refinement would require multiple generation and scoring passes per video. VDC-Agent thus internalizes the benefits of agentic self-improvement into a conventional, low-cost inference pipeline (Wang et al., 24 Nov 2025).
| Model | Accuracy (avg) | Score (avg) |
|---|---|---|
| Qwen2.5-VL-7B-Instruct | 43.95 | 2.23 |
| VDC-Agent-7B | 49.08 | 2.50 |
| AVC-DPO-7B | 47.70 | 2.47 |
| OwlCap-7B | 46.90 | 2.40 |
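A minimal sketch of the metric aggregation, assuming per-dimension accuracy and mean score values are already available; the dimension keys and field names are illustrative.

```python
DIMENSIONS = ["camera", "short", "background", "main_object", "detailed"]

def aggregate(per_dim_results):
    """per_dim_results: {dim: {"acc": float, "score": float}} -> (avg acc, avg score)."""
    acc = sum(per_dim_results[d]["acc"] for d in DIMENSIONS) / len(DIMENSIONS)
    score = sum(per_dim_results[d]["score"] for d in DIMENSIONS) / len(DIMENSIONS)
    return acc, score
```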
6. Broader Implications and Limitations
VDC-Agent demonstrates that a self-evolving paradigm—combining agentic chains of generation, principle-guided scoring, and reflective prompt updates—can construct high-quality preference data at scale and leverage curriculum DPO for effective model alignment. The approach requires neither human labels nor external teacher models and is robust to noisy self-generated data through a combination of filtering and contrastive preference pairing.
A plausible implication is that this framework generalizes to other multimodal generative settings where clean labels are scarce but principle-based evaluation is viable. However, the effectiveness of VDC-Agent depends on the validity and expressiveness of its principle set, the internal consistency of automatic scoring, and the capacity of the underlying MLLM.
7. References
- "VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection" (Wang et al., 24 Nov 2025)