VDC-Agent: Evolving Video Captioners

Updated 25 November 2025
  • VDC-Agent is a closed-loop framework that uses agentic self-reflection and principle-guided evaluation to autonomously enhance video caption quality.
  • It constructs the VDC-Agent-19K dataset from unlabeled videos through pairwise preference extraction, enabling curriculum-based direct preference optimization.
  • Quantitative results show that VDC-Agent-7B outperforms baseline models with 49.08% accuracy and a 2.50 mean score on the VDC benchmark.

VDC-Agent refers to a closed-loop, self-evolving framework for Video Detailed Captioning (VDC) that leverages agentic self-reflection, curriculum-based preference optimization, and principle-guided feedback to autonomously refine multimodal LLMs for high-fidelity video caption generation. The VDC-Agent paradigm is instantiated in the system introduced by "VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection" (Wang et al., 24 Nov 2025), which establishes new methodologies for self-improvement and benchmarking in video captioning at scale, without requiring human annotations or teacher models.

1. Agentic Iterative Refinement Framework

The core innovation of VDC-Agent is its agentic closed-loop process that enables a captioner to iteratively improve itself without external supervision. The system cycles through the following sequence for each unlabeled video:

  1. Caption Generation: At time step $t$, the agent uses an underlying MLLM (Qwen2.5-VL-7B-Instruct) with prompt $p_t$ to generate candidate caption $y_t$.
  2. Principle-Guided Evaluation: The system evaluates $y_t$ using a predefined set of textual principles $R$ (e.g., referencing camera motion, spatial arrangement, object interactions, temporal structure), producing a quality score $s_t \in [0, 100]$ and suggestion $g_t$ for improvement.
  3. Prompt Update: If $s_t$ improves or meets the target threshold $\lambda$ (default $\lambda = 90$), the prompt is updated using the suggestion. If refinement leads to regression ($s_t < s_{t-1}$), the system triggers a self-reflection module that inspects the prior chain-of-thought to amend the prompt and avoid cyclic degradations.
  4. Termination: The loop continues for up to $T$ rounds (default $T = 4$) or until the score threshold is reached.

This process yields a trajectory $P(x) = \{(y_0, s_0), \ldots, (y_{T_v}, s_{T_v})\}$ of caption/score pairs per video, with $T_v \leq T$.
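
The control flow above can be summarized in a short sketch. The code below is a minimal illustration only: caption_fn, score_fn, refine_prompt, and reflect_fn are hypothetical stand-ins for the MLLM caption call, the principle-guided scorer, and the prompt-update and self-reflection modules, whose exact interfaces are not specified here.

```python
from typing import Callable, List, Tuple

def agentic_caption_loop(
    video: str,
    caption_fn: Callable[[str, str], str],          # hypothetical MLLM call: (video, prompt) -> caption
    score_fn: Callable[[str], Tuple[float, str]],   # hypothetical scorer: caption -> (score s_t, suggestion g_t)
    refine_prompt: Callable[[str, str], str],       # hypothetical: fold suggestion g_t into prompt p_t
    reflect_fn: Callable[[str, str, str], str],     # hypothetical: self-reflection on (prompt, caption, suggestion)
    base_prompt: str,
    max_rounds: int = 4,                            # T
    threshold: float = 90.0,                        # lambda
) -> List[Tuple[str, float]]:
    """Produce the caption/score trajectory P(x) for one unlabeled video."""
    trajectory: List[Tuple[str, float]] = []
    prompt, prev_score = base_prompt, float("-inf")
    for _ in range(max_rounds):
        caption = caption_fn(video, prompt)
        score, suggestion = score_fn(caption)
        trajectory.append((caption, score))
        if score >= threshold:
            break                                   # target quality reached
        if score < prev_score:
            # regression: amend the prompt using the failed attempt
            prompt = reflect_fn(prompt, caption, suggestion)
        else:
            prompt = refine_prompt(prompt, suggestion)
        prev_score = score
    return trajectory
```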

2. Autonomous Dataset Construction: VDC-Agent-19K

VDC-Agent constructs its own training data from unlabeled sources without human annotations. The data generation pipeline is as follows:

  • Input Pool: 4,008 high-resolution, unlabeled videos from the Cockatiel-4K corpus.
  • Task Dimensions: For each video, the agentic loop is run under five evaluation dimensions: camera, short, background, main-object, and detailed, yielding trajectories for each.
  • Trajectory Filtering: Trajectories with only a single caption (no refinement necessary) or with JSON parsing errors are removed.
  • Pairwise Preferences: For each valid trajectory, the best and worst captions (by score) are extracted as $(x, y^+, y^-, \Delta s)$ with $\Delta s = s^+ - s^-$ as a difficulty measure. This yields the VDC-Agent-19K dataset of 18,886 preference pairs, providing graded supervisory signals for subsequent model optimization.
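
A minimal sketch of this extraction step, assuming each trajectory is kept as a list of (caption, score) pairs produced by the loop in Section 1; the dictionary field names are illustrative rather than the dataset's actual schema.

```python
from typing import Dict, List, Optional, Tuple

def extract_preference_pair(
    video_id: str,
    trajectory: List[Tuple[str, float]],    # (caption, score) pairs from the agentic loop
) -> Optional[Dict]:
    """Turn one trajectory into an (x, y+, y-, delta_s) preference tuple.

    Trajectories with a single caption are dropped, mirroring the filtering
    described above; JSON-parsing checks are assumed to happen upstream.
    """
    if len(trajectory) < 2:
        return None
    best = max(trajectory, key=lambda cs: cs[1])
    worst = min(trajectory, key=lambda cs: cs[1])
    delta_s = best[1] - worst[1]
    if delta_s <= 0:                        # degenerate: no usable contrast
        return None
    return {
        "x": video_id,                      # input x (video plus task-dimension prompt)
        "chosen": best[0],                  # y+
        "rejected": worst[0],               # y-
        "delta_s": delta_s,                 # difficulty signal for curriculum ordering
    }
```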

3. Curriculum-Based Direct Preference Optimization (DPO)

VDC-Agent advances from agentic data to model update via curriculum-guided DPO. The training methodology is:

  • Loss Function: For each preference tuple $(x, y^+, y^-)$, DPO optimizes the policy $\pi_\theta$ to maximize preference for $y^+$ over $y^-$ relative to a fixed reference policy $\pi_{\rm ref}$. The DPO loss is:

$$
L_{\rm DPO}(\theta; x, y^+, y^-) = -\log \sigma\!\left[\, \beta \left( \log \frac{\pi_\theta(y^+ \mid x)}{\pi_\theta(y^- \mid x)} - \log \frac{\pi_{\rm ref}(y^+ \mid x)}{\pi_{\rm ref}(y^- \mid x)} \right) \right],
$$

where $\sigma$ is the logistic sigmoid and $\beta$ controls the KL regularization strength.

  • Curriculum Ordering: Preference examples are sorted by decreasing $\Delta s$ (easy-to-hard), so large-contrast examples dominate early mini-batches, facilitating rapid convergence and decreasing gradient variance. Difficult cases refine discriminative capacity in later phases (a short sketch of the loss and ordering follows this list).
  • Tuning Regime: The backbone model (Qwen2.5-VL-7B-Instruct) is fine-tuned with LoRA adapters (rank=16, alpha=32, dropout=0.1). Optimization uses AdamW and a cosine-decaying learning rate.
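
For concreteness, a minimal PyTorch sketch of the loss and the curriculum sort is given below; the per-sequence log-probabilities are assumed to be summed over caption tokens, and the beta value shown is an illustrative assumption rather than the paper's reported setting.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_pos: torch.Tensor,   # log pi_theta(y+ | x), summed over tokens, shape [B]
    policy_logp_neg: torch.Tensor,   # log pi_theta(y- | x)
    ref_logp_pos: torch.Tensor,      # log pi_ref(y+ | x), frozen reference model
    ref_logp_neg: torch.Tensor,      # log pi_ref(y- | x)
    beta: float = 0.1,               # KL-regularization strength (assumed value)
) -> torch.Tensor:
    """Batched DPO loss corresponding to the equation above."""
    policy_logratio = policy_logp_pos - policy_logp_neg
    ref_logratio = ref_logp_pos - ref_logp_neg
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

def curriculum_order(pairs: list) -> list:
    """Easy-to-hard ordering: large-contrast pairs (big delta_s) come first."""
    return sorted(pairs, key=lambda p: p["delta_s"], reverse=True)
```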

4. Principle-Guided Scoring and Prompt Self-Reflection

VDC-Agent's autonomous evaluation relies on an internal scorer, parameterized by a compact set of caption quality principles. These principles encode requirements such as reference to camera motion, scene layout, salience of main objects, background details, and temporal transitions.

  • Prompt Refinement: After every scoring iteration, the system combines suggestion $g_t$ with fixed instruction templates to update the prompting strategy. If performance regresses, the system's self-reflection path activates: the prompt update mechanism receives both the degraded output and the prior chain-of-thought, so the next prompt is conditioned on an explanation of the failure. This generative chain-of-thought self-diagnosis distinguishes VDC-Agent from static or purely externally-supervised schemes.
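
The rubric construction and the two prompt-update paths can be illustrated as follows; the principle phrasings and instruction templates are paraphrased placeholders, since the paper's exact prompts are not reproduced in this summary.

```python
# Illustrative only: principle wording and templates are paraphrased placeholders.
PRINCIPLES = [
    "Describe camera motion and viewpoint changes.",
    "Describe the spatial arrangement and scene layout.",
    "Describe the main objects and their interactions.",
    "Describe background details.",
    "Describe temporal structure and transitions.",
]

def build_scoring_prompt(caption: str) -> str:
    """Assemble the principle-guided evaluation request for the internal scorer."""
    rubric = "\n".join(f"- {p}" for p in PRINCIPLES)
    return (
        "Score the following video caption from 0 to 100 against these principles "
        f"and suggest one concrete improvement:\n{rubric}\n\nCaption:\n{caption}"
    )

def next_prompt(prev_prompt: str, suggestion: str,
                score: float, prev_score: float,
                chain_of_thought: str) -> str:
    """Normal path folds the suggestion in; the regression path also exposes
    the prior chain-of-thought so the update is conditioned on the failure."""
    if score < prev_score:
        return (
            f"{prev_prompt}\n\nThe previous attempt regressed. Reasoning that led to it:\n"
            f"{chain_of_thought}\nRevise the approach accordingly. Also: {suggestion}"
        )
    return f"{prev_prompt}\n\nAdditional guidance: {suggestion}"
```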

5. Evaluation Methodology and Quantitative Results

The final model (VDC-Agent-7B) is evaluated on the public VDC benchmark, which contains 1,027 test videos labeled for five granular dimensions via human-aligned annotation.

  • Caption Quality Measurement: For each dimension, captions are scored by both accuracy and a mean rating (1–5 scale) from a held-out scorer.
  • Aggregate Metrics: Primary metrics are average accuracy ($\text{Accuracy}_{\rm avg}$) and average score ($\text{Score}_{\rm avg}$) across the five dimensions (a short averaging sketch follows the results table below).
  • Comparative Performance: VDC-Agent-7B achieves $49.08\%$ accuracy and a $2.50$ mean score, exceeding both its base model (Qwen2.5-VL-7B-Instruct: $43.95\%$ / $2.23$) and specialized captioners such as AVC-DPO-7B ($47.70\%$ / $2.47$) and OwlCap-7B ($46.90\%$ / $2.40$).

Inference Efficiency: Model inference is a single forward pass (mean $15.5\,\mathrm{s}$ per video on one A800 GPU); by contrast, test-time agentic refinement would require roughly $165\,\mathrm{s}$ per video. VDC-Agent internalizes the benefits of agentic self-improvement into a conventional, low-cost inference pipeline (Wang et al., 24 Nov 2025).

Model                   | Accuracy (avg, %) | Score (avg, 1–5)
Qwen2.5-VL-7B-Instruct  | 43.95             | 2.23
VDC-Agent-7B            | 49.08             | 2.50
AVC-DPO-7B              | 47.70             | 2.47
OwlCap-7B               | 46.90             | 2.40
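
The aggregate numbers above are averages over the five dimensions; the sketch below assumes an unweighted mean, which is consistent with the metric description but not explicitly confirmed here.

```python
from statistics import mean
from typing import Dict, Tuple

def aggregate(per_dimension: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Average (accuracy, score) over the five VDC dimensions
    (camera, short, background, main-object, detailed)."""
    accs, scores = zip(*per_dimension.values())
    return mean(accs), mean(scores)
```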

6. Broader Implications and Limitations

VDC-Agent demonstrates that a self-evolving paradigm—combining agentic chains of generation, principle-guided scoring, and reflective prompt updates—can construct high-quality preference data at scale and leverage curriculum DPO for effective model alignment. The approach requires neither human labels nor external teacher models and is robust to noisy self-generated data through a combination of filtering and contrastive preference pairing.

A plausible implication is that this framework generalizes to other multimodal generative settings where clean labels are scarce but principle-based evaluation is viable. However, the effectiveness of VDC-Agent depends on the validity and expressiveness of its principle set, the internal consistency of automatic scoring, and the capacity of the underlying MLLM.

7. References

Wang et al., "VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection," 24 November 2025.
