VDC-Agent: Evolving Video Captioners

Updated 25 November 2025
  • VDC-Agent is a closed-loop framework that uses agentic self-reflection and principle-guided evaluation to autonomously enhance video caption quality.
  • It constructs the VDC-Agent-19K dataset from unlabeled videos through pairwise preference extraction, enabling curriculum-based direct preference optimization.
  • Quantitative results show that VDC-Agent-7B outperforms baseline models with 49.08% accuracy and a 2.50 mean score on the VDC benchmark.

VDC-Agent refers to a closed-loop, self-evolving framework for Video Detailed Captioning (VDC) that leverages agentic self-reflection, curriculum-based preference optimization, and principle-guided feedback to autonomously refine multimodal LLMs for high-fidelity video caption generation. The VDC-Agent paradigm is instantiated in the system introduced by "VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection" (Wang et al., 24 Nov 2025), which establishes new methodologies for self-improvement and benchmarking in video captioning at scale, without requiring human annotations or teacher models.

1. Agentic Iterative Refinement Framework

The core innovation of VDC-Agent is its agentic closed-loop process that enables a captioner to iteratively improve itself without external supervision. The system cycles through the following sequence for each unlabeled video:

  1. Caption Generation: At time step $t$, the agent uses an underlying MLLM (Qwen2.5-VL-7B-Instruct) with prompt $p_t$ to generate candidate caption $y_t$.
  2. Principle-Guided Evaluation: The system evaluates $y_t$ using a predefined set of textual principles $R$ (e.g., referencing camera motion, spatial arrangement, object interactions, temporal structure), producing a quality score $s_t \in [0, 100]$ and suggestion $g_t$ for improvement.
  3. Prompt Update: If $s_t$ improves or meets the target threshold $\lambda$ (default $\lambda = 90$), the prompt is updated using the suggestion. If refinement leads to regression ($s_t < s_{t-1}$), the system triggers a self-reflection module that inspects the prior chain-of-thought to amend the prompt and avoid cyclic degradations.
  4. Termination: The loop continues for up to $T$ rounds (default $T = 4$) or until the score threshold is reached.

This process yields a trajectory $P(x) = \{(y_0, s_0), \ldots, (y_{T_v}, s_{T_v})\}$ of caption/score pairs per video, with $T_v \leq T$.
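
The control flow above can be summarized in a short sketch. The code below is a minimal illustration only: caption_fn, score_fn, refine_prompt, and reflect_fn are hypothetical stand-ins for the MLLM caption call, the principle-guided scorer, and the prompt-update and self-reflection modules, whose exact interfaces are not specified here.

```python
from typing import Callable, List, Tuple

def agentic_caption_loop(
    video: str,
    caption_fn: Callable[[str, str], str],          # hypothetical MLLM call: (video, prompt) -> caption
    score_fn: Callable[[str], Tuple[float, str]],   # hypothetical scorer: caption -> (score s_t, suggestion g_t)
    refine_prompt: Callable[[str, str], str],       # hypothetical: fold suggestion g_t into prompt p_t
    reflect_fn: Callable[[str, str, str], str],     # hypothetical: self-reflection on (prompt, caption, suggestion)
    base_prompt: str,
    max_rounds: int = 4,                            # T
    threshold: float = 90.0,                        # lambda
) -> List[Tuple[str, float]]:
    """Produce the caption/score trajectory P(x) for one unlabeled video."""
    trajectory: List[Tuple[str, float]] = []
    prompt, prev_score = base_prompt, float("-inf")
    for _ in range(max_rounds):
        caption = caption_fn(video, prompt)
        score, suggestion = score_fn(caption)
        trajectory.append((caption, score))
        if score >= threshold:
            break                                   # target quality reached
        if score < prev_score:
            # regression: amend the prompt using the failed attempt
            prompt = reflect_fn(prompt, caption, suggestion)
        else:
            prompt = refine_prompt(prompt, suggestion)
        prev_score = score
    return trajectory
```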

2. Autonomous Dataset Construction: VDC-Agent-19K

VDC-Agent constructs its own training data from unlabeled sources without human annotations. The data generation pipeline is as follows:

  • Input Pool: 4,008 high-resolution, unlabeled videos from the Cockatiel-4K corpus.
  • Task Dimensions: For each video, the agentic loop is run under five evaluation dimensions: camera, short, background, main-object, and detailed, yielding trajectories for each.
  • Trajectory Filtering: Trajectories with only a single caption (no refinement necessary) or with JSON parsing errors are removed.
  • Pairwise Preferences: For each valid trajectory, the best and worst captions (by score) are extracted as $(x, y^+, y^-, \Delta s)$ with $\Delta s = s^+ - s^-$ as a difficulty measure. This yields the VDC-Agent-19K dataset of 18,886 preference pairs, providing graded supervisory signals for subsequent model optimization.
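
A minimal sketch of this extraction step, assuming each trajectory is kept as a list of (caption, score) pairs produced by the loop in Section 1; the dictionary field names are illustrative rather than the dataset's actual schema.

```python
from typing import Dict, List, Optional, Tuple

def extract_preference_pair(
    video_id: str,
    trajectory: List[Tuple[str, float]],    # (caption, score) pairs from the agentic loop
) -> Optional[Dict]:
    """Turn one trajectory into an (x, y+, y-, delta_s) preference tuple.

    Trajectories with a single caption are dropped, mirroring the filtering
    described above; JSON-parsing checks are assumed to happen upstream.
    """
    if len(trajectory) < 2:
        return None
    best = max(trajectory, key=lambda cs: cs[1])
    worst = min(trajectory, key=lambda cs: cs[1])
    delta_s = best[1] - worst[1]
    if delta_s <= 0:                        # degenerate: no usable contrast
        return None
    return {
        "x": video_id,                      # input x (video plus task-dimension prompt)
        "chosen": best[0],                  # y+
        "rejected": worst[0],               # y-
        "delta_s": delta_s,                 # difficulty signal for curriculum ordering
    }
```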

3. Curriculum-Based Direct Preference Optimization (DPO)

VDC-Agent advances from agentic data to model update via curriculum-guided DPO. The training methodology is:

  • Loss Function: For each preference tuple $(x, y^+, y^-)$, DPO optimizes the policy $\pi_\theta$ to maximize preference for $y^+$ over $y^-$ relative to a fixed reference policy $\pi_{\rm ref}$. The DPO loss is:

$$
L_{\rm DPO}(\theta; x, y^+, y^-) = -\log \sigma\!\left[\, \beta \left( \log \frac{\pi_\theta(y^+ \mid x)}{\pi_\theta(y^- \mid x)} - \log \frac{\pi_{\rm ref}(y^+ \mid x)}{\pi_{\rm ref}(y^- \mid x)} \right) \right],
$$

where $\sigma$ is the logistic sigmoid and $\beta$ controls the KL regularization strength.

  • Curriculum Ordering: Preference examples are sorted by decreasing $\Delta s$ (easy-to-hard), so large-contrast examples dominate early mini-batches, facilitating rapid convergence and decreasing gradient variance. Difficult cases refine discriminative capacity in later phases (a short sketch of the loss and ordering follows this list).
  • Tuning Regime: The backbone model (Qwen2.5-VL-7B-Instruct) is fine-tuned with LoRA adapters (rank=16, alpha=32, dropout=0.1). Optimization uses AdamW and a cosine-decaying learning rate.
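
For concreteness, a minimal PyTorch sketch of the loss and the curriculum sort is given below; the per-sequence log-probabilities are assumed to be summed over caption tokens, and the beta value shown is an illustrative assumption rather than the paper's reported setting.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_pos: torch.Tensor,   # log pi_theta(y+ | x), summed over tokens, shape [B]
    policy_logp_neg: torch.Tensor,   # log pi_theta(y- | x)
    ref_logp_pos: torch.Tensor,      # log pi_ref(y+ | x), frozen reference model
    ref_logp_neg: torch.Tensor,      # log pi_ref(y- | x)
    beta: float = 0.1,               # KL-regularization strength (assumed value)
) -> torch.Tensor:
    """Batched DPO loss corresponding to the equation above."""
    policy_logratio = policy_logp_pos - policy_logp_neg
    ref_logratio = ref_logp_pos - ref_logp_neg
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

def curriculum_order(pairs: list) -> list:
    """Easy-to-hard ordering: large-contrast pairs (big delta_s) come first."""
    return sorted(pairs, key=lambda p: p["delta_s"], reverse=True)
```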

4. Principle-Guided Scoring and Prompt Self-Reflection

VDC-Agent's autonomous evaluation relies on an internal scorer, parameterized by a compact set of caption quality principles. These principles encode requirements such as reference to camera motion, scene layout, salience of main objects, background details, and temporal transitions.

  • Prompt Refinement: After every scoring iteration, the system combines suggestion $g_t$ with fixed instruction templates to update the prompting strategy. If performance regresses, the system's self-reflection path activates: the prompt update mechanism receives both the degraded output and the prior chain-of-thought, so the next prompt is conditioned on an explanation of the failure. This generative chain-of-thought self-diagnosis distinguishes VDC-Agent from static or purely externally-supervised schemes.
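
The rubric construction and the two prompt-update paths can be illustrated as follows; the principle phrasings and instruction templates are paraphrased placeholders, since the paper's exact prompts are not reproduced in this summary.

```python
# Illustrative only: principle wording and templates are paraphrased placeholders.
PRINCIPLES = [
    "Describe camera motion and viewpoint changes.",
    "Describe the spatial arrangement and scene layout.",
    "Describe the main objects and their interactions.",
    "Describe background details.",
    "Describe temporal structure and transitions.",
]

def build_scoring_prompt(caption: str) -> str:
    """Assemble the principle-guided evaluation request for the internal scorer."""
    rubric = "\n".join(f"- {p}" for p in PRINCIPLES)
    return (
        "Score the following video caption from 0 to 100 against these principles "
        f"and suggest one concrete improvement:\n{rubric}\n\nCaption:\n{caption}"
    )

def next_prompt(prev_prompt: str, suggestion: str,
                score: float, prev_score: float,
                chain_of_thought: str) -> str:
    """Normal path folds the suggestion in; the regression path also exposes
    the prior chain-of-thought so the update is conditioned on the failure."""
    if score < prev_score:
        return (
            f"{prev_prompt}\n\nThe previous attempt regressed. Reasoning that led to it:\n"
            f"{chain_of_thought}\nRevise the approach accordingly. Also: {suggestion}"
        )
    return f"{prev_prompt}\n\nAdditional guidance: {suggestion}"
```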

5. Evaluation Methodology and Quantitative Results

The final model (VDC-Agent-7B) is evaluated on the public VDC benchmark, which contains 1,027 test videos labeled for five granular dimensions via human-aligned annotation.

  • Caption Quality Measurement: For each dimension, captions are scored by both accuracy and a mean rating (1–5 scale) from a held-out scorer.
  • Aggregate Metrics: Primary metrics are average accuracy ($\text{Accuracy}_{\rm avg}$) and average score ($\text{Score}_{\rm avg}$) across the five dimensions (a short averaging sketch follows the results table below).
  • Comparative Performance: VDC-Agent-7B achieves $49.08\%$ accuracy and a $2.50$ mean score, exceeding both its base model (Qwen2.5-VL-7B-Instruct: $43.95\%$ / $2.23$) and specialized captioners such as AVC-DPO-7B ($47.70\%$ / $2.47$) and OwlCap-7B ($46.90\%$ / $2.40$).

Inference Efficiency: Model inference is a single forward pass (mean $15.5\,\mathrm{s}$ per video on one A800 GPU); by contrast, test-time agentic refinement would require roughly $165\,\mathrm{s}$ per video. VDC-Agent internalizes the benefits of agentic self-improvement into a conventional, low-cost inference pipeline (Wang et al., 24 Nov 2025).

Model                   | Accuracy (avg, %) | Score (avg, 1–5)
Qwen2.5-VL-7B-Instruct  | 43.95             | 2.23
VDC-Agent-7B            | 49.08             | 2.50
AVC-DPO-7B              | 47.70             | 2.47
OwlCap-7B               | 46.90             | 2.40
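
The aggregate numbers above are averages over the five dimensions; the sketch below assumes an unweighted mean, which is consistent with the metric description but not explicitly confirmed here.

```python
from statistics import mean
from typing import Dict, Tuple

def aggregate(per_dimension: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Average (accuracy, score) over the five VDC dimensions
    (camera, short, background, main-object, detailed)."""
    accs, scores = zip(*per_dimension.values())
    return mean(accs), mean(scores)
```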

6. Broader Implications and Limitations

VDC-Agent demonstrates that a self-evolving paradigm—combining agentic chains of generation, principle-guided scoring, and reflective prompt updates—can construct high-quality preference data at scale and leverage curriculum DPO for effective model alignment. The approach requires neither human labels nor external teacher models and is robust to noisy self-generated data through a combination of filtering and contrastive preference pairing.

A plausible implication is that this framework generalizes to other multimodal generative settings where clean labels are scarce but principle-based evaluation is viable. However, the effectiveness of VDC-Agent depends on the validity and expressiveness of its principle set, the internal consistency of automatic scoring, and the capacity of the underlying MLLM.

7. References

Wang et al., "VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection," 24 November 2025.
