- The paper introduces a dual-policy interactive framework that leverages an MLLM and iterative user feedback to refine video dataset collection.
- The system employs attribute-aware rejection and template-based acceptance policies to filter and rank video clips, achieving superior IoU scores compared to traditional methods.
- Experimental results show that VC-Agent accelerates dataset curation, enhances downstream tasks like text-to-video generation, and scales efficiently with minimal user input.

VC-Agent: An Interactive Agent for Customized Video Dataset Collection
Introduction and Motivation
The VC-Agent framework addresses the escalating demand for scalable, high-quality, and domain-specific video datasets, which are essential for advancing video understanding, generation, and related tasks. Manual curation of such datasets is prohibitively labor-intensive, especially when requirements are complex or highly personalized. VC-Agent introduces an interactive, multimodal LLM (MLLM)-based agent that iteratively refines user requirements and automates the retrieval and filtering of relevant video clips from web-scale sources, minimizing user effort while maximizing dataset quality and specificity.
System Architecture
VC-Agent is architected as a two-part system: a user-facing interface and a backend agent function module.
User Interface and Interaction Design
The web-based UI is designed for accessibility and efficiency, allowing untrained users to operate the system with minimal friction. Users input queries, review returned video samples, and provide feedback through simple retain/discard actions and optional comments. The system tracks all interactions and updates filtering policies accordingly.
Figure 2: The web-based user interface, enabling users to submit queries, review video samples, and provide feedback for iterative refinement.
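To make the interaction loop concrete, the per-round feedback such a UI collects can be represented with a small record type. This schema (`Feedback`, `InteractionRound`, `Verdict`) is a hypothetical sketch of ours, not the paper's:

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    RETAIN = "retain"
    DISCARD = "discard"

@dataclass
class Feedback:
    """A single user judgment on a returned clip (hypothetical schema)."""
    clip_id: str
    verdict: Verdict
    comment: str = ""  # optional free-text note, e.g. "cat should lie down"

@dataclass
class InteractionRound:
    """All feedback from one round; the filtering policies update from this."""
    round_id: int
    feedback: list[Feedback] = field(default_factory=list)

# Example round: one clip retained, one discarded with a comment.
round1 = InteractionRound(1, [
    Feedback("clip_017", Verdict.RETAIN),
    Feedback("clip_042", Verdict.DISCARD, comment="cat should lie down"),
])
```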
Video Retrieval and Grounding
Upon receiving a user query, VC-Agent generates search keywords via LLM prompting and employs a web crawler to retrieve candidate videos. To address the prevalence of irrelevant content, the system applies temporal and spatial grounding models (TFVTG and Grounding DINO) to extract segments most relevant to the query. Candidate videos are then ranked by similarity to the user’s requirements using MLLM-based descriptors.
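A skeleton of this retrieval stage is sketched below. Every callable is a placeholder for the corresponding component in the paper (LLM keyword prompting, the web crawler, TFVTG/Grounding DINO grounding, MLLM-based ranking); the function names and signatures are assumptions made for illustration:

```python
from typing import Callable

def collect_candidates(
    query: str,
    gen_keywords: Callable[[str], list[str]],  # LLM prompt -> search keywords
    crawl: Callable[[str], list[str]],         # keyword -> candidate video URLs
    ground: Callable[[str, str], list[dict]],  # (video, query) -> relevant segments
    score: Callable[[dict, str], float],       # MLLM similarity of segment vs. query
) -> list[dict]:
    """Keywords -> crawl -> temporal/spatial grounding -> similarity ranking."""
    segments: list[dict] = []
    for keyword in gen_keywords(query):
        for url in crawl(keyword):
            segments.extend(ground(url, query))  # keep only query-relevant spans
    # Rank grounded segments by MLLM-based similarity to the user's requirements.
    return sorted(segments, key=lambda seg: score(seg, query), reverse=True)
```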
Iterative Filtering Policies
Attribute-Aware Rejection Policy
User comments on rejected videos are parsed to extract attribute-specific rejection criteria (e.g., "no black cat", "cat should lie down"). These are summarized into a negative standard table. For each candidate video, the agent uses MLLM to generate attribute-specific descriptions and compares them against the table, discarding videos that match any negative criteria.
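A minimal sketch of that rejection check, assuming the negative standard table maps attribute names to forbidden descriptions; the `matches` predicate stands in for the paper's MLLM-based comparison (naive substring matching here, purely for illustration):

```python
from typing import Callable

def is_rejected(
    attributes: dict[str, str],            # MLLM-generated attribute descriptions
    negative_table: dict[str, list[str]],  # attribute -> rejection criteria
    matches: Callable[[str, str], bool] =
        lambda desc, crit: crit.lower() in desc.lower(),
) -> bool:
    """Discard a clip if any attribute description matches a negative criterion."""
    for attr, criteria in negative_table.items():
        description = attributes.get(attr, "")
        if any(matches(description, crit) for crit in criteria):
            return True
    return False

# Table distilled from comments such as "no black cat".
table = {"color": ["black"], "pose": ["standing"]}
print(is_rejected({"color": "a black cat on a sofa", "pose": "lying down"}, table))  # True
```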
Template-Based Acceptance Policy
Accepted videos are described by MLLM and aggregated into positive criterion templates. Candidate videos passing the rejection filter are compared against these templates, and those with high similarity are retained.
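A sketch of the acceptance check, under the assumption that descriptions of accepted videos are embedded as vectors and compared by cosine similarity; the threshold and names are illustrative, not taken from the paper:

```python
import numpy as np

def accept(clip_emb: np.ndarray,
           template_embs: list[np.ndarray],
           threshold: float = 0.75) -> bool:
    """Retain a clip if it is sufficiently similar to any positive template."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return any(cos(clip_emb, tmpl) >= threshold for tmpl in template_embs)

# Toy 2-D example: the clip is close to the first template, so it is kept.
templates = [np.array([1.0, 0.0]), np.array([0.7, 0.7])]
print(accept(np.array([0.9, 0.1]), templates))  # True
```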
Policy Update and Double-Check Strategy
Policies are updated in each interaction round, incorporating new confirmations and comments. To enhance robustness, a double-check strategy is employed: videos with low confidence scores are buffered and periodically presented to the user for additional review, focusing attention on ambiguous attributes.
Figure 3: Overview of the user-assisted double-check strategy, which improves robustness by soliciting user feedback on low-confidence samples.
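The buffering logic behind this double-check strategy might look like the following sketch; the confidence thresholds are illustrative assumptions:

```python
def route(scored_clips: list[tuple[str, float]],
          lo: float = 0.4, hi: float = 0.8):
    """Split clips into auto-accept / auto-reject / double-check buckets."""
    accepted, rejected, buffer = [], [], []
    for clip, confidence in scored_clips:
        if confidence >= hi:
            accepted.append(clip)
        elif confidence <= lo:
            rejected.append(clip)
        else:
            buffer.append(clip)  # deferred: shown to the user in a double-check round
    return accepted, rejected, buffer

accepted, rejected, buffer = route([("c1", 0.95), ("c2", 0.55), ("c3", 0.20)])
print(buffer)  # ['c2'] -> queued for user review
```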
Benchmarking and Experimental Results
VC-Agent is evaluated on a newly proposed Personalized Video Collection Benchmark (PVB), comprising 10,000 annotated videos across three domains, each with five distinct requirements. The benchmark is designed to test the agent’s ability to handle complex, multi-attribute queries beyond the scope of traditional video retrieval tasks.
Quantitative results demonstrate that VC-Agent consistently outperforms state-of-the-art video retrieval and MLLM baselines (e.g., InternVideo-L, VAST, GRAM, LLaVA-OneVision-7B), especially as the number and specificity of requirements increase. The agent maintains high IoU scores even under stringent multi-attribute constraints, whereas baseline methods degrade rapidly.
Figure 4: Progressive filtering results as requirements are incrementally added, showing VC-Agent’s superior performance compared to MLLM and video retrieval baselines.
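For reference, the IoU in these comparisons can be read as the overlap between the collected clip set and the benchmark's ground-truth set; a minimal sketch, assuming this set-level formulation:

```python
def iou(retrieved: set[str], ground_truth: set[str]) -> float:
    """Set-level IoU: |retrieved ∩ ground_truth| / |retrieved ∪ ground_truth|."""
    union = retrieved | ground_truth
    return len(retrieved & ground_truth) / len(union) if union else 1.0

print(iou({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```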
Ablation Studies
Ablation experiments confirm the necessity of each module. Removing either the rejection or acceptance policy, attribute-aware or template-based mechanisms, or iterative updates results in significant performance drops, especially under complex requirements. The double-check strategy further improves accuracy and robustness.
Real-World User Studies
Eight participants constructed over 335,000 video samples across diverse domains using VC-Agent, with high satisfaction rates and substantial reductions in manual collection time. Typically, 10 interaction rounds (including 1–3 double-check rounds) suffice for high-quality dataset construction. System Usability Scale (SUS) scores indicate strong usability and user satisfaction.
Figure 5: SUS questionnaire results, indicating high user satisfaction with VC-Agent’s effectiveness and efficiency.
Downstream Task Impact
Finetuning state-of-the-art models on datasets collected by VC-Agent yields measurable improvements in specialized tasks:
- Text-to-Video Generation: Finetuning CogVideoX on VC-Agent-collected data improves CLIPScore, HPSv2, temporal-flickering, and subject-consistency metrics compared to both the baseline and data collected with LLaVA-OneVision-7B.
Figure 6: Qualitative results for text-to-video generation, showing improved realism and detail when finetuned on VC-Agent data.
Scalability, Efficiency, and Limitations
VC-Agent achieves high-quality dataset construction with minimal user input and scales efficiently to large datasets (up to 80k samples). The system incurs higher computational cost per video (1.53s) compared to traditional retrieval (0.42s), but this is offset by superior generalization and specificity. Limitations include reduced performance on highly complex or ambiguous requirements (e.g., multi-view shots, transient object movements), which are primarily constrained by current MLLM capabilities.
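As a back-of-envelope estimate from these figures (assuming per-video processing dominates total cost), screening 80k clips at 1.53 s each takes roughly 80,000 × 1.53 s ≈ 34 hours of compute, versus about 80,000 × 0.42 s ≈ 9.3 hours for traditional retrieval, a ~3.6× overhead.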
Figure 8: Failure cases observed during user studies, highlighting challenges in distinguishing highly realistic virtual scenes and transient object movements.
Implications and Future Directions
VC-Agent demonstrates that interactive, MLLM-driven agents can dramatically accelerate the construction of customized video datasets, enabling rapid adaptation to specialized domains and tasks. The dual-policy filtering mechanism and iterative user feedback loop are effective for capturing nuanced requirements. Future work may focus on improving MLLM grounding for complex attributes, optimizing computational efficiency, and extending the framework to additional modalities and annotation types.
Conclusion
VC-Agent establishes a robust paradigm for interactive, scalable, and high-quality video dataset collection. By integrating multimodal reasoning, iterative policy refinement, and user-centric design, it enables efficient construction of domain-specific datasets that directly benefit downstream model performance. The framework’s extensibility and empirical results suggest broad applicability across video-centric AI research and industry applications.