
Video-As-Prompt (VAP) Paradigm

Updated 27 October 2025
  • Video-As-Prompt (VAP) is an emerging paradigm that uses complete videos, segments, or derived representations to guide semantic analysis and controllable video synthesis.
  • It enables unified, zero-shot control across diverse tasks such as video understanding, robotics, and multimodal evaluation by replacing fixed prompts with dynamic video cues.
  • VAP frameworks leverage specialized architectures and curated datasets to enhance evaluation fidelity, semantic control, and efficient resource allocation in video-centric applications.

Video-As-Prompt (VAP) is an emerging paradigm that leverages whole videos, video segments, or video-derived representations as control signals or guides for learning, inference, and generative tasks. Prompting was initially developed for LLMs (text prompts) and later extended to images (visual prompts); VAP generalizes this methodology to enable semantic, structural, or functional control and measurement using video data. The approach is central not only to video understanding and analytics but also to controllable video generation, robotics, and multimodal system evaluation.

1. Foundations and Motivation

The key motivation for VAP lies in the stringent requirements of modern video-centric tasks: efficient semantic control, evaluation fidelity, action transferability, and cross-modal grounding cannot always be achieved by fixed prompts or per-frame data. Unlike structure-based or condition-specific methods—which enforce pixel-level alignment or require task-specific tuning—VAP treats a reference video, a segment, or abstracted cues (e.g., skeletons, subtitles, embedded trajectories) as direct in-context prompts for downstream models (Bian et al., 23 Oct 2025).

VAP seeks to resolve several longstanding challenges, such as:

  • Avoiding artifacts caused by mismatched pixel-wise priors in structure-based control.
  • Achieving semantic control scalable to diverse downstream conditions (style, motion, concept, camera dynamics).
  • Enabling zero-shot or in-context generalization using a unified model architecture.
  • Providing measurable, fine-grained evaluation signals tailored to the real variability and semantics of video content.

2. VAP in Video Analysis: Pipelines and Evaluation

Early instances of VAP framed video analytics pipelines as systems that combine DNN speedup/compression with video-specific heuristics (e.g., temporal or spatial pruning, model switching) (Xiao et al., 2021). Here, the "prompt" is not user input but the dynamic characteristics of the video itself, which guide pipeline adaptation and resource allocation for edge analytics.

Yoda, a benchmarking framework, systematizes VAP evaluation by:

  • Decomposing pipelines into independent resource-saving primitives (temporal, spatial, model).
  • Profiling each primitive's effect under a carefully curated set of video content characteristics (object density, speed, motion statistics).
  • Building a multiplicative performance clarity profile:

$$P_{\text{v}}(x) = P_{t, s^*, m^*}(x) \times P_{t^*, s, m^*}(x) \times P_{t^*, s^*, m}(x)$$

where each $P$ quantifies the impact of a primitive under oracle (most accurate) settings for the others, and $x$ denotes a content feature vector.

This methodology yields a lookup-table performance profile mapping video content to accuracy/cost tradeoffs, improving transparency and enabling rapid prediction of expected performance for novel videos (Xiao et al., 2021).
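
As an illustration only, the sketch below combines such per-primitive profiles multiplicatively into a lookup-table estimate; the profile structure, feature bins, and numbers are hypothetical and not taken from the Yoda implementation:

```python
# Illustrative sketch: combine per-primitive performance profiles multiplicatively,
# P_v(x) = P_{t,s*,m*}(x) * P_{t*,s,m*}(x) * P_{t*,s*,m}(x).
# Each hypothetical profile maps a discretized content-feature vector x to an
# (accuracy factor, cost factor) measured with the other primitives at oracle settings.

def combine_profiles(x, temporal_profile, spatial_profile, model_profile):
    """Return the combined (accuracy, cost) estimate for content features x."""
    acc, cost = 1.0, 1.0
    for profile in (temporal_profile, spatial_profile, model_profile):
        a, c = profile[x]
        acc *= a
        cost *= c
    return acc, cost

# x encodes assumed content-feature bins: (object density, object speed, motion level).
x = ("high_density", "slow", "low_motion")
temporal = {x: (0.97, 0.40)}   # temporal pruning: ~97% of oracle accuracy at 40% of the cost
spatial  = {x: (0.95, 0.60)}
model    = {x: (0.92, 0.50)}

print(combine_profiles(x, temporal, spatial, model))  # ~ (0.848, 0.12)
```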

3. VAP for Cross-Modal Reasoning and Grounding

In video question answering and temporal grounding, VAP provides mechanisms for translating video content into prompt signals for LLMs. For instance, VPTSL integrates timestamped subtitles with visual highlight features (derived by cross-modal attention over video frames and queries) into a pre-trained LLM; the full input sequence (question + subtitles + visual prompt tokens) allows the model to predict text span answers aligned with visual evidence (Li et al., 2022).

Formally, after joint encoding by an LLM (e.g., DeBERTa):

$$h = \text{DeBERTa}(x)$$

The answer span is selected by:

$$s = \operatorname{argmax}(\text{softmax}(W_1 h + b_1)), \quad e = \operatorname{argmax}(\text{softmax}(W_2 h + b_2))$$
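
A minimal PyTorch sketch of this span-selection step follows; it illustrates the formula above and is not the VPTSL implementation (the hidden size, sequence length, and batch size are assumed):

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Predict start/end indices of the answer span from encoder states h."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.start = nn.Linear(hidden_size, 1)   # W_1 h + b_1
        self.end = nn.Linear(hidden_size, 1)     # W_2 h + b_2

    def forward(self, h: torch.Tensor):
        # h: (batch, seq_len, hidden_size) from the joint text/visual-prompt encoder
        start_logits = self.start(h).squeeze(-1)    # (batch, seq_len)
        end_logits = self.end(h).squeeze(-1)
        s = start_logits.softmax(-1).argmax(-1)     # predicted start index
        e = end_logits.softmax(-1).argmax(-1)       # predicted end index
        return s, e

h = torch.randn(2, 128, 768)    # stand-in for DeBERTa outputs
print(SpanHead(768)(h))
```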

Significant empirical gains (e.g., +28.36% mIoU on MedVidQA) are attributed to the ability of visual prompts to bridge the semantic gap between vision and language (Li et al., 2022).

4. VAP in Controllable Video Generation

In generative video models, VAP enables unified, generalizable semantic control by treating a reference video as a prompt for synthesis. The VAP framework introduces a frozen Video Diffusion Transformer (DiT), responsible for target video synthesis, and a plug-and-play Mixture-of-Transformers (MoT) expert, which processes the tokens of the reference video prompt (Bian et al., 23 Oct 2025). A temporally biased position embedding ensures that in-context guidance from the reference avoids introducing spurious, unrealistic pixel-wise priors:

  • Temporal indices of reference tokens are shifted by an offset $\Delta$, ensuring correct ordering and preventing pixel-aligned assumptions (see the sketch below).
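
A minimal sketch of such a temporal-index shift, assuming integer frame indices and an illustrative offset; the released VAP code may implement the bias differently:

```python
import torch

def temporal_indices(num_target_frames: int, num_ref_frames: int, delta: int):
    """Assign temporal position indices so that reference-prompt tokens occupy a
    range strictly before the target frames, separated by an offset delta,
    rather than sharing indices with the target (which would imply a
    pixel-aligned prior)."""
    target_t = torch.arange(num_target_frames)                       # 0 .. T-1
    ref_t = torch.arange(num_ref_frames) - (num_ref_frames + delta)  # all indices < -delta
    return ref_t, target_t

ref_t, target_t = temporal_indices(num_target_frames=16, num_ref_frames=16, delta=4)
print(ref_t.tolist()[:3], target_t.tolist()[:3])  # disjoint index ranges, correct ordering
```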

The model learns the conditional distribution $p(x \mid c)$ under a diffusion process:

$$x_t = t x_1 + [1 - (1 - \sigma_{\min}) t] x_0, \quad V_t = x_1 - (1 - \sigma_{\min}) x_0$$

The objective:

$$\mathcal{L} = \mathbb{E}_{t, x_0, x_1, C} \left\| u_{\Theta}(x_t, t, C) - [x_1 - (1 - \sigma_{\min}) x_0] \right\|$$
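
A minimal sketch of an MSE form of this flow-matching objective, assuming latent video tensors of shape (B, C, T, H, W) and a placeholder velocity network; the interface of $u_{\Theta}$ and the value of $\sigma_{\min}$ are assumptions:

```python
import torch

def flow_matching_loss(velocity_model, x0, x1, cond, sigma_min: float = 1e-4):
    """MSE form of the objective: regress u_Theta(x_t, t, C) onto the target
    velocity V_t = x_1 - (1 - sigma_min) x_0 at a random timestep t."""
    b = x1.shape[0]
    t = torch.rand(b, 1, 1, 1, 1, device=x1.device)     # one t per sample, broadcast over (C, T, H, W)
    x_t = t * x1 + (1.0 - (1.0 - sigma_min) * t) * x0   # x_t = t x_1 + [1 - (1 - sigma_min) t] x_0
    v_target = x1 - (1.0 - sigma_min) * x0
    v_pred = velocity_model(x_t, t.flatten(), cond)     # placeholder interface for u_Theta
    return (v_pred - v_target).pow(2).mean()

# Stand-in usage: Gaussian noise x0, video latents x1 of shape (B, C, T, H, W).
velocity_model = lambda x, t, c: torch.zeros_like(x)    # dummy network for illustration
x0 = torch.randn(2, 4, 8, 16, 16)
x1 = torch.randn(2, 4, 8, 16, 16)
print(flow_matching_loss(velocity_model, x0, x1, cond=None))
```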

VAP's architecture achieves a new state-of-the-art user-preference rate among open-source models (38.7%) while handling a broad range of semantic control tasks (style, motion, camera, concept) in a single model (Bian et al., 23 Oct 2025).

5. Representational Innovations: Visual Action and Structural Prompts

To move beyond ambiguous high-level (text) action instructions or non-transferable low-level agent states, VAP approaches have rendered complex action signals or temporal sequences into visual prompt representations, most notably visual skeletons (Wang et al., 18 Aug 2025):

$$v_{1:T} = \mathcal{R}(a_{0:T-1}) \in \mathbb{R}^{T \times H \times W \times C}$$

where $\mathcal{R}$ is a rendering operator from action sequences $a_{0:T-1}$, and $v_{1:T}$ serves as a precise, domain-agnostic control for both human-object interaction (HOI) and robotic datasets. The encoded trajectory is injected into the generative model via a ControlNet-like adapter, ensuring geometric precision and cross-domain adaptability.
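
A minimal sketch of the rendering operator $\mathcal{R}$, rasterizing 2D joint trajectories into a skeleton prompt video; the joint count, resolution, and keypoint-blob drawing are illustrative assumptions rather than the paper's renderer:

```python
import numpy as np

def render_skeleton_video(actions, height=256, width=256):
    """Rasterize normalized 2D joint positions a_{0:T-1} of shape (T, J, 2)
    into a binary skeleton prompt video v_{1:T} of shape (T, H, W, 1)."""
    T, J, _ = actions.shape
    video = np.zeros((T, height, width, 1), dtype=np.float32)
    for t in range(T):
        for j in range(J):
            x = int(actions[t, j, 0] * (width - 1))
            y = int(actions[t, j, 1] * (height - 1))
            video[t, max(y - 2, 0):y + 3, max(x - 2, 0):x + 3, 0] = 1.0  # small keypoint blob
    return video

actions = np.random.rand(16, 17, 2)        # 16 frames, 17 joints (COCO-style layout assumed)
prompt_video = render_skeleton_video(actions)
print(prompt_video.shape)                  # (16, 256, 256, 1)
```

In a full pipeline, this prompt video would then be encoded and its features added to the backbone's intermediate activations via a ControlNet-style adapter.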

Experimental results indicate significant improvements in fidelity and dynamic accuracy across benchmark datasets (EgoVid, RT-1, DROID), outperforming both text and agent-centric controls (Wang et al., 18 Aug 2025).

6. Datasets and Benchmarks: Scaling and Evaluating VAP

The recent proliferation of large-scale prompt-gallery datasets (VidProM for text-to-video (Wang et al., 10 Mar 2024), TIP-I2V for image-to-video (Wang et al., 5 Nov 2024), VAP-Data for semantic control (Bian et al., 23 Oct 2025)) provides the foundation for empirically robust VAP research. These resources contain:

  • Millions of real user prompts (text or image pairs) with generated videos.
  • Detailed metadata (UUIDs, embeddings, NSFW scores), enabling nuanced studies of prompt diversity and performance.
  • Task-specific structures (e.g., VAP-Data’s semantic condition pairings, TIP-I2V’s subject/direction annotations) for targeted evaluation.
  • Support for new research areas such as prompt engineering, video copy detection, efficient generation via prompt retrieval, and model safety auditing.

These datasets shift quantitative analysis toward more realistic, user-driven or content-driven prompt paradigms, supporting both controlled generation and content-grounded evaluation.
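
As one example of the prompt-retrieval use case these galleries enable, the following is a minimal sketch of nearest-neighbor lookup over precomputed prompt embeddings; the embedding dimensionality and similarity measure are assumptions, not the datasets' released tooling:

```python
import numpy as np

def retrieve_similar_prompts(query_emb, gallery_embs, gallery_ids, top_k=3):
    """Return the IDs of the gallery prompts whose embeddings are most similar
    (cosine similarity) to the query, e.g. to reuse an already generated video."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims)[:top_k]
    return [(gallery_ids[i], float(sims[i])) for i in order]

gallery_embs = np.random.randn(1000, 384)              # stand-in for precomputed prompt embeddings
gallery_ids = [f"prompt_{i}" for i in range(1000)]
query_emb = np.random.randn(384)
print(retrieve_similar_prompts(query_emb, gallery_embs, gallery_ids))
```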

7. Implications and Future Directions

VAP’s unification of prompting, control, and evaluation via video has the following broad implications:

  • Generalizable Semantic Control: By abstracting control away from pixel- or condition-specific priors, VAP architectures (e.g., with frozen generative backbones and dedicated prompt experts) allow zero-shot, unified guidance across multiple control types in video synthesis (Bian et al., 23 Oct 2025).
  • Content-Aware Analytics and Adaptation: Modular VAP evaluation frameworks (such as Yoda) enable operators to tailor system performance dynamically to the characteristics of incoming video, and suggest design of adaptive pipelines that alter strategies based on real-time content distribution (Xiao et al., 2021).
  • Cross-Domain and Multimodal Fusion: Visual prompts (e.g., skeletons for action-to-video, subtitles for QA) enable transfer learning and more precise cross-domain reasoning, benefiting tasks in robotics, video-language understanding, and generative modeling (Li et al., 2022, Zhu et al., 27 May 2025, Wang et al., 18 Aug 2025).
  • Limitations and Open Problems: The scalability of plug-and-play expert modules, reliance on synthetic datasets for certain control dimensions, and the efficiency of real-time fitting/inference remain active challenges (Bian et al., 23 Oct 2025). Extension to multi-modal, instruction-style, or process-level supervision (as in UniAPO (Zhu et al., 25 Aug 2025)) is a promising future direction.
  • Benchmarking Human-Model Interaction: Benchmarks such as V2P-Bench evaluate LVLMs' understanding of visual prompts in video, providing evidence that current models lag behind human performance when prompted by visual overlays, and highlighting the need for architectures better aligned with human cognitive grounding (Zhao et al., 22 Mar 2025).

In sum, VAP operationalizes video as a comprehensive, versatile prompt. This unification supports semantic control, adaptive analysis, and advanced generative modeling, and underpins a range of future research at the intersection of video understanding, multimodal learning, and human-computer interaction.
