Active Video Perception (AVP)
- Active Video Perception (AVP) is an agentic framework where systems actively control visual input to selectively gather essential evidence from high-dimensional video streams.
- It employs adaptive viewpoint selection, iterative evidence-seeking, and closed-loop decision making to optimize performance and reduce computational redundancy.
- AVP is applied in diverse domains such as streaming video QA, long-video understanding, and embodied robotic vision, leading to enhanced efficiency and accuracy.
Active Video Perception (AVP) designates a paradigm in which a perception agent dynamically controls its visual input in order to achieve task-relevant understanding from high-dimensional, temporally evolving video streams. Unlike passive video analysis, where the perceptual pipeline is fixed or query-agnostic, AVP agents actively determine what, when, where, and how to observe, acquiring evidence in closed-loop interaction with the video or environment. This agentic perspective pervades contemporary approaches to long-video understanding, streaming interaction, and embodied robotic vision, and is characterized by iterative evidence-seeking, adaptive viewpoint selection, asynchronous processing, and learning-based policies for sensor or attention control.
1. Conceptual Foundations of Active Video Perception
Active Video Perception arises from the recognition that video data is both voluminous and redundant, with informative content often sparsely distributed across time and space. The agentic AVP framework generalizes the "active perception" principle from classic robotics—where agents move sensors to uncertainty-reducing positions (Satsangi et al., 2020)—to settings where sensor motion, temporal sampling, or query-driven feature selection can all be modulated.
Key distinctions from traditional paradigms include:
- Closed-loop evidence acquisition: Rather than exhaustive offline captioning or dense pre-processing, AVP policies iteratively seek the most informative observations for a given task or query, trading off accuracy against resource cost (Wang et al., 5 Dec 2025).
- Embodied and streaming interaction: AVP is central in settings where the agent must act or react in real time, e.g., streaming video QA (Qian et al., 6 Jan 2025), embodied robotic perception (Yang et al., 19 Nov 2025, Chuang et al., 26 Sep 2024), or manipulation with dynamic viewpoints (Chuang et al., 26 Sep 2024).
- Decision-making under partial observation: AVP is naturally formalized as a (belief-based) partially observed Markov decision process, with rewards reflecting reduction in task uncertainty (Satsangi et al., 2020).
This paradigm shift enables substantial improvements in computational efficiency, precision, and generalization across diverse video-understanding domains.
2. Formalization and Algorithmic Paradigms
Active Video Perception is formally encapsulated via policies over sequential actions in a POMDP framework:
$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} R(s_t, a_t)\right], \qquad a_t \sim \pi(\cdot \mid o_{1:t}),$$
where $s_t$ represents the latent video/environmental state, $a_t$ comprises sensory actions (e.g., sampling plans, viewpoint changes), $o_t$ denotes observations (frames, clips), and $R$ encodes task utility versus computational/resource costs (Wang et al., 5 Dec 2025).
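As a concrete illustration of this trade-off, a minimal sketch of a reward that balances task utility against observation cost is given below; the linear form and the weights `lambda_frames` and `lambda_tokens` are illustrative assumptions, not quantities from the cited work.

```python
def avp_reward(task_utility: float, frames_read: int, tokens_used: int,
               lambda_frames: float = 0.01, lambda_tokens: float = 1e-4) -> float:
    """Reward for one episode of active perception: task utility (e.g., answer
    correctness) minus a linear penalty on the sensing/compute resources used."""
    return task_utility - lambda_frames * frames_read - lambda_tokens * tokens_used
```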
Contemporary AVP algorithms employ a variety of closed-loop structures:
- Plan–Observe–Reflect Loops: Iteratively choose query-driven plans, extract targeted evidence, then reflect on sufficiency, continuing or halting as appropriate (Wang et al., 5 Dec 2025).
- Disentangled Perception–Decision–Reaction Pipelines: Stream processing is decoupled into real-time perception, decision (whether to respond), and reaction (generating responses), with asynchronous scheduling to minimize latency bottlenecks (Qian et al., 6 Jan 2025).
- Perception Loop Reasoning (PLR): Alternating cycles in which the agent selects a temporal/spatial region to describe, then decides whether justification is sufficient or further evidence is required. An anti-hallucination module evaluates factual consistency at each step (Pu et al., 23 Nov 2025).
- Multi-modal, autoregressive integration: Embodied AVP systems (e.g., EyeVLA, AV-ALOHA) jointly tokenize and autoregress over image, language, and action trajectories, enabling viewpoint control and vision–language–action fusion in a single transformer backbone (Yang et al., 19 Nov 2025, Chuang et al., 26 Sep 2024).
Pseudocode for the iterative evidence-seeking process appears across all AVP methods:
```python
history, sufficient_evidence = [], False
while not sufficient_evidence:
    plan = planner(history, query)                     # decide what/when/where to observe next
    evidence = observer(plan)                          # extract the targeted frames/clips
    history.append(evidence)                           # accumulate gathered evidence
    sufficient_evidence = reflector(evidence, query)   # judge whether evidence suffices
```
3. System Architectures and Methodological Variants
AVP systems are architecturally diverse, reflecting their target domain:
| Application Domain | Core AVP Architecture | Notable Features |
|---|---|---|
| Streaming Video QA | Disentangled (Perception / Decision / Reaction) | Asynchronous LLM, temporal retrieval |
| Long-Video Understanding | Plan–Observe–Reflect loop | Query-driven, adaptive sampling |
| Embodied Robotic Vision (EyeVLA) | Transformer with vision/language/action tokenization | RL vision policy, token action space |
| Manipulation (AV-ALOHA) | Multi-arm, joint imitation-learning policy | Decoupled gaze/manipulator control |
Dispider divides the streaming loop into (a) perception, with scene-adaptive clip segmentation at 1 FPS using SigLIP embeddings and content-shift detection; (b) decision, using a lightweight LLM to determine whether a user-interactive response is merited; (c) reaction, where detailed autoregressive responses are triggered asynchronously. Temporal evidence retrieval is learned via KL-regularized distributions over clip indicator tokens (Qian et al., 6 Jan 2025).
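A minimal sketch of the content-shift segmentation idea follows, assuming per-second frame embeddings (e.g., from SigLIP) are already computed; the cosine-similarity criterion and `shift_threshold` value are illustrative assumptions rather than Dispider's exact rule.

```python
import numpy as np

def segment_clips(frame_embeddings: np.ndarray, shift_threshold: float = 0.85) -> list[tuple[int, int]]:
    """Split a 1 FPS stream into clips whenever consecutive frame embeddings drift
    apart, i.e., their cosine similarity drops below a threshold.

    frame_embeddings: (T, D) array of per-frame (per-second) embeddings.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    normed = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    clips, start = [], 0
    for t in range(1, len(normed)):
        if float(normed[t] @ normed[t - 1]) < shift_threshold:  # content shift detected
            clips.append((start, t))
            start = t
    clips.append((start, len(normed)))
    return clips
```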
EyeVLA introduces a "robotic eyeball" gimbal and zoom system, with action behaviors discretized into a small set of action tokens appended to a transformer-based vision–LLM. Training involves supervised trajectory imitation and RL with group relative policy optimization, using both real and synthetic demonstration data. The integration of action tokens with vision-language modeling enables open-world camera control and fine-grained object localization in real robotic scenes (Yang et al., 19 Nov 2025).
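A minimal sketch of discretizing a continuous pan/tilt/zoom command into token ids is given below; the bin count and value ranges are illustrative assumptions, and a full system would offset these ids into the model's extended vocabulary.

```python
def discretize_camera_action(pan_deg: float, tilt_deg: float, zoom: float,
                             n_bins: int = 64,
                             pan_range=(-90.0, 90.0),
                             tilt_range=(-45.0, 45.0),
                             zoom_range=(1.0, 8.0)) -> list[int]:
    """Map a continuous pan/tilt/zoom command onto three discrete action-token ids
    so they can be appended to the vision-language token stream."""
    def to_bin(value, lo, hi):
        value = min(max(value, lo), hi)                      # clamp to the valid range
        return int(round((value - lo) / (hi - lo) * (n_bins - 1)))

    return [to_bin(pan_deg, *pan_range),
            to_bin(tilt_deg, *tilt_range),
            to_bin(zoom, *zoom_range)]
```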
AV-ALOHA trains a coupled manipulation–camera policy with multi-headed transformers, where the active vision policy dynamically selects the best viewpoint independently of manipulator trajectories. Demonstrations are provided through VR teleoperation, streaming stereo video from a 7-DoF vision arm (Chuang et al., 26 Sep 2024).
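A minimal sketch of a shared-trunk, two-headed policy is shown below; an MLP trunk stands in for the multi-headed transformer, and the layer sizes and 14-dimensional manipulator action space are illustrative assumptions (the 7-DoF vision-arm head follows the setup described above).

```python
import torch
import torch.nn as nn

class SharedTrunkPolicy(nn.Module):
    """Shared observation encoder with separate action heads for the
    manipulator arms and the active-vision (camera) arm."""

    def __init__(self, obs_dim: int = 512, hidden_dim: int = 256,
                 manip_action_dim: int = 14, vision_action_dim: int = 7):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.manip_head = nn.Linear(hidden_dim, manip_action_dim)    # manipulator joints
        self.vision_head = nn.Linear(hidden_dim, vision_action_dim)  # 7-DoF vision arm

    def forward(self, obs: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        features = self.trunk(obs)
        return self.manip_head(features), self.vision_head(features)
```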
4. AVP in Long-Video and Streaming Understanding
Long video understanding (LVU) presents unique challenges: evidence is dispersed, queries can require temporally precise reasoning, and exhaustive processing is prohibitive. AVP frameworks specifically counteract these obstacles through selective, adaptive observation.
- Iterative Evidence Seeking: AVP agents plan, observe, and reflect until an evidence-sufficiency criterion is met, e.g., a confidence threshold or a maximum number of rounds (Wang et al., 5 Dec 2025); a minimal stopping-rule sketch follows this list.
- Fine-grained Retrieval: By grounding all evidence in explicit timestamps and query context, AVP retains spatiotemporal precision that generic captioners or static pipelines lack (Wang et al., 5 Dec 2025, Pu et al., 23 Nov 2025).
- Efficient Resource Use: AVP typically reduces inference time and token budget while achieving higher task accuracy relative to state-of-the-art agentic baselines, with statistical significance confirmed by McNemar's test on multiple benchmarks (Wang et al., 5 Dec 2025).
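The sketch below illustrates such a stopping rule under the assumption that the reflector returns a scalar sufficiency confidence; the threshold and round budget are illustrative.

```python
def gather_evidence(planner, observer, reflector, query,
                    confidence_threshold: float = 0.8, max_rounds: int = 6):
    """Run plan-observe-reflect rounds until the reflector is confident enough
    or the round budget is exhausted; returns the accumulated evidence."""
    history = []
    for _ in range(max_rounds):
        plan = planner(history, query)
        evidence = observer(plan)
        history.append(evidence)
        confidence = reflector(history, query)     # scalar sufficiency estimate
        if confidence >= confidence_threshold:
            break
    return history
```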
The Perception Loop Reasoning paradigm further addresses hallucination by integrating a Factual-Aware Evaluator module, which computes anti-hallucination rewards on each evidence segment, calibrated via a large, synthetic/hallucinated caption dataset (AnetHallu-117K). Empirically, this approach cuts hallucination rates by $15$–$27$ points on VideoHallucer and HEAVEN benchmarks (Pu et al., 23 Nov 2025).
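The internals of the Factual-Aware Evaluator are not reproduced here; the sketch below assumes a generic `verifier` that scores how well a statement is supported by the retrieved evidence, and uses the supported fraction as an anti-hallucination reward term.

```python
def anti_hallucination_reward(statements, evidence_frames, verifier,
                              support_threshold: float = 0.5) -> float:
    """Fraction of evidence-grounded statements that the verifier judges to be
    supported by the retrieved frames; usable as a reward term."""
    if not statements:
        return 0.0
    supported = sum(
        1 for s in statements
        if verifier(s, evidence_frames) >= support_threshold
    )
    return supported / len(statements)
```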
5. Embodied and Robotic AVP
Physical agents implementing AVP must jointly optimize for spatial coverage, fine-grained observation, and downstream task performance under budget and latency constraints.
- Viewpoint Control and Action Tokenization: As in EyeVLA, camera pan/tilt/zoom is discretized to tokens and handled within the same transformer model as language and vision, enabling joint planning, fusion, and “where-to-look-next” reasoning (Yang et al., 19 Nov 2025).
- Policy Optimization: RL with reward components for viewpoint accuracy (IoU, mean absolute error in angles/zoom) and cross-modal efficiency ensures that the vision action policy converges to information-rich and cost-effective behaviors (Yang et al., 19 Nov 2025); a minimal reward sketch follows this list.
- Imitation Learning: AV-ALOHA demonstrates that VR-guided human demonstrations are highly effective for teaching coupled manipulation–camera policies. The vision arm, by decoupling gaze from manipulation, significantly boosts performance on occlusion-sensitive or high-precision tasks (Chuang et al., 26 Sep 2024).
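A minimal sketch of a viewpoint reward combining these components is given below; the box format, the linear combination, and the weights are illustrative assumptions rather than EyeVLA's exact reward.

```python
def viewpoint_reward(pred_box, gt_box, pred_ptz, gt_ptz,
                     w_iou: float = 1.0, w_mae: float = 0.1) -> float:
    """Reward a camera action by how well the target is framed (IoU of the target
    box) minus the mean absolute error of the pan/tilt/zoom commands."""
    # Intersection-over-union of axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    iou = inter / union if union > 0 else 0.0

    # Mean absolute error over the (pan, tilt, zoom) command vector.
    mae = sum(abs(p - g) for p, g in zip(pred_ptz, gt_ptz)) / len(pred_ptz)
    return w_iou * iou - w_mae * mae
```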
Design guidelines emerging from these systems include: AVP is beneficial for tasks involving occlusion or fine object attributes; minimal but mobile viewpoints outperform static over-instrumented setups; and alignment between camera action control and downstream model capacity is critical for robust generalization (Chuang et al., 26 Sep 2024, Yang et al., 19 Nov 2025).
6. Computational and Theoretical Underpinnings
Active perception tasks are naturally addressed within a POMDP framework with belief-dependent rewards, yet naïvely using entropy or uncertainty-based objectives breaks structural properties needed by solvers. Several developments address these challenges:
- PWLC-Preserving Formulations: Both ρPOMDP and POMDP-IR preserve the piecewise-linear and convex (PWLC) structure of the value function, via belief-based reward approximation or by decomposing prediction and observation actions (Satsangi et al., 2020).
- Greedy, Scalable Planning: Greedy Point-Based Value Iteration (PBVI) exploits submodularity under sensor-independence, providing theoretical performance guarantees and dramatically improved scaling with sensor/action set size (Satsangi et al., 2020).
- Submodularity Results: Under appropriate independence conditions, AVP value functions are submodular in the sensing actions, ensuring that simple greedy maximization remains near-optimal via recursive application of the Nemhauser et al. $(1 - 1/e)$ bound (Satsangi et al., 2020); a greedy-selection sketch follows this list.
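A minimal sketch of greedy selection under a sensor budget is shown below, assuming access to a monotone submodular set function `value`; under the independence conditions above, the greedy choice inherits the near-optimality guarantee.

```python
def greedy_sensor_selection(candidate_sensors, value, budget: int):
    """Greedily pick up to `budget` sensors, each time adding the sensor with the
    largest marginal gain in the (monotone submodular) value function."""
    selected = []
    remaining = list(candidate_sensors)
    for _ in range(min(budget, len(remaining))):
        best = max(remaining, key=lambda s: value(selected + [s]) - value(selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```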
Empirical results on simulated and real multi-camera multi-object tracking demonstrate that POMDP-IR with greedy PBVI can outperform coverage baselines with strong computational benefits, particularly under resource constraints and dynamic environments (Satsangi et al., 2020).
7. Current Limitations and Future Prospects
While AVP brings substantial benefits across video reasoning, streaming, and embodied vision, current frameworks also exhibit limitations:
- Offline vs. Online Processing: Current state-of-the-art instantiations (e.g., AVP for LVU, PLR) largely operate on offline, pre-stored video; closed-loop online/streaming AVP remains an active research area (Wang et al., 5 Dec 2025, Qian et al., 6 Jan 2025).
- Planning Policy Learning: Many systems rely on prompt-based or heuristic planners; end-to-end learned, reinforcement-trained, or uncertainty-calibrated planners are open directions (Wang et al., 5 Dec 2025, Yang et al., 19 Nov 2025).
- Temporal/Spatial Granularity: Coarse sampling schemes risk missing short-duration events; integrating more sophisticated action/attention models may address this (Wang et al., 5 Dec 2025).
- Compute Bottlenecks: Heavy reliance on transformer-based VL/LLM backbones in robotic or real-time deployments motivates research into lightweight architectures or early-exit pipelines (Yang et al., 19 Nov 2025).
- Hallucination and Reasoning Integrity: Explicit anti-hallucination reward and factual evaluation are crucial, but mechanisms for cross-modal and consistency checking remain limited (Pu et al., 23 Nov 2025).
Future research avenues include reinforcement-learned planners, real-time multimodal fusion, multi-camera or multi-agent AVP, joint visual–audio action, and persistent long-horizon evidence accumulation. The ongoing convergence of agentic video understanding, streaming interaction, and embodied visual control ensures that AVP will remain central to advances in scalable, robust video perception.