EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Published 24 Mar 2026 in cs.CV, cs.AI, and cs.CL | (2603.22918v1)

Abstract: Video understanding with multimodal LLMs (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.

Authors (9)

Summary

The paper introduces a planning-before-perception RL framework that dynamically selects visual tokens to enhance long video question-answering accuracy.
It employs a three-stage training pipeline (SFT, KTO, GRPO) to optimize efficiency and performance, yielding 6–12% accuracy gains with significantly fewer frames.
The approach demonstrates robust transfer on diverse long-form video benchmarks, providing a scalable blueprint for adaptive, agentic multimodal systems.

Context and Motivation

Significant challenges persist in leveraging MLLMs for video understanding, especially for long-context video QA and reasoning tasks. The primary computational bottleneck derives from the vast sequence length of visual tokens, temporal dependencies, and redundant frames, which compound the inefficiencies of conventional paradigms centered on passive perception or uniform sampling. Traditional agentic video frameworks have introduced tool-based interaction, yet their workflows are typically static, perception-first, or lack adaptive planning. Such limitations make them suboptimal for context-sensitive, efficient video analytics. The EVA framework addresses these bottlenecks by reframing video understanding as an active, query-driven process with integrated planning, perception, and reasoning.

Agentic Planning-Before-Perception

EVA proposes an active, iterative summary–plan–action–reflection paradigm to achieve fine-grained, interpretable, and highly efficient video analysis. Unlike classical approaches that immediately encode long frame sequences, EVA first reasons over the query to decide what, when, and how to observe, only then extracting visual information in rounds with highly variable temporal and spatial granularity. This design enables strategic token allocation across the video timeline, with the flexibility to perform coarse-to-fine search, dense or sparse sampling, and autonomous adaptation to the task.

Figure 1: EVA leverages planning-before-perception, enabling coarse global scans followed by focused, high-resolution retrieval for precise answers on extremely long videos.

This paradigm tackles several core inefficiencies:

Avoidance of visual misguidance inherent in perception-first workflows, which are susceptible to irrelevant or misleading content.
Redundant computation minimization via active selection and dynamic adjustment of the visual frame budget.
Autonomy in workflow construction, transcending fixed pipelines with adaptive multi-turn planning and flexible tool parameterization.

Reinforcement Learning Pipeline

EVA’s three-stage training protocol comprises:

Supervised Fine-Tuning (SFT): Cold-start agent behavior induction with synthetic high-quality summary–plan–action–reflection trajectories, enabling early competence in tool calling, interleaved frame–text reasoning, and core planning schemas.
Kahneman–Tversky Optimization (KTO): Stability and robustness are achieved by learning preference-based policies over both successful and failed strategies. KTO labels and corrects typical failure cases, such as premature answer guessing or inefficient sampling, using single-sample preference signals (chosen/rejected).
Group Relative Policy Optimization (GRPO): Final online refinement via RL on open-ended and multiple-choice QA, using a composite reward that combines accuracy (ROUGE for open-ended questions, Completeness Self-Verification (CSV) for multiple-choice questions) with a formatting penalty that suppresses reward hacking through ungrounded response patterns.

Figure 2: EVA’s data pipeline and staged RL: SFT for cold-start, KTO for failure correction, and GRPO for scalable policy optimization.

This protocol yields policies that not only exploit flexible action spaces but also maintain sample-efficient reasoning across diverse QA tasks, with high generalization capacity for previously unseen video queries.
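
The composite reward described for the GRPO stage can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `format_penalty` value, and the whitespace tokenization are assumptions, and the CSV check (which the paper delegates to a judge model) is reduced here to a boolean `evidence_verified` flag supplied by the caller.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased, whitespace-split tokens (simplified)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def composite_reward(
    answer: str,
    reference: str,
    is_mcq: bool,
    well_formatted: bool,
    evidence_verified: bool = True,  # stand-in for the CSV judge (assumption)
    format_penalty: float = 0.5,     # hypothetical weight, not from the paper
) -> float:
    """Accuracy term minus a penalty when the output format is violated."""
    if is_mcq:
        correct = answer.strip().upper() == reference.strip().upper()
        accuracy = 1.0 if (correct and evidence_verified) else 0.0
    else:
        accuracy = rouge_l_f1(answer, reference)
    return accuracy - (0.0 if well_formatted else format_penalty)
```

In this sketch a correct multiple-choice answer earns nothing when the evidence check fails, which mirrors the stated goal of discouraging lucky guesses.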

Frame Selection Schema

Central to EVA’s agentic autonomy is a frame-selection tool with tunable parameters: temporal window, frame count, and spatial resolution. At each step, the agent emits a tuple specifying the exact content to retrieve, enabling compound behaviors such as global fast scans, local zoom-in, or spatial/temporal abstraction according to query demands. Contrary to prior tool-augmented agents bound by rigid usage (e.g., fixed sampling), EVA incrementally constructs context-appropriate plans.
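
To make the tool call concrete, here is a small sketch of what such a tuple might look like. The field names `start_time`, `end_time`, `nframes`, and `resize` are the ones listed elsewhere on this page; the class name, the reading of `resize` as a scale factor in (0, 1], and the bin-midpoint sampling rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameSelection:
    start_time: float  # seconds from the start of the video
    end_time: float    # seconds; must be greater than start_time
    nframes: int       # number of frames to sample in the window
    resize: float      # spatial scale factor, assumed to lie in (0, 1]

    def timestamps(self) -> list[float]:
        """Evenly spaced sample times (midpoints of nframes equal bins)."""
        if self.end_time <= self.start_time:
            raise ValueError("end_time must be greater than start_time")
        if self.nframes < 1:
            raise ValueError("nframes must be at least 1")
        step = (self.end_time - self.start_time) / self.nframes
        return [self.start_time + (i + 0.5) * step for i in range(self.nframes)]


# A coarse global scan followed by a local zoom-in (values are illustrative).
global_scan = FrameSelection(start_time=0.0, end_time=100.0, nframes=4, resize=0.25)
zoom_in = FrameSelection(start_time=40.0, end_time=50.0, nframes=5, resize=1.0)
```

A wide window with few low-resolution frames gives the "global fast scan"; a narrow window with denser, full-resolution frames gives the "local zoom-in".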

Empirical Analysis

Benchmark Results

EVA reports state-of-the-art results across six video understanding benchmarks, outperforming both general open-source MLLMs and adaptive agent baselines. Notably, it achieves 6–12% gains over vanilla MLLMs and 1–3% over prior agentic models while using far fewer visual tokens (roughly 3x fewer than Qwen2.5-VL on LSDBench, and more than 10x fewer frames than static methods on the long-form benchmarks).

Prominent findings include:

On LSDBench, EVA attains 51.8% accuracy with only 6.2k tokens, outperforming Qwen2.5-VL (49.2%, 21k tokens) and substantially surpassing baseline MLLMs constrained by brute-force dense sampling.
On long-form video tasks (LongVideoBench, MLVU, VideoMME, LVBench), EVA maintains top-tier accuracy (55.1%–60.5%) with 20–30 frames per video—a >10x reduction in frame count relative to static methods.
On Video-Holmes (zero-shot reasoning), EVA demonstrates robust transfer and competitive generalization, highlighting the efficacy of the planning-first paradigm for causal, social, and timeline inference tasks.
Figure 3: Distribution of dialogue rounds and visual tokens across models and benchmarks shows that EVA’s iterative, fine-grained perception yields higher accuracy with lower resource expenditure.
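
As a quick check on the LSDBench figures quoted above, the token saving and the accuracy gap work out as follows. This is simple arithmetic on the reported numbers, not an additional result, and it assumes `amsmath` for `\text`.

```latex
\[
\frac{21\text{k tokens}}{6.2\text{k tokens}} \approx 3.4\times \text{ fewer visual tokens},
\qquad
51.8\% - 49.2\% = 2.6 \text{ percentage points}.
\]
```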

Ablation and Behavioral Study

Ablation confirms the significance of a full SFT–KTO–GRPO schedule. SFT alone trains tool format compliance but yields low exploration; KTO and GRPO incrementally induce strategic, multi-round planning—with GRPO further reducing frame count while increasing meaningful reasoning steps.

Figure 4: EVA’s dynamic workflow generation enables case-specific adaptation in visual evidence gathering and tool invocation.

Analysis of token attribution shows that initial rounds perform global, low-resolution scans, followed by targeted, high-resolution zoom-in when necessary—demonstrating sophisticated, policy-driven allocation that avoids both under- and over-exploration.

Figure 5: EVA distributes its visual computation adaptively across rounds, unlike baselines that allocate frames uniformly upfront.

Practical and Theoretical Implications

EVA addresses the sampling–reasoning trade-off for long-horizon video QA by integrating high-level planning directly into the perception pipeline. This paradigm supports deployment on cost- and computation-sensitive platforms, offering the dual benefits of interpretability (the agent’s explicit plans and tool calls can be inspected) and resource efficiency. By linking video observation to dynamic, query-driven plans, EVA offers a scalable blueprint for future agentic multimodal systems that must generalize to arbitrary queries, long-context reasoning, and variable environmental demands.

On the theoretical axis, EVA provides evidence that active planning—operationalized as iterative summary–plan–action–reflection in a deep RL framework—outperforms both imitation-only and fixed-pipeline RL approaches. This motivates further exploration into continual learning, meta-reasoning (evolution of tool ecosystems), and cross-modal memory systems to augment long-term, autonomous multimodal agents.

Conclusion

EVA marks a material advance in autonomous video agent design, demonstrating that an RL-optimized, planning-first MLLM can exceed the performance and efficiency of static or weakly agentic baselines. Through staged training, open-ended RL, and a highly flexible action space, EVA achieves high accuracy on extensive long-form benchmarks with minimal visual computation. Nevertheless, dependence on engineered tool interfaces and the need for robust out-of-distribution reasoning persist as open challenges. Subsequent work should address dynamic toolset adaptation, self-improving loop architectures, and memory-augmented RL to further extend the generality of agentic video understanding (2603.22918).

Explain it Like I'm 14

What is this paper about?

This paper introduces EVA, a smart “video-watching” AI agent that doesn’t just passively look at every frame of a video. Instead, it plans first, then decides what parts to watch, when to watch them, and at what quality. The goal is to answer questions about very long videos quickly and accurately, without wasting time on unimportant parts.

What questions did the researchers ask?

The researchers focused on three big questions, explained in simple terms:

Can an AI learn to “skim” a long video and then zoom in on important moments to answer a question?
Can it plan before it looks—deciding what to watch based only on the question—and then adjust as it learns more?
Can it do all this efficiently (using fewer “tokens,” or small chunks of information) while still being accurate?

How did they do it?

EVA is built as an “agent” that works in a loop:

It starts with the question and makes a plan.
It chooses what part of the video to watch and how (for example: short clip vs. long clip, low vs. high resolution).
It watches, reflects on what it learned, and decides what to do next.
It repeats until it has enough evidence to answer.

Think of it like how you’d handle a 2-hour video when you only need to answer one question: first skim quickly to find the right section, then rewatch that part carefully in high quality.
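
The loop above can be sketched in a few lines. Everything here is a toy stand-in: `run_agent`, `toy_policy`, and `toy_sampler` are hypothetical names, the policy is hard-coded rather than a trained model, and the sampler returns a text description instead of real frames. The point is only to show the control flow of plan, look, reflect, repeat.

```python
from typing import Any, Callable

Step = dict[str, Any]


def run_agent(
    question: str,
    policy: Callable[[str, list[Step], bool], Step],
    sample_frames: Callable[..., Any],
    max_rounds: int = 4,
) -> tuple[str, list[Step]]:
    """Alternate between planning and looking until the policy answers."""
    history: list[Step] = []
    for round_idx in range(max_rounds):
        must_answer = round_idx == max_rounds - 1  # force an answer on the last round
        step = policy(question, history, must_answer)
        if step["action"] == "answer":
            return step["answer"], history
        observation = sample_frames(**step["tool_args"])
        history.append({**step, "observation": observation})
    raise RuntimeError("policy did not answer when required")


def toy_policy(question: str, history: list[Step], must_answer: bool) -> Step:
    """Hard-coded stand-in: skim the whole video, zoom in once, then answer."""
    if must_answer or len(history) >= 2:
        return {"action": "answer", "answer": f"answered after {len(history)} looks"}
    if not history:
        plan = "coarse scan of the full video"
        args = dict(start_time=0.0, end_time=7200.0, nframes=32, resize=0.25)
    else:
        plan = "zoom in on the candidate segment"
        args = dict(start_time=3000.0, end_time=3060.0, nframes=16, resize=1.0)
    return {"action": "look", "plan": plan, "tool_args": args}


def toy_sampler(start_time: float, end_time: float, nframes: int, resize: float) -> str:
    """Stand-in for a real frame extractor; returns a description only."""
    return f"{nframes} frames from {start_time:.0f}s to {end_time:.0f}s at scale {resize}"


answer, history = run_agent("When does the package arrive?", toy_policy, toy_sampler)
```

Note that the first call to the policy sees only the question, which is what "planning before perception" means in practice.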

The “tool” EVA uses

EVA has a flexible video-sampling tool it can call with settings like:

start_time and end_time (when in the video to look)
nframes (how many frames to sample)
resize (how sharp or blurry the frames should be)

This lets EVA first get a fast, low-res overview and then zoom in with more frames and higher resolution where it matters.
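
A rough back-of-the-envelope estimate shows why this matters for cost. The sketch below assumes that visual tokens scale with pixel area (so with `resize` squared) and uses the figure of about 650 tokens per full-resolution frame that the knowledge-gaps list on this page cites as an estimate; both are simplifications, and the function name and frame counts are illustrative.

```python
def estimate_visual_tokens(
    nframes: int,
    resize: float,
    tokens_per_full_frame: int = 650,  # estimate quoted on this page, not measured
) -> int:
    """Approximate token cost of one tool call, assuming cost ~ pixel area."""
    return round(nframes * tokens_per_full_frame * resize ** 2)


# Low-res overview of the whole video, then a high-res look at one segment.
overview = estimate_visual_tokens(nframes=32, resize=0.25)
zoom_in = estimate_visual_tokens(nframes=16, resize=1.0)

# Uniform dense sampling at full resolution, for comparison.
uniform = estimate_visual_tokens(nframes=128, resize=1.0)
```

Under these assumptions the two-step strategy costs 1,300 + 10,400 = 11,700 tokens, against 83,200 for 128 full-resolution frames.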

How EVA was trained (3 stages)

The team trained EVA in three steps so it could plan, watch, and learn like a skilled problem-solver.

Stage 1: Supervised Fine-Tuning (SFT)

Like learning by example.
A larger “teacher” model shows EVA how to:
- Write plans,
- Call the tool properly,
- Describe frames,
- Decide whether it has enough evidence to answer or should look more.

Stage 2: KTO (Kahneman–Tversky Optimization)

Like learning from common mistakes.
EVA studies both good and bad strategies (for example: guessing without enough evidence, or sampling too many/too few frames).
This helps it avoid wasteful or risky behavior.
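
What sets KTO apart is that each example carries its own good/bad label, so a failure does not need a matching success to be useful. The sketch below shows one way such data could be laid out; the class, field names, and example trajectories are hypothetical and are not taken from the EVA-KTO dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KTOExample:
    prompt: str      # question plus any context given to the agent
    completion: str  # one full trajectory, serialized as text
    desirable: bool  # True = chosen, False = rejected


def from_pair(prompt: str, chosen: str, rejected: str) -> list[KTOExample]:
    """Flatten one pairwise comparison into two independent labeled samples."""
    return [KTOExample(prompt, chosen, True), KTOExample(prompt, rejected, False)]


question = "When does the package arrive?"
dataset = from_pair(
    question,
    chosen="coarse scan, then zoom on the doorstep segment, then answer",
    rejected="answer immediately without sampling any frames",
)
# Unlike pairwise data, an extra failure can be added with no matching success.
dataset.append(KTOExample(question, "sample every frame at full resolution", False))

num_desirable = sum(ex.desirable for ex in dataset)
num_undesirable = len(dataset) - num_desirable
```

The two rejected examples correspond to the failure types named above: guessing without evidence and sampling too many frames.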

Stage 3: GRPO (Group Relative Policy Optimization)

Like practice with a score.
EVA tries different strategies and gets rewarded for correct, well-supported answers.
For multiple-choice questions, it only gets full credit if it both picks the right answer and clearly used the right frames.
For open-ended answers, it earns points based on how closely its answer matches the correct one (measured by text overlap scores like ROUGE).
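
The "group relative" part of GRPO means that each attempt is scored against the other attempts at the same question, rather than against a separate value model. A minimal sketch of that normalization step follows; the function name and `eps` value are assumptions, the use of the population standard deviation is one common choice, and the clipped policy-gradient objective and KL penalty that complete the algorithm are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question: two earned full reward, two earned none.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout gets the same reward, there is no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Rollouts that beat the group average receive positive advantages and are reinforced; those below it are discouraged.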

The team also built three high-quality datasets for these stages (EVA-SFT, EVA-KTO, and EVA-RL) so the training is stable and reproducible.

What did they find?

EVA was tested on six video benchmarks, including ones with very long videos. The main takeaways:

It’s more accurate: EVA improved scores by about 6–12% over regular multimodal models and by 1–3% over other “agent” methods.
It’s more efficient: Instead of watching tons of frames, EVA uses far fewer visual tokens by smartly choosing when and how to watch.
It works well on long videos: EVA is especially strong when the video is very long, where uniform sampling (just grabbing frames evenly) often misses key moments.
It generalizes: Even without special training for certain tests (zero-shot), EVA performed competitively on a challenging reasoning benchmark (Video-Holmes).

Why this matters: EVA shows that “planning before looking” beats “look at everything and hope for the best,” especially when videos are long and time is limited.

Why is this important?

Smarter, faster assistants: EVA could help summarize lectures, analyze sports games, or spot key events in security footage much faster.
Less compute, more impact: By watching fewer, better-chosen frames, it saves time and computing power.
More human-like strategy: EVA acts more like a thoughtful student—skimming, focusing, and double-checking—rather than a brute-force machine.
A step toward autonomous agents: This planning–action–reflection loop could be useful beyond videos, in any task where selective attention and efficient reasoning matter.

Simple caveat: EVA still relies on the tools and data it’s given, and very unusual or noisy questions can still be hard. Future work will explore more flexible tools and memory, so the agent can adapt even better over time.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of the key uncertainties and unexplored areas that remain after this work.

Generalization beyond predefined tools: EVA relies on a fixed frame-selection API (start_time, end_time, nframes, resize). It is unclear how policies transfer to unseen or richer tool ecosystems (e.g., ROI cropping, multi-crop, object tracking, optical flow, ASR) or how to automatically discover/adapt to new tool interfaces.
Missing modalities: The agent only reasons over frames; audio, subtitles/ASR, and motion cues (e.g., optical flow) are not integrated. It is unknown how incorporating these signals affects planning and accuracy for audio-dependent or fine-grained motion tasks.
No explicit budget-aware objective: The RL rewards optimize answer quality but do not penalize visual-token usage or latency directly. How to design and tune budget-sensitive rewards (e.g., cost per frame, per round, or wall-clock) to produce controllable efficiency–accuracy trade-offs remains open.
Reward reliability and bias:
- Open-ended rewards use ROUGE, which favors lexical overlap and may not reflect semantic correctness or grounding.
- The CSV reward uses the same base model as the judge, risking bias and overfitting to the judge’s preferences.
- How to build robust, calibrated, and independent reward models (or human-in-the-loop signals) for both factuality and grounding remains unresolved.
Reward hacking and hallucination: While the paper notes reward hacking, there is no formal assessment or automatic detection of unsupported answers. Systematic metrics and training-time safeguards against hallucinations and ungrounded reasoning are needed.
Data quality and leakage risk: Large portions of SFT/KTO/RL data are generated by an MLLM teacher. The extent of teacher hallucinations, label noise, and potential benchmark overlap (train–test leakage) is not quantified. Robust deduplication, data audits, and error analyses are lacking.
Scalability to very long or streaming video: Although EVA handles long videos, there is no evaluation on hours-long or live-stream settings, nor a streaming memory mechanism for continual processing. How to maintain performance with bounded memory and online updates is an open question.
Efficiency reporting: Token counts are estimated (e.g., “650 tokens per frame”), but actual wall-clock latency, FLOPs, memory, and encoder reuse/caching are not measured. A standardized, end-to-end efficiency evaluation is missing.
Stopping policy and uncertainty: The reflection step lacks a principled stopping rule or confidence calibration. How to decide when to answer versus continue exploration under uncertainty (and budget constraints) requires investigation.
Robustness to noise and tool failures: Sensitivity to compression artifacts, timestamp misalignment, dropped frames, or faulty tool outputs is not studied. Failure-mode analysis and robustness benchmarks are needed.
Comparative ablations: The paper does not isolate the contribution of individual tool knobs (nframes vs resize) or rigorously compare planning-before-perception against perception-first baselines under matched token/latency budgets.
Task coverage: Evaluation focuses on QA. Transfer to tasks requiring temporal grounding, dense captioning, summarization, action segmentation, or spatiotemporal detection remains untested.
Grounding metrics: There is no direct evaluation of whether selected frames genuinely support answers (e.g., temporal localization precision/recall, grounding fidelity). Developing and reporting grounding-specific metrics would enable better diagnosis.
Stability and sensitivity: The impact of key RL hyperparameters (e.g., KL coefficient, rollout count), initialization, and random seeds on convergence and final performance is not reported. Stability across runs and ablation of KTO vs alternative preference/RL methods remain open.
Model/backbone generality: Experiments use a single base model (Qwen2.5-VL-7B). How EVA’s policy and training pipeline transfer across model sizes and families (and whether larger vision encoders reduce reliance on tools) is not explored.
Dataset availability and reproducibility: While code/model are linked, it is unclear whether EVA-SFT/KTO/RL datasets (including teacher prompts/trajectories) will be released with licensing, curation, and deduplication details necessary for reproducibility.
Compute requirements: Training uses substantial resources (e.g., 32×H100 with multiple rollouts). Strategies for low-resource training, sample efficiency, and cost–performance trade-offs are not examined.
Action space limitations: EVA cannot explicitly crop regions, track entities across time, or request specialized analyses (OCR, face/pose tracking). Assessing the marginal gains of expanding the action space is an open design question.

Practical Applications

Immediate Applications

These applications can be prototyped or deployed with current capabilities (planning-before-perception, adjustable frame/time/resolution tool calls, and the SFT–KTO–GRPO training pipeline) while keeping a human-in-the-loop for high-stakes use.

Media, Software, and Content Workflows

Video QA and navigation assistant for editors and analysts
- Sector: Media production, sports, advertising, UX research
- What it does: Ask free-form questions (“When does player 12 score?”, “Find all wide shots of the skyline”) over long footage; EVA scans at low-res/high-fps for an overview, then zooms into relevant segments with higher resolution and fps, returning precise timestamps and evidence frames.
- Tools/products/workflows: NLE plug‑ins (Premiere/Final Cut/DaVinci), searchable “video QA” API with evidence-linked answers, highlight-reel pre-cuts, timeline auto-annotations.
- Assumptions/dependencies: GPU-backed inference; access to frame extraction/transcoding APIs; acceptance of ~55–68% benchmark accuracy implies human verification for critical edits.
Efficient lecture and tutorial indexing
- Sector: Education, corporate L&D
- What it does: Answer “Where is backprop explained?” or “Show the moment the code demo fails,” creating study bookmarks and summaries across multi-hour lectures.
- Tools/products/workflows: LMS integration; study-guide generator that logs evidence frames and timestamps; instructor dashboard for Q&A gaps.
- Assumptions/dependencies: Access to long-form lecture recordings; domain adaptation for technical content.
Screen-recording and bug triage assistant
- Sector: Software engineering, QA
- What it does: Locate “first error dialog,” “when did FPS drop below 30,” or “where was the misclick,” over hours of test videos.
- Tools/products/workflows: CI pipeline hook; Jira/GitHub bot that attaches evidence frames and steps; build gate for regression detection.
- Assumptions/dependencies: Screen capture quality; custom prompts/rewards for domain metrics (e.g., OCR checks).

Security, Retail, and Industrial Operations

Post-hoc incident search on CCTV and dashcams
- Sector: Security, insurance, transportation
- What it does: Query “any person entering emergency exit,” “vehicle lane change without signal,” or “pedestrian near crosswalk at night” and retrieve evidence without scanning every frame.
- Tools/products/workflows: VMS plug‑in for query-driven triage; insurer dashboard for crash forensics; FOIA/claims pre-screening packs with evidence frames.
- Assumptions/dependencies: Privacy/compliance governance; storage/compute budgets; accuracy limits necessitate human review.
Retail shelf audit and compliance checks (offline)
- Sector: Retail
- What it does: Find “out-of-stock moments,” “planogram violations,” and “open fridge doors > 30s” across long recordings.
- Tools/products/workflows: Audit reports with timestamped clips; exception queues for store associates.
- Assumptions/dependencies: Camera placement/quality; store policy integration; potential domain-specific fine-tuning.
Assembly-line quality checks from recorded video
- Sector: Manufacturing
- What it does: Identify anomalies (“missing screws on station 3,” “misalignment during press fit”) via targeted zoom after low-res sweeps.
- Tools/products/workflows: MES integration; shift-summaries with evidence; retraining on failure modes using KTO/GRPO loops.
- Assumptions/dependencies: Consistent lighting/angles; data security; curated failure exemplars.

Research and Academia

Reproducible agentic video RL baseline
- Sector: AI research, academia
- What it does: Provides an end-to-end framework (SFT → KTO → GRPO), tool schemas (start_time, end_time, nframes, resize), and datasets (EVA-SFT/KTO/RL) for studying planning-before-perception at scale.
- Tools/products/workflows: Baseline repo; ablation-ready datasets; reward shaping templates (CSV, ROUGE, format reward).
- Assumptions/dependencies: Access to teacher MLLMs for data generation; compute (multi-GPU) for GRPO.

Policy and Governance Pilots

Body-cam and traffic-cam triage for requests and audits (offline)
- Sector: Public safety, municipal services
- What it does: Speed FOIA responses by auto-locating requested events with evidence frames and a reasoning log (summary–plan–action–reflection).
- Tools/products/workflows: Evidence-linked reports; redaction pipeline integration (low-res sweeps first).
- Assumptions/dependencies: Privacy/chain-of-custody compliance; human oversight; policy for acceptable error rates.

Daily Life

Home security and personal media search
- Sector: Consumer
- What it does: “When did the package arrive?,” “Find moments the dog jumped on the couch,” “Locate the child’s first steps.”
- Tools/products/workflows: Smart-camera app feature; personal archival search with timestamped clips.
- Assumptions/dependencies: On-device or private-cloud compute; consent and privacy settings.
Dashcam event helper
- Sector: Consumer auto
- What it does: After a trip, automatically locate hard brakes, cut-ins, or near-misses and produce a concise, evidence-backed montage.
- Tools/products/workflows: Mobile companion app; claims export.
- Assumptions/dependencies: GPS/IMU sync helpful but not required; model robustness to motion blur/night scenes.

Long-Term Applications

These applications are feasible with further research, domain-specific datasets, stronger reliability guarantees, or real-time/edge scaling.

Healthcare and Life Sciences

Surgical and endoscopic video decision support
- Sector: Healthcare
- What it could do: Detect critical steps/events (bleeding onset, instrument insertion, polyp detection) in long procedures; summarize with evidence.
- Tools/products/workflows: OR integration; post-op review dashboards; training feedback loops using KTO on error trajectories.
- Assumptions/dependencies: Rigorous clinical validation; FDA/CE approval; de-identification and PHI safeguards; high-quality annotated datasets.
Longitudinal patient activity monitoring
- Sector: Digital health, elder care
- What it could do: Detect risky events (falls, wandering) from long in-home streams with compute-aware attention.
- Tools/products/workflows: Edge devices with low-res overview + targeted high-res capture; alert systems with evidence frames.
- Assumptions/dependencies: Strong privacy guarantees; on-device acceleration; reliable fall/anomaly labels.

Robotics and Autonomy

Embodied agents with active visual attention
- Sector: Robotics, logistics, home robotics
- What it could do: Use planning-before-perception to decide when/where/how to look (camera pan/zoom/fps), saving energy and bandwidth while executing tasks.
- Tools/products/workflows: Policy distillation from EVA loops; integration with perception stacks; self-improvement via Data-Enhanced GRPO on failure cases.
- Assumptions/dependencies: Real-time constraints; sim-to-real transfer; safety and calibration.
ADAS/AV log mining and incident forecasting
- Sector: Automotive
- What it could do: Efficient analysis of petabytes of driving video logs; “find all unprotected left turns with pedestrians.”
- Tools/products/workflows: Fleet analytics; active curation for training datasets; QA validation with evidence.
- Assumptions/dependencies: Multi-sensor alignment; strict reliability thresholds; privacy and location compliance.

Smart Cities, Energy, and Infrastructure

Real-time incident detection under compute budgets
- Sector: Smart city ops, traffic management, public safety
- What it could do: Live streams triaged by low-res sweeps and zoom-in sampling to flag accidents, congestion, or crowding.
- Tools/products/workflows: VMS add-ons; city dashboards; escalation with evidence clips.
- Assumptions/dependencies: Streaming inference at scale; SLAs for latency/recall; edge/cloud orchestration.
Industrial inspection via drones and robots
- Sector: Energy, utilities, construction, agriculture
- What it could do: Query-driven inspection of long flights (“any corrosion on joints?”, “cracks on turbine blade?”), minimizing bandwidth and compute.
- Tools/products/workflows: Flight post-processing; autonomous waypoint re-tasking based on EVA’s plan/reflect loop.
- Assumptions/dependencies: Domain-specific visual cues; harsh conditions; integration with maintenance CMMS.

Media and Broadcast

Live highlight generation and compliance monitoring
- Sector: Sports broadcast, news, streaming
- What it could do: On-air event detection (goals, fouls, ads rule compliance) with evidence-linked clips and low computational footprint.
- Tools/products/workflows: Broadcast control-room assistants; automated clipping and metadata generation.
- Assumptions/dependencies: Real-time guarantees; league/policy constraints; robustness to camera switching.

Privacy-Preserving and Edge AI

On-device video agents with budgeted attention
- Sector: Consumer devices, IoT
- What it could do: Always-on but compute-aware assistants that watch “how” and “when” to look, storing only evidence snippets.
- Tools/products/workflows: Mobile/edge accelerators; low-res-first pipelines; encrypted evidence buffers.
- Assumptions/dependencies: Efficient small models; hardware acceleration; local reward proxies for continual learning.

Legal, Compliance, and Governance

Evidence-auditable AI video analysis
- Sector: Legal, compliance
- What it could do: Produce answers with verifiable provenance (timestamps, selected frames, plan/reflect logs) as part of admissible audit trails.
- Tools/products/workflows: “Evidence mode” enforcing CSV-like verification; immutable logs; policy templates.
- Assumptions/dependencies: Standards for AI evidence; explainability requirements; human review mandates.

Foundation Model and Data Ecosystems

Continual domain dataset generation via Data-Enhanced GRPO
- Sector: AI tooling, MLOps
- What it could do: Close the loop on failure cases by auto-generating new QA tasks and training data for niche domains (sports, surgery, retail).
- Tools/products/workflows: Dataset curation service; RL pipelines with composite rewards and anti-reward-hacking safeguards.
- Assumptions/dependencies: High-quality teacher models; governance for synthetic data; monitoring for distribution shift and bias.

Notes on feasibility across applications:

Compute and latency: EVA’s token-efficient planning-before-perception reduces cost, but multi-round tool calls add orchestration overhead. Real-time variants need optimized frame extractors and edge acceleration.
Reliability and safety: Current benchmark accuracy (often ~55–68%) necessitates human-in-the-loop or conservative thresholds in safety-critical domains.
Domain adaptation: Many high-value uses require domain-specific rewards, prompts, and curated failure trajectories (KTO/GRPO) to avoid reward hacking and hallucination.
Privacy and regulation: Surveillance and healthcare uses depend on rigorous privacy controls, consent, and regulatory approvals.

Glossary

Agentic Video Understanding: A paradigm where video models act as agents that plan and use tools to explore video content actively rather than passively processing frames. "agentic video understanding methods enable MLLM-based agents to actively explore video content using external tools."
belief state: In an MDP, the agent’s internal representation of what it knows so far (query, history, and gathered evidence). "the agent observes a belief state:"
Cold-Start: An initial training phase or dataset designed to bootstrap core behaviors before more advanced optimization. "Supervised Fine-Tuning (SFT) Cold-Start dataset"
Completeness Self-Verification (CSV) reward: A reward that checks whether the agent identified the correct supporting frames rather than guessing. "we adopt the Completeness Self-Verification (CSV) reward~\cite{pan2025timesearch}"
Data-Enhanced GRPO: A reinforcement learning pipeline that augments training by collecting failures and generating new data iteratively. "we introduce a Data-Enhanced GRPO pipeline."
Direct Preference Optimization (DPO): A preference-learning method that trains models from pairwise comparisons of outputs. "Unlike DPO~\citep{rafailov2023direct}, which requires pairwise preference data"
frame-selection tool: A tool interface that lets the agent choose when and how many frames to sample and at what resolution. "we design a flexible frame-selection tool that allows both temporal and spatial control."
Group Relative Policy Optimization (GRPO): A KL-regularized policy optimization algorithm used to fine-tune policies with reward signals while staying close to a reference model. "We employ Group Relative Policy Optimization (GRPO)~\citep{shao2024deepseekmathpushinglimitsmathematical}, a KL-regularized policy optimization method"
interleaved image–text reasoning: Reasoning that alternates or integrates visual frames and text within the same chain of thought. "tool-call formatting, interleaved image–text reasoning, frame-level understanding"
Kahneman–Tversky Optimization (KTO): A preference-learning method that uses single-sample labels to align strategies without requiring paired comparisons. "Kahneman–Tversky Optimization (KTO)~\citep{ethayarajh2024ktomodelalignmentprospect} dataset"
KL-regularized policy optimization: Policy optimization with a penalty on divergence from a reference policy to stabilize learning. "a KL-regularized policy optimization method"
LLM As Judge: Using an LLM as an automated evaluator to label or verify agent outputs or trajectories. "we use LLM As Judge to select the trajectories"
long-context challenge: The difficulty of reasoning over very long videos where processing all frames is inefficient. "effectively addressing the long-context challenge in video understanding."
long-horizon video understanding: Tasks requiring reasoning over extended temporal spans and complex sequences of events. "long-horizon video understanding."
Markov Decision Process (MDP): A formal framework for sequential decision-making with states, actions, and rewards. "We formulate the active video understanding problem as a Markov Decision Process (MDP)."
multimodal LLMs (MLLMs): LLMs that jointly process and reason over multiple modalities, such as text and video. "Video understanding with multimodal LLMs (MLLMs) remains challenging"
Multiple-Choice Questions (MCQ): A question format in which the model selects an answer from predefined options. "with 90% open-ended QA and 10% MCQ."
online reinforcement learning: Learning where the policy is updated from trajectories it generates during training. "GRPO is an online reinforcement learning framework"
open-ended QA: Question answering where the model must generate free-form textual answers rather than choose from options. "open-ended video QA pairs"
planning-before-perception: A strategy where the agent devises a plan from the query before consuming visual input. "we advocate a planning-before-perception paradigm"
preference labels: Single-sample signals (e.g., chosen/rejected) indicating which outputs or trajectories are preferable. "KTO only requires single-sample preference labels (“chosen” or “rejected”)."
reward hacking: Exploiting weaknesses in the reward function to get high scores without truly solving the task. "mitigates reward hacking caused by answer guessing"
reward shaping: Modifying or augmenting rewards to guide learning toward desired behaviors. "Reward Shaping"
rollouts: Sequences of actions and observations generated by the current policy during training. "the model generates multiple rollouts by itself"
self-play: Training by interacting with the environment using the model’s own policy, often without external demonstrations. "rather than self-play"
Supervised Fine-Tuning (SFT): Training a model on labeled data to imitate desired behavior before RL. "Supervised Fine-Tuning (SFT) Cold-Start dataset"
teacher MLLM: A stronger or larger MLLM used to synthesize data, provide guidance, or label examples for training. "the teacher MLLM is prompted to produce open-ended QA pairs"
temporal granularity: The fineness of temporal sampling (e.g., number of frames over time) used to perceive motion or events. "with flexible control over spatial resolution and temporal granularity."
temporal window: A selected time interval within a video from which to sample frames. "The start_time and end_time specify the temporal window"
trajectory (in RL): A sequence of states, actions, and outcomes produced by following a policy. "successful and failed strategy trajectories"
visual tokens: Tokenized representations of visual inputs consumed by an MLLM, contributing to context length and cost. "without costing too many visual tokens."
zero-shot: Evaluating a model on tasks it was not specifically trained on, without task-specific fine-tuning. "evaluated in a zero-shot setting"
zoom-in and zoom-out operations: Adjusting spatial resolution to trade detail for token cost when inspecting frames. "zoom-in and zoom-out operations."
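
The frame-selection entries above (temporal window, temporal granularity, visual tokens, zoom-in and zoom-out) can be tied together with a small sketch. It is not EVA's actual tool interface: only `start_time` and `end_time` are named in the source, while `num_frames`, `scale`, the midpoint sampling rule, and the one-token-per-28×28-pixel-tile cost model are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FrameRequest:
    """Hypothetical tool-call arguments for a frame-selection tool."""
    start_time: float   # seconds, start of the temporal window (named in the source)
    end_time: float     # seconds, end of the temporal window (named in the source)
    num_frames: int     # assumed name: temporal granularity
    scale: float = 1.0  # assumed name: <1 zooms out (fewer tokens), >1 zooms in


def sample_timestamps(req: FrameRequest) -> List[float]:
    """Return num_frames evenly spaced timestamps (segment midpoints) in the window."""
    if req.num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if req.end_time <= req.start_time:
        raise ValueError("end_time must be greater than start_time")
    step = (req.end_time - req.start_time) / req.num_frames
    return [req.start_time + step * (i + 0.5) for i in range(req.num_frames)]


def estimate_visual_tokens(req: FrameRequest, width: int, height: int,
                           patch: int = 28) -> int:
    """Rough cost: one token per patch x patch tile of the rescaled frame (assumption)."""
    cols = max(1, round(width * req.scale / patch))
    rows = max(1, round(height * req.scale / patch))
    return req.num_frames * cols * rows


# Usage: five frames from the first ten seconds, at full and at half resolution.
full = FrameRequest(start_time=0.0, end_time=10.0, num_frames=5)
half = FrameRequest(start_time=0.0, end_time=10.0, num_frames=5, scale=0.5)
print(sample_timestamps(full))                # [1.0, 3.0, 5.0, 7.0, 9.0]
print(estimate_visual_tokens(full, 448, 224)) # 640
print(estimate_visual_tokens(half, 448, 224)) # 160
```

Under this cost model, halving the scale cuts the token count by a factor of four, which is the trade-off the zoom-out operation is meant to exploit: coarse frames for locating an event, then a narrower window at higher resolution for detail.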


