EVA: Efficient Reinforcement Learning for End-to-End Video Agent
This presentation explores EVA, a breakthrough framework that transforms video understanding from passive observation into active, query-driven reasoning. By planning before perceiving, EVA achieves state-of-the-art accuracy on long-video question answering tasks while using 10 times fewer visual tokens than traditional methods. Through a three-stage reinforcement learning pipeline combining supervised fine-tuning, preference learning, and policy optimization, EVA learns to strategically allocate its attention across videos, performing coarse global scans followed by precise high-resolution retrieval only where needed.
What if your video AI could decide what to watch and when to watch it, instead of blindly processing every frame? Traditional video models drown in visual tokens, but EVA flips the script entirely.
EVA reframes video understanding as an iterative process: summarize, plan, act, and reflect. Instead of encoding thousands of frames upfront, the agent reasons about the query first, then strategically extracts only the visual information it needs, with complete control over temporal windows, frame counts, and spatial resolution.
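The loop described above can be sketched as a short program. Everything here, including the class names, the two-round schedule, and the specific windows and resolutions, is illustrative scaffolding rather than EVA's actual interface; it only shows the shape of a policy that plans a frame request each round instead of encoding the whole video upfront.

```python
# Illustrative sketch of an EVA-style summarize-plan-act-reflect loop.
# All names and numbers here are hypothetical, not from the paper.
from dataclasses import dataclass

@dataclass
class FrameRequest:
    start_s: float    # temporal window start (seconds)
    end_s: float      # temporal window end (seconds)
    num_frames: int   # frames to sample inside the window
    side_px: int      # spatial resolution (short side, pixels)

class ToyPolicy:
    """Stand-in for the trained agent: scan coarsely, then zoom in."""
    def plan(self, round_idx: int, video_len_s: float) -> FrameRequest:
        if round_idx == 0:
            # Round 1: fast, low-resolution global scan of the whole video.
            return FrameRequest(0.0, video_len_s, num_frames=32, side_px=112)
        # Later rounds: few high-resolution frames from a narrow window.
        return FrameRequest(40.0, 50.0, num_frames=8, side_px=448)

def run_agent(query: str, video_len_s: float, policy, max_rounds: int = 2):
    """Iterate plan -> act, collecting each round's frame request.
    (Frame extraction and the reflect step are elided in this toy.)"""
    requests = []
    for r in range(max_rounds):
        req = policy.plan(r, video_len_s)  # plan: decide where to look next
        requests.append(req)               # act: extract only those frames
    return requests

reqs = run_agent("When does the goal happen?", 120.0, ToyPolicy())
```

The key design point the sketch captures is that temporal window, frame count, and resolution are all outputs of the policy, not fixed preprocessing choices.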
This autonomy emerges through a carefully designed reinforcement learning pipeline.
The training unfolds in three acts. Supervised fine-tuning induces cold-start competence in tool calling and multi-turn reasoning. Kahneman-Tversky Optimization (KTO) then stabilizes the policy by learning from both successes and failures. Finally, Group Relative Policy Optimization (GRPO) refines the agent online, using composite rewards that balance accuracy with response quality, yielding policies that generalize robustly to unseen queries.
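The online stage can be sketched in miniature: a composite reward scores each sampled rollout, and advantages are normalized within a group of rollouts for the same query, as in GRPO-style training. The specific reward terms and weights below are assumptions for illustration, not the paper's exact design.

```python
# Sketch of the online refinement stage: composite reward plus
# group-normalized advantages (GRPO-style). Weights are illustrative.
import statistics

def composite_reward(correct: bool, well_formatted: bool,
                     w_acc: float = 1.0, w_fmt: float = 0.2) -> float:
    """Balance answer accuracy with response quality (hypothetical terms)."""
    return w_acc * float(correct) + w_fmt * float(well_formatted)

def group_advantages(rewards):
    """Normalize each rollout's reward against its group's mean and std."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sd for r in rewards]

# Four rollouts for the same query: two correct, some badly formatted.
rewards = [composite_reward(True, True),
           composite_reward(True, False),
           composite_reward(False, True),
           composite_reward(False, False)]
adv = group_advantages(rewards)  # correct rollouts get positive advantage
```

Group normalization means a rollout is rewarded for being better than its siblings on the same query, which removes the need for a separate learned value baseline.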
The results reveal something striking. Across six long-video benchmarks, EVA establishes new state-of-the-art performance while consuming a fraction of the visual tokens used by competing models. On LSDBench, it reaches 51.8 percent accuracy with just 6,200 tokens, outperforming models that process 21,000 tokens. The agent distributes its visual budget adaptively: initial rounds perform fast, low-resolution global scans, then zoom in with high resolution exactly where the query demands it. This isn't brute force; it's strategic allocation learned through reinforcement.
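Back-of-the-envelope arithmetic shows why this allocation is so cheap. Assuming a ViT-style encoder that emits one token per 14x14 patch (the patch size, frame counts, and resolutions below are illustrative, not EVA's reported configuration), a coarse low-resolution scan plus a small high-resolution zoom costs an order of magnitude fewer tokens than uniform dense sampling:

```python
def visual_tokens(num_frames: int, side_px: int, patch_px: int = 14) -> int:
    """Approximate ViT-style token count: one token per patch per frame."""
    patches_per_frame = (side_px // patch_px) ** 2
    return num_frames * patches_per_frame

# Uniform dense sampling: 256 frames at 224 px each.
dense = visual_tokens(256, 224)      # 256 * 16^2 = 65,536 tokens

# Coarse-to-fine: 32-frame low-res scan + 4-frame high-res zoom.
scan = visual_tokens(32, 112)        # 32 * 8^2  = 2,048 tokens
zoom = visual_tokens(4, 448)         # 4 * 32^2  = 4,096 tokens
adaptive = scan + zoom               # 6,144 tokens, >10x cheaper
```

The numbers are made up, but the scaling is the point: token cost grows quadratically with resolution, so spending high resolution only inside a narrow, query-relevant window is where the savings come from.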
The implications extend beyond benchmarks. EVA resolves the fundamental sampling versus reasoning trade-off for long-horizon video understanding. By integrating planning directly into perception, it offers a scalable blueprint for agentic multimodal systems that must handle arbitrary queries, extended contexts, and variable computational budgets. The framework proves that active, query-driven reasoning outperforms both imitation-only training and rigid tool-based pipelines.
EVA demonstrates that teaching video agents to plan before they perceive unlocks both intelligence and efficiency. To explore more cutting-edge research like this and create your own video presentations, visit EmergentMind.com.