Attribution of performance gains in video-LLMs and agentic frameworks
Determine whether the performance gains reported for video large language models (video-LLMs) and agentic frameworks on benchmarks such as EgoSchema, LongVideoBench, VideoMME, and MLVU arise primarily from grounded visual perception, from purely linguistic reasoning, or from background knowledge priors.
References
While the synergy between video-LLMs and agentic frameworks has led to rapid gains on existing benchmarks, it remains unclear whether these gains stem from visual perception, linguistic reasoning, or knowledge priors.
— Video-Oasis: Rethinking Evaluation of Video Understanding
(2603.29616 - Lim et al., 31 Mar 2026) in Section 2.1, Large Language Models for Video Understanding