Attribution of performance gains in video-LLMs and agentic frameworks

Determine whether the performance gains reported for video large language models (video-LLMs) and agentic frameworks on benchmarks such as EgoSchema, LongVideoBench, VideoMME, and MLVU arise primarily from grounded visual perception, from purely linguistic reasoning, or from background knowledge priors.

Background

Recent work combining video-LLMs with agentic frameworks has produced rapid progress on several popular benchmarks. However, the observed gains may stem from distinct sources: grounded visual perception, language-only reasoning, or external knowledge priors.

The paper argues that these factors must be disentangled in order to attribute improvements accurately and to guide future model and benchmark design, motivating diagnostic tools such as Video-Oasis.
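To make the disentanglement concrete, below is a minimal sketch of the kind of ablation such a diagnostic could run: scoring the same multiple-choice benchmark item with the full video, with no frames at all, and with uninformative blank frames. The AnswerFn interface, the probe_item harness, and all names here are hypothetical illustrations for a generic video-LLM, not the Video-Oasis implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical interface: AnswerFn stands in for any video-LLM inference
# call mapping (frames, question, answer choices) to a predicted choice.
# An empty `frames` sequence means the model answers "blind" (text only).
Frames = Sequence[object]
AnswerFn = Callable[[Frames, str, Sequence[str]], str]


@dataclass
class Condition:
    """One input condition applied to the same benchmark item."""
    name: str
    frames: Frames


def probe_item(
    answer_fn: AnswerFn,
    question: str,
    choices: Sequence[str],
    gold: str,
    video_frames: Frames,
    blank_frames: Frames,
) -> Dict[str, bool]:
    """Score a single multiple-choice item under three conditions.

    - full_video:   grounded visual perception is available.
    - blind:        no frames; a correct answer must come from linguistic
                    reasoning or background knowledge priors.
    - blank_frames: uninformative frames, controlling for cues such as
                    frame count or prompt length rather than visual content.
    """
    conditions = [
        Condition("full_video", video_frames),
        Condition("blind", []),
        Condition("blank_frames", blank_frames),
    ]
    return {c.name: answer_fn(c.frames, question, choices) == gold
            for c in conditions}


if __name__ == "__main__":
    # Dummy model that ignores its inputs and always picks the first
    # choice, just to exercise the harness end to end.
    def dummy_model(frames: Frames, question: str,
                    choices: Sequence[str]) -> str:
        return choices[0]

    result = probe_item(
        dummy_model,
        question="What does the person assemble?",
        choices=["a shelf", "a bicycle", "a tent"],
        gold="a shelf",
        video_frames=["frame0", "frame1"],  # placeholders for real frames
        blank_frames=["blank", "blank"],
    )
    print(result)  # {'full_video': True, 'blind': True, 'blank_frames': True}
```

Aggregated over a benchmark, such probes support attribution: if blind accuracy approaches full-video accuracy, the reported gains are better explained by linguistic reasoning or knowledge priors than by grounded perception.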

References

"While the synergy between video-LLMs and agentic frameworks has led to rapid gains on existing benchmarks, it remains unclear whether these gains stem from visual perception, linguistic reasoning, or knowledge priors."

Video-Oasis: Rethinking Evaluation of Video Understanding (arXiv:2603.29616, Lim et al., 31 Mar 2026), Section 2.1, Large Language Models for Video Understanding.