Diagnostic Benchmark for Multimodal Video Models: The Perception Test
This paper introduces the Perception Test, a comprehensive benchmark for assessing the perception and reasoning skills of pre-trained multimodal models that process video, such as Flamingo, SeViLA, and GPT-4. Unlike existing benchmarks that target specific computational tasks (e.g., classification, detection, or tracking), the Perception Test evaluates models across four skill areas (Memory, Abstraction, Physics, and Semantics) and four reasoning types (descriptive, explanatory, predictive, and counterfactual), spanning the video, audio, and text modalities.
Dataset Composition
The Perception Test comprises 11,600 real-world videos, averaging 23 seconds in length, filmed by approximately 100 participants worldwide. The videos are designed to show perceptually interesting situations that require sophisticated reasoning to resolve. Each video is densely annotated with six types of labels: object tracks, point tracks, temporal action segments, temporal sound segments, multiple-choice video question-answers, and grounded video question-answers. This detailed annotation enables evaluation at both the language (semantic) level and non-language levels, supporting diverse assessments of model capabilities.
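To make the annotation structure concrete, the sketch below shows one way a single video's labels could be represented in Python. The class and field names are illustrative assumptions, not the benchmark's released annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical, simplified structures for the six annotation types described above;
# this sketch does not reproduce the benchmark's actual schema.

@dataclass
class ObjectTrack:
    label: str                                   # e.g. "cup"
    boxes: List[Tuple[int, float, float, float, float]] = field(default_factory=list)
    # each entry: (frame index, x, y, width, height) in normalised coordinates

@dataclass
class PointTrack:
    points: List[Tuple[int, float, float]] = field(default_factory=list)
    # each entry: (frame index, x, y)

@dataclass
class TemporalSegment:
    label: str                                   # action or sound class
    start_s: float                               # segment start time in seconds
    end_s: float                                 # segment end time in seconds

@dataclass
class MultipleChoiceQA:
    question: str
    options: List[str]                           # candidate answers
    answer_idx: int                              # index of the correct option
    skill_area: str                              # Memory / Abstraction / Physics / Semantics
    reasoning_type: str                          # descriptive / explanatory / predictive / counterfactual

@dataclass
class GroundedQA:
    question: str
    answer_tracks: List[ObjectTrack] = field(default_factory=list)  # answer given as object track(s)

@dataclass
class VideoAnnotation:
    video_id: str
    object_tracks: List[ObjectTrack] = field(default_factory=list)
    point_tracks: List[PointTrack] = field(default_factory=list)
    action_segments: List[TemporalSegment] = field(default_factory=list)
    sound_segments: List[TemporalSegment] = field(default_factory=list)
    mc_video_qa: List[MultipleChoiceQA] = field(default_factory=list)
    grounded_video_qa: List[GroundedQA] = field(default_factory=list)
```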
Evaluation Framework
Evaluation with the Perception Test can occur in zero-shot, few-shot, or limited fine-tuning regimes, challenging models to transfer their capabilities rather than rely on task-specific training. A public dataset is available for fine-tuning and validation, while a challenge server provides access to a held-out test split, facilitating robust evaluation of generalization. The benchmark reveals a substantial gap between the human baseline (91.4% correct on the multiple-choice video QA task) and state-of-the-art models (46.2%), highlighting considerable room for improvement in current approaches to multimodal video understanding.
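Since the headline comparison is top-1 accuracy on the multiple-choice video QA task, a scoring routine along the following lines reproduces that kind of number. This is a minimal sketch assuming predictions are a mapping from question IDs to chosen option indices; it is not the benchmark's official evaluation code, and the per-area breakdown is included only to show how skill-level diagnostics could be derived.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def mc_qa_accuracy(
    predictions: Dict[str, int],               # question_id -> predicted option index
    ground_truth: Dict[str, Tuple[int, str]],  # question_id -> (correct index, skill area)
) -> Tuple[float, Dict[str, float]]:
    """Return overall top-1 accuracy and a per-skill-area breakdown."""
    per_area_hits: Dict[str, List[int]] = defaultdict(list)
    for qid, (answer_idx, area) in ground_truth.items():
        # Unanswered questions count as incorrect.
        correct = int(predictions.get(qid, -1) == answer_idx)
        per_area_hits[area].append(correct)

    all_hits = [h for hits in per_area_hits.values() for h in hits]
    overall = sum(all_hits) / len(all_hits)
    per_area = {area: sum(hits) / len(hits) for area, hits in per_area_hits.items()}
    return overall, per_area

# Toy usage: two questions from different skill areas.
preds = {"q1": 0, "q2": 2}
gt = {"q1": (0, "Memory"), "q2": (1, "Physics")}
overall, per_area = mc_qa_accuracy(preds, gt)
print(f"overall: {overall:.1%}", per_area)  # overall: 50.0% {'Memory': 1.0, 'Physics': 0.0}
```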
Implications and Future Developments
This paper highlights the potential of the Perception Test to drive improvements in general-purpose multimodal models that must effectively integrate information across disparate modalities and reasoning types. The diagnostic nature of the benchmark, inspired by human behavioral tests, presents an opportunity to not only assess but also direct the development of sophisticated reasoning models that can operate effectively across complex real-world scenarios. Moving forward, researchers might explore augmenting the Perception Test with additional modalities such as tactile data or expand its applications to evolving domains like robotics, where comprehensive understanding across multiple modalities is crucial.
Overall, the Perception Test represents a significant step in multimodal AI, prioritizing the skills necessary for more human-like perception and reasoning. As models continue to evolve, it will be critical to develop analogous benchmarks that push the boundaries of what these systems can achieve, both in perceptual acuity and in the scope and scale of reasoning they can undertake.
Note: The dataset, baseline code, and further information are available in the authors' Perception Test repository on GitHub.