Diagnostic Benchmark for Multimodal Video Models: The Perception Test
This paper introduces the Perception Test, a comprehensive benchmark for assessing the perception and reasoning skills of pre-trained multimodal models that process video, such as Flamingo, SeViLA, and GPT-4. Unlike existing benchmarks that target specific computational tasks (e.g., classification, detection, or tracking), the Perception Test evaluates models across four skill areas (Memory, Abstraction, Physics, and Semantics) and four reasoning types (descriptive, explanatory, predictive, and counterfactual), spanning the video, audio, and text modalities.
Dataset Composition
The Perception Test comprises 11,600 real-world videos, averaging 23 seconds in length, filmed by approximately 100 participants worldwide. The videos are designed to show perceptually interesting situations that require sophisticated reasoning to resolve. Each video is densely annotated with six types of labels: object tracks, point tracks, temporal action segments, temporal sound segments, multiple-choice video question-answers, and grounded video question-answers. This detailed annotation enables evaluation at both the language (semantic) level and non-language levels, supporting diverse assessments of model capabilities.
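To make the annotation structure concrete, the sketch below shows one way a single video's labels could be represented in Python. The class and field names are illustrative assumptions, not the benchmark's released annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical, simplified structures for the six annotation types described above;
# this sketch does not reproduce the benchmark's actual schema.

@dataclass
class ObjectTrack:
    label: str                                   # e.g. "cup"
    boxes: List[Tuple[int, float, float, float, float]] = field(default_factory=list)
    # each entry: (frame index, x, y, width, height) in normalised coordinates

@dataclass
class PointTrack:
    points: List[Tuple[int, float, float]] = field(default_factory=list)
    # each entry: (frame index, x, y)

@dataclass
class TemporalSegment:
    label: str                                   # action or sound class
    start_s: float                               # segment start time in seconds
    end_s: float                                 # segment end time in seconds

@dataclass
class MultipleChoiceQA:
    question: str
    options: List[str]                           # candidate answers
    answer_idx: int                              # index of the correct option
    skill_area: str                              # Memory / Abstraction / Physics / Semantics
    reasoning_type: str                          # descriptive / explanatory / predictive / counterfactual

@dataclass
class GroundedQA:
    question: str
    answer_tracks: List[ObjectTrack] = field(default_factory=list)  # answer given as object track(s)

@dataclass
class VideoAnnotation:
    video_id: str
    object_tracks: List[ObjectTrack] = field(default_factory=list)
    point_tracks: List[PointTrack] = field(default_factory=list)
    action_segments: List[TemporalSegment] = field(default_factory=list)
    sound_segments: List[TemporalSegment] = field(default_factory=list)
    mc_video_qa: List[MultipleChoiceQA] = field(default_factory=list)
    grounded_video_qa: List[GroundedQA] = field(default_factory=list)
```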
Evaluation Framework
Evaluation with the Perception Test can occur in zero-shot, few-shot, or limited fine-tuning regimes, challenging models to transfer their capabilities rather than rely on task-specific training. A public dataset is available for fine-tuning and validation, while a challenge server provides access to a held-out test split, facilitating robust evaluation of generalization. The benchmark reveals a substantial gap between the human baseline (91.4% correct on the multiple-choice video QA task) and state-of-the-art models (46.2%), highlighting considerable room for improvement in current approaches to multimodal video understanding.
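Since the headline comparison is top-1 accuracy on the multiple-choice video QA task, a scoring routine along the following lines reproduces that kind of number. This is a minimal sketch assuming predictions are a mapping from question IDs to chosen option indices; it is not the benchmark's official evaluation code, and the per-area breakdown is included only to show how skill-level diagnostics could be derived.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def mc_qa_accuracy(
    predictions: Dict[str, int],               # question_id -> predicted option index
    ground_truth: Dict[str, Tuple[int, str]],  # question_id -> (correct index, skill area)
) -> Tuple[float, Dict[str, float]]:
    """Return overall top-1 accuracy and a per-skill-area breakdown."""
    per_area_hits: Dict[str, List[int]] = defaultdict(list)
    for qid, (answer_idx, area) in ground_truth.items():
        # Unanswered questions count as incorrect.
        correct = int(predictions.get(qid, -1) == answer_idx)
        per_area_hits[area].append(correct)

    all_hits = [h for hits in per_area_hits.values() for h in hits]
    overall = sum(all_hits) / len(all_hits)
    per_area = {area: sum(hits) / len(hits) for area, hits in per_area_hits.items()}
    return overall, per_area

# Toy usage: two questions from different skill areas.
preds = {"q1": 0, "q2": 2}
gt = {"q1": (0, "Memory"), "q2": (1, "Physics")}
overall, per_area = mc_qa_accuracy(preds, gt)
print(f"overall: {overall:.1%}", per_area)  # overall: 50.0% {'Memory': 1.0, 'Physics': 0.0}
```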
Implications and Future Developments
This paper highlights the potential of the Perception Test to drive improvements in general-purpose multimodal models that must effectively integrate information across disparate modalities and reasoning types. The diagnostic nature of the benchmark, inspired by human behavioral tests, presents an opportunity to not only assess but also direct the development of sophisticated reasoning models that can operate effectively across complex real-world scenarios. Moving forward, researchers might explore augmenting the Perception Test with additional modalities such as tactile data or expand its applications to evolving domains like robotics, where comprehensive understanding across multiple modalities is crucial.
Overall, the Perception Test represents a significant step in multimodal AI, prioritizing the skills necessary for more human-like perception and reasoning. As models continue to evolve, it will be critical to develop analogous benchmarks that push the boundaries of what these systems can achieve, both in perceptual acuity and in the scope and scale of reasoning they can undertake.
Note: The dataset, baseline code, and further information are available in the authors' Perception Test repository on GitHub.