Evaluating Multiview Object Consistency in Humans and Image Models (2409.05862v2)

Published 9 Sep 2024 in cs.CV

Abstract: We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same/different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated `nonsense' objects). After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human-model performance is correlated, humans allocate more time/processing on challenging trials. All images, data, and code can be accessed via our project page.

Evaluating Multiview Object Consistency in Humans and Image Models

The paper "Evaluating Multiview Object Consistency in Humans and Image Models" presents a comprehensive benchmark designed to directly evaluate the alignment between human perception and computer vision models in the context of 3D shape inference tasks. Leveraging a dataset (MOCHI), the authors investigate how well state-of-the-art vision models such as DINOv2, MAE, and CLIP align with human performance on tasks requiring the differentiation of objects from various viewpoints.

Experimental Design and Data Collection

The experimental design involves a zero-shot visual inference task in which participants must identify matching or differing objects from triplets of images. Difficulty is manipulated by varying object similarity and viewpoint. Two variants of the task are employed: an 'odd-one-out' task, where participants select the differing object within a triplet, and a 'match-to-sample' task, where participants identify the object matching a given sample image. The stimuli are sourced from four distinct datasets ('barense,' 'majaj,' 'shapenet,' and 'shapegen'), each contributing distinct object categories and conditions that span a range of perceptual complexity and similarity.
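
To make the task structure concrete, below is a minimal sketch of how a single benchmark trial might be represented. The field names are illustrative assumptions for exposition, not the schema of the released dataset.

```python
# Illustrative trial record for the two task variants described above.
# Field names are hypothetical; consult the released data for the real schema.
from dataclasses import dataclass
from typing import Literal, Tuple

@dataclass
class Trial:
    task: Literal["odd_one_out", "match_to_sample"]
    source: Literal["barense", "majaj", "shapenet", "shapegen"]
    images: Tuple[str, str, str]   # paths to the three images shown on a trial
    target_index: int              # ground-truth index of the odd/matching image
```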

Human data were collected across over 35,000 trials from more than 500 participants, enabling a robust evaluation of human accuracy, reaction times, and gaze patterns. This extensive dataset provides explicit behavioral measures and intermediate metrics crucial for comparing human performance against model performance.

Model Evaluation Methods

The vision models evaluated include self-supervised models (DINOv2), masked autoencoders (MAE), and vision-language models (CLIP). Model performance was measured with several readout strategies, including zero-shot distance metrics and linear probes. The most effective readout was a same-different classifier, which substantially outperformed the other methods and provided the most sensitive measure of model accuracy in identifying the 'odd-one-out' within each image triplet.
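
As an illustration of the zero-shot distance-metric readout mentioned above, the sketch below scores an odd-one-out trial directly from frozen image embeddings by selecting the image least similar to the other two. The `odd_one_out` helper and the embedding-extraction step are assumptions for illustration, not the authors' released evaluation code.

```python
# Minimal sketch of a zero-shot odd-one-out readout from frozen embeddings.
import torch
import torch.nn.functional as F

def odd_one_out(embeddings: torch.Tensor) -> int:
    """Given a (3, D) tensor of image embeddings, return the index of the
    image whose embedding is least similar to the other two."""
    feats = F.normalize(embeddings, dim=-1)   # unit-normalize features
    sim = feats @ feats.T                     # (3, 3) cosine similarities
    sim.fill_diagonal_(0.0)                   # ignore self-similarity
    support = sim.sum(dim=-1)                 # total similarity to the other images
    return int(support.argmin())              # lowest support = odd one out

# Usage (assuming `model` is any frozen vision backbone returning (1, D) features):
# feats = torch.stack([model(img).squeeze(0) for img in triplet])  # (3, D)
# prediction = odd_one_out(feats)
```

Under such a readout, chance accuracy on a triplet is roughly one in three, which is consistent with the near-chance MAE performance reported below.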

Results

Human Performance

Humans achieved an average accuracy of 78%, with notable reliability in their performance across different experimental conditions. Reaction times correlated with task difficulty, indicating that humans allocate more processing time for more challenging tasks. Gaze patterns, captured through eye-tracking, showed reliable attention to task-relevant features, suggesting a consistent approach to visual processing among participants.

Model Performance

None of the evaluated vision models reached human-level performance. DINOv2-Giant emerged as the best-performing model with an accuracy of 44%, well below human performance. MAE models performed around chance, indicating poor alignment with human shape inference. Increasing model scale improved performance for DINOv2 and CLIP but did not close the substantial gap between humans and models.

Correlation and Divergence

Human and model performance were correlated, indicating shared sources of task difficulty, yet the gap between human and model accuracy varied substantially across conditions. The trials that most challenged the models were also those on which humans showed longer reaction times and more focused gaze patterns, suggesting different processing strategies. Model attention patterns failed to replicate the reliable human gaze patterns, highlighting a fundamental divergence in visual processing.
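
A minimal sketch of how such a trial-level comparison between human difficulty signals and model accuracy could be computed is shown below. The column names (`human_acc`, `human_rt`, `model_correct`) are hypothetical and do not reflect the released data's actual schema.

```python
# Hedged sketch: correlate per-trial human difficulty signals with model accuracy.
import pandas as pd
from scipy.stats import spearmanr

def human_model_alignment(trials: pd.DataFrame) -> dict:
    """`trials` has one row per trial with columns:
       'human_acc' (mean human accuracy), 'human_rt' (mean reaction time, s),
       'model_correct' (0/1 model outcome)."""
    acc_corr, _ = spearmanr(trials["human_acc"], trials["model_correct"])
    rt_corr, _ = spearmanr(trials["human_rt"], trials["model_correct"])
    return {
        "accuracy_correlation": acc_corr,  # shared difficulty: harder for humans, harder for models
        "rt_correlation": rt_corr,         # negative if models fail where humans slow down
    }
```

A positive accuracy correlation together with a negative reaction-time correlation would reproduce the pattern described above: models tend to fail on the trials where humans allocate additional processing time.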

Implications and Future Directions

The results underscore the critical need for improving computer vision models' capabilities to approximate dynamic and sequential human visual processing. The authors' benchmarks not only highlight the performance disparities but also provide insight into the algorithmic underpinnings of human visual abilities. Future research could explore integrating these behavioral insights into model training protocols, potentially enhancing model robustness and generalizability in complex visual tasks.

The paper's contributions are substantial, establishing a nuanced evaluation framework and delineating clear performance metrics that could guide the development of next-generation vision models. By offering a detailed examination of human-model alignment, this work sets the stage for more refined and human-aligned computer vision methodologies.

Data and Resources

All data, code, and stimuli used in this paper are openly accessible under a CC BY-NC-SA 4.0 license. These resources are available on the project page and through GitHub repositories, providing transparency and enabling further research and replication studies in this domain.

In conclusion, the paper offers a rigorous framework for benchmarking vision models against human perception, paving the way for future advancements in computer vision that are more closely aligned with the complexities of human visual cognition.

Authors (8)
  1. Tyler Bonnen
  2. Stephanie Fu
  3. Yutong Bai
  4. Thomas O'Connell
  5. Yoni Friedman
  6. Nancy Kanwisher
  7. Joshua B. Tenenbaum
  8. Alexei A. Efros