- The paper demonstrates that large-scale training narrows the distortion robustness gap, with models like Noisy Student outperforming humans on challenging OOD datasets.
- The research shows that despite high accuracy, machine error patterns diverge from human perceptual errors, indicating a persistent consistency gap.
- The study highlights that both dataset size and advanced architectures, such as vision transformers, are key to advancing machine vision robustness and human-like performance.
Analyzing "Partial Success in Closing the Gap Between Human and Machine Vision"
The paper "Partial Success in Closing the Gap Between Human and Machine Vision" by Geirhos et al. examines the enduring challenge of aligning the performance and behavior of artificial neural networks, particularly CNNs, with those of the human visual system. Through a comprehensive and systematic comparison, the authors investigate whether recent advances in machine learning have made progress toward this goal, focusing primarily on out-of-distribution (OOD) robustness and consistency with human error patterns.
Methodology
The authors conducted extensive psychophysical experiments involving 85,120 trials with 90 human participants. These trials tested human visual recognition across 17 OOD datasets designed to challenge ImageNet-trained models on various image distortions such as stylized images, edge-filtered images, and synthetic noise. For machine comparisons, they evaluated a diverse set of 52 models, deviating from standard supervised CNNs along three key axes: objective function, architecture, and dataset size.
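The core of the benchmark is a per-dataset accuracy comparison between human observers and models on the same trials. As a rough illustration (the helper function, dataset, and toy predictions below are hypothetical, not the paper's actual data or tooling), the comparison reduces to scoring each observer's per-trial responses against the ground-truth labels:

```python
# Sketch of the per-dataset accuracy comparison underlying the benchmark.
# The labels and predictions below are illustrative toy data.

def accuracy(predictions, labels):
    """Fraction of trials where the predicted class matches the label."""
    assert len(predictions) == len(labels)
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Toy trial records for two observers (a model and a human participant)
# on one hypothetical OOD dataset, e.g. edge-filtered images.
labels     = ["cat", "dog", "car", "cat", "bird", "dog"]
model_pred = ["cat", "dog", "cat", "cat", "bird", "cat"]
human_pred = ["cat", "dog", "car", "dog", "bird", "dog"]

print(accuracy(model_pred, labels))  # 4/6 correct
print(accuracy(human_pred, labels))  # 5/6 correct
```

Repeating this per OOD dataset (and, for parametric distortions, per severity level) yields the accuracy curves on which the robustness-gap comparisons rest.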
Core Findings
- Distortion Robustness Gap: The paper reveals that the distortion robustness gap between CNNs and human vision is narrowing. Models trained on large datasets (ranging from 14M to 1B images) often exceed human performance on several OOD datasets. This is particularly evident in models such as Noisy Student (trained on 300M images) and CLIP (trained on 400M images), which exhibit superior OOD generalization compared to standard ImageNet-trained models.
- Consistency Gaps: However, aligning model errors with human perceptual errors remains challenging. The study finds that while models can achieve high accuracy, their error patterns still diverge significantly from those of humans. Only models trained on very large datasets show a tendency toward human-like error patterns.
- Impact of Dataset Size: A crucial determinant in bridging the performance gap was the size of the dataset used for training. Models trained on significantly larger datasets displayed enhanced robustness and a reduction in the error consistency gap with humans.
- Role of Architecture and Training Objectives: Vision transformers, when combined with extensive datasets, showed improvements in robustness and even human-like error consistency. However, advances in architecture or training objectives, such as self-supervised or adversarially trained models, did not alone yield a substantial alignment with human error patterns.
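The consistency comparisons above go beyond raw accuracy: two observers with identical accuracy can still err on entirely different trials. A common way to quantify this is a kappa-style error-consistency score on trial-by-trial correctness, which measures how much two observers' right/wrong patterns agree beyond what their accuracies alone would predict. A minimal sketch (the toy data are illustrative, not the paper's measurements):

```python
# Sketch of a kappa-style error-consistency measure: observed agreement in
# per-trial correctness, corrected for agreement expected by chance given
# each observer's accuracy. Toy data below are illustrative only.

def error_consistency(correct_a, correct_b):
    """Chance-corrected agreement between two observers' per-trial correctness.

    correct_a, correct_b: equal-length lists of booleans, True where the
    observer answered that trial correctly.
    """
    n = len(correct_a)
    p_a = sum(correct_a) / n  # accuracy of observer A
    p_b = sum(correct_b) / n  # accuracy of observer B
    # Observed agreement: trials where both are right or both are wrong.
    c_obs = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Agreement expected if the two error patterns were independent.
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    if c_exp == 1.0:
        return 0.0  # degenerate case: both observers at 0% or 100% accuracy
    return (c_obs - c_exp) / (1 - c_exp)

model = [True, True, False, True, False, True, True, False]
human = [True, True, False, False, False, True, True, True]
print(round(error_consistency(model, human), 3))  # 0.467: modest agreement
```

A score near 0 means the two observers' errors overlap only as much as chance predicts; a score near 1 means they fail on largely the same trials. The paper's consistency-gap finding corresponds to most models scoring well below human-to-human agreement on this kind of measure.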
Implications and Future Directions
The progress highlighted in closing the robustness gap suggests practical implications for deploying machine vision models in real-world settings where conditions are often unpredictable and varied. Despite this, the substantial consistency gap indicates that current models may still rely on shortcut strategies that fail to generalize in the way human vision does.
Future efforts might benefit from a focus on understanding and incorporating the underlying processes of human error patterns into AI models. This could involve investigating the perceptual cues and strategies employed by humans that are absent in machine learning algorithms. Furthermore, the dependency on large-scale datasets raises questions about the accessibility of such resources and the environmental impact of extensive computation.
Conclusion
Geirhos et al.'s work highlights significant milestones and continuing challenges in aligning human and machine vision. While there is a clear trend toward improved OOD robustness, the path to truly human-like perception remains complex and multifaceted. The study provides essential insights and benchmarks that pave the way for future work on narrowing the gap further, offering cautious optimism for developments in both theoretical understanding and applied technologies.