Are Convolutional Neural Networks or Transformers more like human vision? (2105.07197v2)

Published 15 May 2021 in cs.CV

Abstract: Modern machine learning models for computer vision exceed humans in accuracy on specific visual recognition tasks, notably on datasets like ImageNet. However, high accuracy can be achieved in many ways. The particular decision function found by a machine learning system is determined not only by the data to which the system is exposed, but also the inductive biases of the model, which are typically harder to characterize. In this work, we follow a recent trend of in-depth behavioral analyses of neural network models that go beyond accuracy as an evaluation metric by looking at patterns of errors. Our focus is on comparing a suite of standard Convolutional Neural Networks (CNNs) and a recently-proposed attention-based network, the Vision Transformer (ViT), which relaxes the translation-invariance constraint of CNNs and therefore represents a model with a weaker set of inductive biases. Attention-based networks have previously been shown to achieve higher accuracy than CNNs on vision tasks, and we demonstrate, using new metrics for examining error consistency with more granularity, that their errors are also more consistent with those of humans. These results have implications both for building more human-like vision models, as well as for understanding visual object recognition in humans.

Authors (4)
  1. Shikhar Tuli (15 papers)
  2. Ishita Dasgupta (35 papers)
  3. Erin Grant (15 papers)
  4. Thomas L. Griffiths (150 papers)
Citations (164)

Summary

  • The paper shows that Vision Transformers align more closely with human error patterns than CNNs do, reflecting a stronger bias toward shape over texture.
  • It employs metrics such as Cohen’s kappa and JS distance on datasets like Stylized ImageNet to quantify discrepancies between model and human errors.
  • The study reveals that robust data augmentation enhances shape bias, allowing ViTs to maintain high accuracy while mimicking human visual strategies.

Comparing the Human-Likeness of Vision Models: CNNs and Vision Transformers

The paper "Are Convolutional Neural Networks or Transformers more like human vision?" investigates the congruency between human visual perception and two prominent machine learning models: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). These models have been instrumental in achieving impressive accuracies on benchmark datasets like ImageNet. However, high accuracy alone does not necessarily indicate alignment with human visual reasoning, which this paper seeks to address through a detailed analysis of error patterns.

Background and Objective

CNNs have been the dominant architecture for computer vision tasks, owing in part to inductive biases loosely inspired by the primate visual system, and they power applications ranging from image classification to facial recognition. Nevertheless, previous studies have shown that CNNs lean towards texture-based classification, a divergence from human perception, which relies more heavily on shape. This texture bias becomes particularly evident when CNNs struggle with low-detail images such as sketches.

ViTs bring a new paradigm, employing the self-attention mechanism that has proven successful in natural language processing. Unlike CNNs, ViTs do not inherently enforce local spatial structure, potentially offering more flexibility akin to human visual processing. The authors aim to determine whether this flexibility translates into a closer behavioral resemblance to human vision by comparing CNNs and ViTs on their error profiles.

Methodology

The paper draws upon metrics that go beyond accuracy to evaluate model behavior. To measure error consistency against human responses, the authors employ Cohen's κ and the Jensen-Shannon (JS) distance to assess overlaps and discrepancies in error patterns between the models and human observers. Furthermore, experiments are designed using datasets such as Stylized ImageNet, which introduces texture-shape conflicts, thereby highlighting the models' biases. A sketch of how such metrics can be computed appears below.
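As a concrete illustration, here is a minimal Python sketch of the two error-consistency measures, assuming binary correct/incorrect vectors per trial and per-class error counts. The function names and the exact expected-agreement formulation are illustrative, not taken from the paper's released code.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def error_consistency_kappa(model_correct, human_correct):
    """Cohen's kappa over binary correct/incorrect trial vectors.

    Expected agreement is derived from the two accuracies, in the spirit
    of the error-consistency analyses discussed in the paper.
    """
    m = np.asarray(model_correct, dtype=bool)
    h = np.asarray(human_correct, dtype=bool)
    observed = np.mean(m == h)                     # observed agreement
    p_m, p_h = m.mean(), h.mean()                  # individual accuracies
    expected = p_m * p_h + (1 - p_m) * (1 - p_h)   # agreement expected by chance
    return (observed - expected) / (1 - expected)

def classwise_js_distance(model_error_counts, human_error_counts):
    """Jensen-Shannon distance between per-class error distributions."""
    p = np.asarray(model_error_counts, dtype=float)
    q = np.asarray(human_error_counts, dtype=float)
    return jensenshannon(p / p.sum(), q / q.sum())
```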

Results and Analysis

Empirical analysis reveals that ViTs not only surpass CNNs in accuracy but also align more closely with human error patterns, favoring shape over texture when classifying. Both Cohen's κ and the class-wise JS distance suggest that ViTs make mistakes that are more consistent with human errors than those of CNNs. This alignment is, however, more nuanced when considering inter-class errors, where a typical CNN may occasionally perform better under specific configurations.

Interestingly, the research also examines the impact of data augmentation techniques, which significantly shape the models' learned biases. With robust augmentations, shape bias increased in both CNNs and ViTs, though ViTs managed this without a substantial loss of accuracy, unlike CNNs. This suggests that ViTs are inherently better positioned to adopt human-like visual strategies.
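Shape bias on cue-conflict images (as in Stylized ImageNet) is typically summarized as the fraction of decisions that follow the shape label among those that follow either cue. A brief illustrative sketch, with hypothetical inputs rather than the paper's code:

```python
import numpy as np

def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions that follow the shape label.

    Only trials where the prediction matches either the shape or the
    texture category are counted, following the usual shape-bias measure.
    """
    preds = np.asarray(predictions)
    shape_hits = preds == np.asarray(shape_labels)
    texture_hits = preds == np.asarray(texture_labels)
    decided = shape_hits | texture_hits
    return shape_hits[decided].mean() if decided.any() else float("nan")
```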

Implications and Future Directions

The findings have multifaceted implications. Designing vision models that mimic human perception can improve the interpretability and reliability of automated systems in complex real-world settings. Future research could include detailed exploration of augmentation strategies, potentially integrating human-like inductive biases directly into model architectures.

The paper opens avenues for expanding metrics like JS distance to encapsulate more sophisticated notions of "conceptual understanding," a plausible direction for models that not only classify but also comprehend visual inputs contextually, similar to human cognition. Moreover, these metrics can serve as regularizers during model training for optimizing human-likeness, possibly reducing computational costs along the way.
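As a rough illustration of the regularizer idea, the sketch below adds a JS-divergence penalty between the model's softmax outputs and hypothetical per-image human response distributions to a standard cross-entropy loss. It assumes such human distributions are available, which is a strong assumption, and it is not an implementation from the paper.

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two batches of categorical distributions."""
    m = 0.5 * (p + q)
    kl_pm = (p * (torch.log(p + eps) - torch.log(m + eps))).sum(dim=-1)
    kl_qm = (q * (torch.log(q + eps) - torch.log(m + eps))).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)

def human_aligned_loss(logits, labels, human_dist, js_weight=0.1):
    """Cross-entropy plus a JS penalty toward (hypothetical) human response distributions."""
    ce = F.cross_entropy(logits, labels)      # standard classification loss
    model_dist = F.softmax(logits, dim=-1)    # model's predictive distribution
    js = js_divergence(model_dist, human_dist).mean()
    return ce + js_weight * js                # trades accuracy against human-likeness
```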

In conclusion, this research underscores how current vision models, particularly ViTs, have progressed toward closer alignment with human visual reasoning. Despite these advances, further work remains essential to fully capture the intricacies of human-like vision in computational models.
