- The paper demonstrates that ImageNet-trained CNNs rely primarily on texture cues rather than shape, in contrast with humans, who classify predominantly by shape.
- Experiments using Stylized-ImageNet increased shape bias in ResNet-50 from 22% to 81%, significantly reducing error rates on distorted images.
- The findings imply that emphasizing shape bias in training improves model robustness and accuracy, benefiting real-world applications.
ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
The paper "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness" by Robert Geirhos et al. makes a significant contribution to our understanding of Convolutional Neural Networks (CNNs) and their inherent biases. Through a series of thorough experiments comparing human and CNN object recognition, the authors highlight that ImageNet-trained CNNs tend to rely more on textures rather than shapes, which contrasts with human visual processing that prioritizes shapes. This paper's insights into CNN biases hold important implications for the fields of computer vision and neural computation, suggesting opportunities for enhancing model robustness and aligning more closely with human perception.
Key Findings
The authors designed experiments to distinguish between texture-based and shape-based object recognition by using images that contain conflicting shape and texture cues. Their findings are twofold:
- Bias Toward Texture Over Shape in ImageNet-trained CNNs: The experiments revealed that CNNs such as ResNet-50, AlexNet, GoogLeNet, and VGG-16 predominantly rely on texture cues for classification. When presented with cue-conflict images, in which shape and texture point to different categories, the networks classified according to texture in the large majority of trials (more than 75% of the time for ResNet-50), whereas humans classified by shape over 95% of the time (a minimal shape-bias computation is sketched in the first code example after this list).
- Impact of Shape Bias on Accuracy and Robustness: The researchers created a variant of ImageNet called Stylized-ImageNet (SIN) by re-rendering each image in the style of a randomly selected painting via AdaIN style transfer, so that local texture is no longer a reliable cue while global object shape is preserved (the second sketch after this list shows the core AdaIN operation). Training CNNs on SIN produced models with a markedly higher shape bias, lower error rates on distorted images, and better object detection performance. Specifically, a ResNet-50 trained on SIN showed an 81% shape bias, up from the 22% of the same architecture trained on standard ImageNet.
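To make the shape-bias numbers above concrete, here is a minimal sketch of how such a metric can be computed from cue-conflict predictions. The helper function and the example labels are hypothetical placeholders; the paper's actual evaluation additionally maps the 1,000 ImageNet classes onto 16 entry-level categories, which is omitted here.

```python
# Sketch: shape bias from cue-conflict predictions. Each image has a shape label
# and a texture label (e.g. a cat shape rendered with elephant-skin texture).
# Shape bias is the fraction of *cue-consistent* decisions that agree with the
# shape label; predictions matching neither cue are ignored.

def shape_bias(predictions, shape_labels, texture_labels):
    """All arguments are equal-length lists of category names."""
    n_shape = n_texture = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            n_shape += 1
        elif pred == texture:
            n_texture += 1
    cue_consistent = n_shape + n_texture
    return n_shape / cue_consistent if cue_consistent else float("nan")

# Toy example: a texture-biased model scores well below 0.5 on this metric.
preds = ["elephant", "cat", "elephant", "clock"]          # model decisions
shapes = ["cat", "cat", "bottle", "bear"]                 # shape category per image
textures = ["elephant", "elephant", "elephant", "clock"]  # texture category per image
print(f"shape bias: {shape_bias(preds, shapes, textures):.2f}")  # 1 / 4 = 0.25
```

Stylized-ImageNet itself is generated with the AdaIN fast style transfer of Huang and Belongie, which re-renders each ImageNet photo in the style of a randomly chosen painting. The sketch below shows only the core AdaIN operation on feature maps, assuming a standard PyTorch setup; the full pipeline (a VGG encoder, a trained decoder, and a painting dataset) is omitted.

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5):
    """Adaptive Instance Normalization: rescale the content features so their
    per-channel mean/std match those of the style features (N, C, H, W tensors)."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True)
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)
    normalized = (content_feat - c_mean) / (c_std + eps)
    return normalized * s_std + s_mean

# In the full pipeline, adain() is applied to VGG features of a photo and a painting,
# and a decoder maps the result back to pixels, erasing the photo's original texture.
content, style = torch.randn(1, 512, 32, 32), torch.randn(1, 512, 32, 32)
stylized_feat = adain(content, style)
```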
Methodology
The paper presents an elaborate methodology encompassing both human and machine vision experiments:
- Psychophysical Experiments: Tightly controlled psychophysical experiments compared human and CNN responses to identical visual stimuli, totaling 48,560 trials across 97 observers. Humans showed a strong and consistent shape bias in these classification tasks.
- CNN Training and Evaluation Procedures: The authors trained standard CNN architectures on both ImageNet and SIN. Performance was evaluated via top-5 accuracy on validation data, robustness to a range of parametric distortions (noise, contrast changes, filtering, and others), and transfer learning to object detection on Pascal VOC and MS COCO (a hedged evaluation sketch follows this list).
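As an illustration of the robustness evaluation, the sketch below measures top-5 accuracy of a torchvision ResNet-50 under additive uniform pixel noise. The data path and noise width are placeholders, and the paper's own protocol sweeps several distortion types and severity levels with its own preprocessing; this is only one plausible instance of that procedure.

```python
import torch
from torchvision import datasets, models, transforms

# Sketch: top-5 accuracy under additive uniform pixel noise (one distortion level).
def add_uniform_noise(img: torch.Tensor, width: float = 0.2) -> torch.Tensor:
    noise = (torch.rand_like(img) - 0.5) * 2 * width   # uniform in [-width, width]
    return (img + noise).clamp(0.0, 1.0)

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Lambda(add_uniform_noise),   # distortion applied before normalization
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

val_set = datasets.ImageFolder("/path/to/imagenet/val", transform=preprocess)  # placeholder path
loader = torch.utils.data.DataLoader(val_set, batch_size=64, num_workers=4)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

correct = total = 0
with torch.no_grad():
    for images, targets in loader:
        top5 = model(images).topk(5, dim=1).indices               # (batch, 5)
        correct += (top5 == targets.unsqueeze(1)).any(dim=1).sum().item()
        total += targets.numel()
print(f"top-5 accuracy under noise: {correct / total:.3f}")
```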
Implications and Future Directions
The implications of this research are profound:
- Understanding Neural Network Decisions: The realization that CNNs rely heavily on textures informs the development of models that can better mimic human visual strategies by incorporating a shape bias.
- Robustness and Accuracy Improvement: Training on data in which texture is no longer predictive yields more robust models, particularly for real-world applications where visual distortions are common. The authors demonstrate that SIN-trained models not only perform better under a range of distortions but also transfer better to detection tasks (see the training sketch after this list).
- Applications in Practical AI Systems: Enhanced recognition accuracy and robustness are critical for deploying AI systems in dynamic environments, such as autonomous driving, medical imaging, and surveillance. The findings underscore the necessity for domain-appropriate training datasets that emphasize global shapes over local textures.
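The paper's best-performing recipe ("Shape-ResNet") trains jointly on SIN plus ImageNet and then fine-tunes on ImageNet. Below is a minimal sketch of the joint data stage, assuming both datasets are available as ImageFolder directories at placeholder paths; the schedule and hyperparameters are simplified relative to the paper.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms

# Sketch of the joint SIN+IN training stage; paths and hyperparameters are illustrative.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224), transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

imagenet = datasets.ImageFolder("/path/to/imagenet/train", transform=train_tf)
stylized = datasets.ImageFolder("/path/to/stylized-imagenet/train", transform=train_tf)
loader = DataLoader(ConcatDataset([imagenet, stylized]), batch_size=256,
                    shuffle=True, num_workers=8)

model = models.resnet50()  # trained from scratch in this recipe
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for images, targets in loader:       # one pass shown; real training runs many epochs
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
# A second stage would fine-tune on ImageNet alone, yielding the paper's Shape-ResNet.
```

Both ImageFolder datasets share the 1,000 alphabetically ordered ImageNet class folders, so their labels stay consistent when concatenated; that consistency is what makes the simple ConcatDataset mix valid.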
Conclusion
The paper by Geirhos et al. systematically dissects the biases within ImageNet-trained CNNs and provides a compelling argument for incorporating shape-biased training methods to enhance model performance. The introduction of Stylized-ImageNet as a training dataset presents a practical tool for this purpose. The study effectively bridges the gap between machine learning models and human visual perception, potentially guiding future research toward more human-like and robust AI systems. The exploration of shape bias in CNNs could pave the way for models with superior generalizability and resilience to real-world conditions, highlighting a pivotal step forward in computer vision research.