TransNeXt: Robust Foveal Visual Perception for Vision Transformers (2311.17132v3)

Published 28 Nov 2023 in cs.CV and cs.AI

Abstract: Due to the depth degradation effect in residual connections, many efficient Vision Transformer models that rely on stacking layers for information exchange often fail to form sufficient information mixing, leading to unnatural visual perception. To address this issue, in this paper, we propose Aggregated Attention, a biomimetic design-based token mixer that simulates biological foveal vision and continuous eye movement while enabling each token on the feature map to have a global perception. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which further diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. Our approach does not rely on stacking for information exchange, thus effectively avoiding depth degradation and achieving natural visual perception. Additionally, we propose Convolutional GLU, a channel mixer that bridges the gap between GLU and SE mechanism, which empowers each token to have channel attention based on its nearest neighbor image features, enhancing local modeling capability and model robustness. We combine aggregated attention and convolutional GLU to create a new visual backbone called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves state-of-the-art performance across multiple model sizes. At a resolution of $224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of $384^2$, a COCO object detection mAP of 57.1, and an ADE20K semantic segmentation mIoU of 54.7.

Summary

The paper presents Aggregated Attention, a foveal-inspired token mixer that gives each token global perception without relying on deep layer stacking for information exchange.
The model pairs it with Convolutional GLU, a channel mixer that gives each token channel attention based on its nearest-neighbor image features, improving local modeling and robustness.
TransNeXt reports state-of-the-art results: TransNeXt-Tiny reaches 84.0% ImageNet-1K accuracy at $224^2$, and TransNeXt-Base reaches 86.2% on ImageNet-1K and 61.6% on ImageNet-A at $384^2$.

Overview of "TransNeXt: Robust Foveal Visual Perception for Vision Transformers"

This paper introduces TransNeXt, a visual backbone designed to improve the robustness and accuracy of Vision Transformers (ViTs). Its core idea is a biomimetic design that simulates biological foveal vision, realized through two components: Aggregated Attention, which serves as the token mixer, and Convolutional GLU, which serves as the channel mixer.

The authors target a limitation of many efficient Vision Transformer models: the depth degradation effect in residual connections. Models that rely on stacking layers for information exchange often fail to mix information sufficiently, which leads to unnatural visual perception. Aggregated Attention avoids this dependence on stacking. Drawing on biological vision, it simulates foveal vision and continuous eye movement while giving every token on the feature map global perception. It also introduces learnable tokens that interact with the conventional queries and keys, so the affinity matrices are no longer generated solely from query-key similarity. Because this mixing happens within a single attention layer, the model does not need deep stacking to exchange information.
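The foveal idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: it is a simplified 1D, single-head version that assumes a fine-grained sliding window of keys around each query and average-pooled tokens as coarse global context, with both sets of logits sharing one softmax. The function name `foveal_attention_1d`, the `window` and `pool` parameters, and the use of average pooling are all illustrative assumptions. The learnable tokens and any positional bias are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def foveal_attention_1d(x, wq, wk, wv, window=3, pool=4):
    """Hypothetical 1D sketch. x: (n, d); wq, wk, wv: (d, d).

    Returns (out, weights), where weights has shape (n, window + n // pool).
    """
    n, d = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv

    # Coarse path: average-pool tokens into n // pool summary tokens.
    m = n // pool
    x_pooled = x[: m * pool].reshape(m, pool, d).mean(axis=1)
    k_glob, v_glob = x_pooled @ wk, x_pooled @ wv

    # Fine path: a sliding window of keys centred on each query.
    r = window // 2
    idx = np.arange(n)[:, None] + np.arange(-r, r + 1)[None, :]  # (n, window)
    valid = (idx >= 0) & (idx < n)          # positions outside the sequence
    idx = np.clip(idx, 0, n - 1)
    k_loc, v_loc = k[idx], v[idx]           # (n, window, d)

    scale = 1.0 / np.sqrt(d)
    logits_loc = np.einsum("nd,nwd->nw", q, k_loc) * scale
    logits_loc = np.where(valid, logits_loc, -np.inf)  # mask the padding
    logits_glob = (q @ k_glob.T) * scale               # (n, m)

    # One softmax over local and global logits, so the two paths compete.
    weights = softmax(np.concatenate([logits_loc, logits_glob], axis=1), axis=1)
    w_loc, w_glob = weights[:, :window], weights[:, window:]
    out = np.einsum("nw,nwd->nd", w_loc, v_loc) + w_glob @ v_glob
    return out, weights
```

Every query sees the whole sequence through the pooled tokens while keeping full resolution only near its own position, so global context is available within a single layer.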

Convolutional GLU is the second contribution: a channel mixer that bridges the gap between the Gated Linear Unit (GLU) and the Squeeze-and-Excitation (SE) mechanism. It gives each token channel attention computed from its nearest-neighbor image features, which strengthens local modeling capability and model robustness.
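A minimal NumPy sketch of this kind of channel mixer follows. It assumes the gating branch passes through a 3x3 depthwise convolution before a GELU activation, while the value branch stays per-token. That layout, the tanh GELU approximation, and all names (`conv_glu`, `w_gate`, `w_value`, `dw_kernel`, `w_out`) are assumptions made for illustration and should not be read as the authors' code.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv3x3(x, kernel):
    """x: (H, W, C); kernel: (3, 3, C). Zero padding, stride 1."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            # Each channel is filtered independently (depthwise).
            out += padded[i:i + h, j:j + w, :] * kernel[i, j, :]
    return out

def conv_glu(x, w_gate, w_value, dw_kernel, w_out):
    """Hypothetical sketch. x: (H, W, C_in) -> (H, W, C_out)."""
    # Gate: each token's channel gate is computed from its 3x3 neighbourhood.
    gate = gelu(depthwise_conv3x3(x @ w_gate, dw_kernel))
    # Value: plain per-token linear projection.
    value = x @ w_value
    # Element-wise gating, then projection back to the output width.
    return (gate * value) @ w_out
```

Because only the gate sees neighboring positions, changing one input token affects the output only within its 3x3 neighborhood, which is one way to read "channel attention based on nearest-neighbor features."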

The authors evaluate TransNeXt on image classification, object detection, and semantic segmentation, and report state-of-the-art results across multiple model sizes. At a resolution of $224^2$, TransNeXt-Tiny reaches 84.0% ImageNet-1K accuracy, surpassing ConvNeXt-B with 69% fewer parameters. At $384^2$, TransNeXt-Base reaches 86.2% on ImageNet-1K and 61.6% on the more challenging ImageNet-A, which indicates robustness under both standard and adverse test conditions. TransNeXt-Base also reaches a COCO object detection mAP of 57.1 and an ADE20K semantic segmentation mIoU of 54.7.

The design choices in TransNeXt reflect the influence of human visual modeling on network architecture. By aligning its attention mechanism with the principles of foveal vision, the model aims for natural visual perception and avoids the insufficient information mixing associated with deep layer stacking.

These findings have implications beyond this one architecture. They suggest that biologically inspired designs can help close gaps in how AI systems model visual perception. Future work could explore further biologically inspired mechanisms for multisensory and adaptive perception.

Overall, the paper contributes an architecture for Vision Transformers that balances robustness against computational cost, and it offers a strong reference point for future research on scalable and adaptable visual backbones.

