CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention
The paper presents CrossFormer++, a vision transformer architecture built around cross-scale attention. The core idea is to incorporate multi-scale features directly within the transformer layers through two key components, the Cross-scale Embedding Layer (CEL) and Long-Short Distance Attention (LSDA), complemented by the additional modules listed below. Together these components aim to improve the transformer’s ability to handle visual inputs of varying sizes and complexities.
Key Contributions
- Cross-scale Embedding Layer (CEL): CEL embeds the input at multiple scales by applying convolutions with different kernel sizes to the same patches and concatenating their outputs. This captures fine and coarse details concurrently, so multi-scale features enter the model explicitly from the very first layer (a minimal sketch follows this list).
- Long-Short Distance Attention (LSDA): LSDA splits conventional self-attention into two parts: Short Distance Attention (SDA), which groups adjacent embeddings to model local dependencies, and Long Distance Attention (LDA), which groups embeddings sampled at a fixed interval to capture global interactions. This split reduces computational complexity while preserving feature relationships across scales (see the grouping sketch after the list).
- Progressive Group Size (PGS): Motivated by the observation that attention maps shift from a local to a global focus as network depth increases, PGS enlarges the group size used by the self-attention blocks progressively across stages rather than keeping it fixed (an illustrative stage configuration appears below).
- Amplitude Cooling Layer (ACL): To counteract amplitude explosion, a common issue in deep networks where activation magnitudes grow layer by layer, ACL inserts lightweight convolutional layers that damp extreme activation scales and thereby stabilize training (a sketch follows below).
- Dynamic Position Bias (DPB): DPB is a flexible positional encoding mechanism that generates relative position biases on the fly, making it adaptable to variable input sizes and group configurations and improving the model’s ability to generalize across diverse visual tasks (sketched at the end of this list).
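A minimal sketch of a cross-scale embedding layer is shown below: several convolutions with a shared stride but different kernel sizes run over the same input, and their outputs are concatenated along the channel dimension. The kernel sizes, stride, and equal channel split are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Sketch of a cross-scale embedding layer (CEL).

    Each branch samples patches with the same stride but a different
    kernel size, so every output token mixes fine and coarse context.
    """
    def __init__(self, in_chans=3, embed_dim=96, stride=4,
                 kernel_sizes=(4, 8, 16, 32)):
        super().__init__()
        # Split the embedding dimension evenly across branches (illustrative).
        dims = [embed_dim // len(kernel_sizes)] * len(kernel_sizes)
        dims[0] += embed_dim - sum(dims)  # absorb any remainder
        self.projs = nn.ModuleList([
            nn.Conv2d(in_chans, d, kernel_size=k, stride=stride,
                      padding=(k - stride) // 2)
            for k, d in zip(kernel_sizes, dims)
        ])

    def forward(self, x):                      # x: (B, C, H, W)
        feats = [proj(x) for proj in self.projs]
        return torch.cat(feats, dim=1)         # (B, embed_dim, H/stride, W/stride)
```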
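The two grouping rules behind LSDA can be sketched as pure tensor reshapes: SDA gathers adjacent G×G neighborhoods, while LDA gathers tokens that are a fixed interval I apart, so each LDA group spans the whole feature map. Function names and shapes are assumptions for illustration.

```python
import torch

def short_distance_groups(x, G):
    """SDA-style partition: group adjacent G x G embeddings.

    x: (B, H, W, C) with H and W divisible by G.
    Returns (B * num_groups, G * G, C).
    """
    B, H, W, C = x.shape
    x = x.view(B, H // G, G, W // G, G, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, G * G, C)
    return x

def long_distance_groups(x, I):
    """LDA-style partition: group embeddings sampled at interval I.

    Tokens that are I positions apart land in the same group, so each
    group covers the whole feature map and models long-range interactions.
    """
    B, H, W, C = x.shape
    G_h, G_w = H // I, W // I                  # tokens per group along each axis
    x = x.view(B, G_h, I, G_w, I, C)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(-1, G_h * G_w, C)
    return x
```

In a full block, standard multi-head self-attention would then run inside each group, and the inverse reshape restores the spatial layout before the next layer.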
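Progressive group size amounts to a per-stage schedule rather than new machinery. The configuration below is only a placeholder illustrating the idea of growing group sizes with depth; the depths, dimensions, and group sizes are not the paper’s released settings.

```python
# Illustrative stage configuration: the group size used by SDA/LDA grows
# with network depth, mirroring the local-to-global shift of attention maps.
stage_configs = [
    {"depth": 2, "dim": 96,  "group_size": 4},   # early stage: small, local groups
    {"depth": 2, "dim": 192, "group_size": 8},
    {"depth": 6, "dim": 384, "group_size": 14},
    {"depth": 2, "dim": 768, "group_size": 7},   # final stage: group spans the whole map
]
```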
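One plausible reading of the amplitude cooling idea is a lightweight depth-wise convolution with a residual connection inserted between blocks to smooth extreme activation magnitudes. The sketch below follows that assumption; the exact placement, normalization, and kernel size are not taken from the paper’s released code.

```python
import torch
import torch.nn as nn

class AmplitudeCoolingLayer(nn.Module):
    """Sketch of an amplitude cooling layer (ACL): a cheap depth-wise
    convolution that smooths per-channel activation magnitudes."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, H, W):                 # x: (B, H*W, C)
        B, N, C = x.shape
        y = self.norm(x).transpose(1, 2).reshape(B, C, H, W)
        y = self.dwconv(y).flatten(2).transpose(1, 2)
        return x + y                            # residual path keeps training stable
```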
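Dynamic position bias can be sketched as a small MLP that maps each relative offset to one bias value per attention head, so no fixed-size bias table is needed and arbitrary group sizes are supported. The hidden width, MLP depth, and offset normalization below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DynamicPositionBias(nn.Module):
    """Sketch of dynamic position bias (DPB): an MLP generates a relative
    position bias per head for any group size at runtime."""
    def __init__(self, num_heads, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_heads),
        )

    def forward(self, group_size):
        # All pairwise relative offsets between positions in a G x G group.
        coords = torch.arange(group_size)
        grid = torch.stack(torch.meshgrid(coords, coords, indexing="ij"), dim=-1)
        grid = grid.reshape(-1, 2).float()               # (G*G, 2)
        rel = grid[:, None, :] - grid[None, :, :]        # (G*G, G*G, 2)
        bias = self.mlp(rel / max(group_size - 1, 1))    # normalized offsets
        return bias.permute(2, 0, 1)                     # (heads, G*G, G*G), added to logits
```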
Experimental Outcomes
CrossFormer++ outperforms its predecessor and other contemporary vision backbones. Evaluations on ImageNet classification show consistent accuracy improvements across model configurations, and on COCO object detection CrossFormer++ surpasses strong baselines such as Swin Transformer and ViL. More broadly, it delivers robust results on dense prediction tasks, including object detection and semantic segmentation, evidencing the efficacy of cross-scale attention.
Implications and Future Directions
The introduction of CEL and LSDA marks a significant step in adapting transformer architectures to vision tasks. By emphasizing cross-scale interactions, CrossFormer++ aligns more closely with the inherently multi-scale nature of visual perception. Future research may focus on automating parameter choices, for example adaptive group size policies derived through neural architecture search, and on integrating self-supervised pretraining schemes to further broaden the model’s applicability and performance.
CrossFormer++’s adaptability and efficiency hold substantial promise for increasingly complex vision applications, and suggest potential extensions to real-time processing and mobile computing platforms where computational resources are constrained.