CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention (2108.00154v2)

Published 31 Jul 2021 in cs.CV and cs.LG

Abstract: Transformers have made great progress in dealing with computer vision tasks. However, existing vision transformers do not yet possess the ability of building the interactions among features of different scales, which is perceptually important to visual inputs. The reasons are two-fold: (1) Input embeddings of each layer are equal-scale, so no cross-scale feature can be extracted; (2) to lower the computational cost, some vision transformers merge adjacent embeddings inside the self-attention module, thus sacrificing small-scale (fine-grained) features of the embeddings and also disabling the cross-scale interactions. To this end, we propose Cross-scale Embedding Layer (CEL) and Long Short Distance Attention (LSDA). On the one hand, CEL blends each embedding with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the embeddings. Through the above two designs, we achieve cross-scale attention. Besides, we put forward a dynamic position bias for vision transformers to make the popular relative position bias apply to variable-sized images. Hinging on the cross-scale attention module, we construct a versatile vision architecture, dubbed CrossFormer, which accommodates variable-sized inputs. Extensive experiments show that CrossFormer outperforms the other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks. The code has been released: https://github.com/cheerss/CrossFormer.

Authors (7)
  1. Wenxiao Wang (63 papers)
  2. Lu Yao (6 papers)
  3. Long Chen (395 papers)
  4. Binbin Lin (50 papers)
  5. Deng Cai (181 papers)
  6. Xiaofei He (70 papers)
  7. Wei Liu (1135 papers)
Citations (221)

Summary

CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention

The paper presents CrossFormer, a novel vision transformer architecture designed to enhance feature interactions across multiple scales in computer vision tasks. CrossFormer addresses limitations in existing vision transformers that lack the capability to integrate features across different scales. This deficiency arises mainly because existing models operate with equal-sized input embeddings and subsequently merge adjacent embeddings to reduce computational complexity, which sacrifices fine-grained features.

Key Contributions

  1. Cross-scale Embedding Layer (CEL): This layer generates each embedding by sampling patches with kernels of varying sizes and concatenating the results, so every embedding carries cross-scale features and the self-attention module receives richer inputs (a minimal implementation sketch follows this list).
  2. Long Short Distance Attention (LSDA): LSDA splits the self-attention mechanism into a short-distance module and a long-distance module, retaining both small- and large-scale features. Attention is apportioned between neighboring (short-distance) and widely spaced (long-distance) embeddings, balancing computational efficiency with feature integration (see the grouping sketch below).
  3. Dynamic Position Bias (DPB): A trainable module computes relative position biases on the fly, extending the popular relative position bias to variable image and group sizes and improving flexibility across diverse visual inputs (also sketched below).
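
The following is a minimal PyTorch-style sketch of a cross-scale embedding layer under the assumptions stated in the comments; the kernel sizes, stride, and channel split are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Minimal sketch of a cross-scale embedding layer (CEL).

    Each embedding is built by sampling the same spatial region with
    convolutions of several kernel sizes (all sharing one stride) and
    concatenating the results along the channel dimension. Kernel sizes
    and the per-kernel channel split are illustrative.
    """

    def __init__(self, in_chans=3, embed_dim=96, kernel_sizes=(4, 8, 16, 32), stride=4):
        super().__init__()
        # Split the embedding dimension evenly across kernels (illustrative split).
        dims = [embed_dim // len(kernel_sizes)] * len(kernel_sizes)
        dims[0] += embed_dim - sum(dims)  # absorb any rounding remainder
        self.projs = nn.ModuleList(
            nn.Conv2d(in_chans, d, kernel_size=k, stride=stride, padding=(k - stride) // 2)
            for k, d in zip(kernel_sizes, dims)
        )

    def forward(self, x):                          # x: (B, C, H, W)
        feats = [proj(x) for proj in self.projs]   # each: (B, d_i, H/stride, W/stride)
        return torch.cat(feats, dim=1)             # (B, embed_dim, H/stride, W/stride)


# Usage: a 224x224 image becomes a 56x56 grid of cross-scale embeddings.
x = torch.randn(1, 3, 224, 224)
print(CrossScaleEmbedding()(x).shape)  # torch.Size([1, 96, 56, 56])
```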

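The grouping behind LSDA can be illustrated with the reshapes below: short-distance attention (SDA) groups adjacent G x G embeddings, while long-distance attention (LDA) groups embeddings sampled at a fixed interval I, so each group spans the whole feature map. This is a sketch of the grouping only, with illustrative group sizes; the attention itself is standard multi-head self-attention applied within each group.

```python
import torch

def sda_groups(x, G=7):
    """Short-distance attention grouping: each G x G window of adjacent
    embeddings forms one attention group.

    x: (B, H, W, C) embeddings, with H and W divisible by G.
    Returns (B * H//G * W//G, G*G, C).
    """
    B, H, W, C = x.shape
    x = x.view(B, H // G, G, W // G, G, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, G * G, C)


def lda_groups(x, I=8):
    """Long-distance attention grouping: embeddings sampled at a fixed
    interval I (one per I x I neighbourhood) form one group, so every group
    spans the full feature map.

    Returns (B * I * I, (H//I) * (W//I), C).
    """
    B, H, W, C = x.shape
    x = x.view(B, H // I, I, W // I, I, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, (H // I) * (W // I), C)


# Usage: attention is applied independently within each group, which is far
# cheaper than global attention over all H*W embeddings.
x = torch.randn(2, 56, 56, 96)
print(sda_groups(x).shape)  # torch.Size([128, 49, 96])
print(lda_groups(x).shape)  # torch.Size([128, 49, 96])
```

For DPB, a hedged sketch follows: a small MLP maps each relative offset to a per-head bias, so the bias no longer has to be a fixed table tied to one input size. The hidden width and the lack of coordinate normalization are simplifications.

```python
import torch
import torch.nn as nn

class DynamicPositionBias(nn.Module):
    """Minimal sketch of dynamic position bias (DPB): an MLP generates the
    relative position bias for any group size, instead of indexing a fixed
    learned table. Hidden width is illustrative.
    """

    def __init__(self, num_heads, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_heads),
        )

    def forward(self, group_size):
        # Relative offsets (dy, dx) between every pair of positions in a
        # group_size x group_size attention group.
        coords = torch.arange(group_size)
        pos = torch.stack(torch.meshgrid(coords, coords, indexing="ij"), dim=-1).reshape(-1, 2)
        rel = (pos[:, None, :] - pos[None, :, :]).float()   # (G*G, G*G, 2)
        bias = self.mlp(rel)                                 # (G*G, G*G, num_heads)
        return bias.permute(2, 0, 1)                         # (num_heads, G*G, G*G), added to attention logits
```
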
Strong Numerical Results and Architectural Advancements

CrossFormer's architecture is organized into four stages in a pyramid structure that manages embedding dimensions and computational load efficiently. The numerical evaluation reveals substantial performance gains:

  • Image Classification: CrossFormer outperforms other vision transformers by a notable margin, achieving accuracy improvements of at least 1.2% over strong baselines such as DeiT, PVT, and Swin.
  • Object Detection: With RetinaNet and Mask R-CNN, CrossFormer achieves higher AP than competing backbones, with gains of up to 1.9% on object detection and significant improvements on instance segmentation.
  • Semantic Segmentation: The model demonstrates superior performance in dense prediction tasks, particularly in semantic segmentation benchmarks, signifying its robustness in processing complex visual data with variable input sizes.

Theoretical and Practical Implications

The introduction of CEL and LSDA offers a novel framework for leveraging cross-scale features more effectively, which is particularly crucial for tasks requiring fine-grained spatial and contextual information, such as object detection and segmentation. This approach theoretically enriches the representational capacity of vision transformers by enhancing their ability to model long-distance dependencies without compromising computational efficiency.

Future Directions in AI Research

The innovations in CrossFormer suggest several avenues for future research:

  • Scalability and Generalization: Future work could explore extending CrossFormer's cross-scale attention mechanisms to other domains, such as video processing or 3D modeling, where multi-scale interactions are essential.
  • Optimization Techniques: Further refinement of dynamic position bias or exploration of alternative position encoding methods may yield improvements in both accuracy and efficiency.
  • Integration with Other Architectures: CrossFormer's components might be integrated into hybrid models that combine transformers with CNNs or other neural network architectures to further enhance performance across different tasks.

In summary, CrossFormer marks a significant stride in vision transformer architecture with its focus on cross-scale attention. Its substantial improvements across vision tasks underscore the importance of integrating features at different scales, which could influence the future trajectory of AI research and applications.
