
DaViT: Dual Attention Vision Transformers (2204.03645v1)

Published 7 Apr 2022 in cs.CV

Abstract: In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.

Authors (6)
  1. Mingyu Ding (82 papers)
  2. Bin Xiao (93 papers)
  3. Noel Codella (21 papers)
  4. Ping Luo (340 papers)
  5. Jingdong Wang (236 papers)
  6. Lu Yuan (130 papers)
Citations (193)

Summary

DaViT: Dual Attention Vision Transformers

The paper "DaViT: Dual Attention Vision Transformers" introduces an innovative vision transformer architecture designed to efficiently capture both global and local contexts in high-resolution vision tasks. This architecture, known as Dual Attention Vision Transformers (DaViT), employs a dual attention mechanism integrating spatial and channel attentions.

Core Contributions

The primary contribution of this work is its dual attention mechanism, which combines two complementary self-attentions:

  1. Spatial Window Attention: This self-attention mechanism focuses on local regions by partitioning image features into non-overlapping windows, keeping the computational cost linear in the number of spatial positions.
  2. Channel Group Attention: This novel mechanism transposes the feature matrix so that channels, rather than spatial positions, serve as tokens. Each channel token encapsulates a global view of the image, allowing dynamic feature fusion across these image-wide tokens, and grouping the channels keeps the complexity linear in both the spatial and channel dimensions (a minimal sketch follows this list).
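
Below is a minimal PyTorch-style sketch of the channel group attention idea described in item 2: channels act as tokens whose features span all spatial positions, and channels are split into groups so each attention map stays small. The module name `ChannelGroupAttention`, the projection layout, and the scaling choice are assumptions made for illustration and may differ from the official implementation at https://github.com/dingmyu/davit.

```python
import torch
import torch.nn as nn


class ChannelGroupAttention(nn.Module):
    """Sketch of channel-wise attention with grouped channels (linear in N)."""

    def __init__(self, dim: int, groups: int = 8):
        super().__init__()
        assert dim % groups == 0, "dim must be divisible by groups"
        self.groups = groups
        self.group_dim = dim // groups
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N = H * W spatial positions and C channels.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.groups, self.group_dim)
        qkv = qkv.permute(2, 0, 3, 4, 1)           # (3, B, groups, group_dim, N)
        q, k, v = qkv[0], qkv[1], qkv[2]           # channel tokens, each of length N

        # Attention between channels within a group: the attention map is only
        # (group_dim x group_dim), so the cost is linear in N. The 1/sqrt(N)
        # scale mirrors the usual 1/sqrt(d_k); the paper's exact scaling may differ.
        attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)
        attn = attn.softmax(dim=-1)
        out = attn @ v                              # (B, groups, group_dim, N)

        out = out.permute(0, 3, 1, 2).reshape(B, N, C)   # back to (B, N, C)
        return self.proj(out)
```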

The paper examines the complementary relationship between these two forms of attention: spatial window attention refines local details, while channel group attention aggregates global information across the entire image. Alternating the two yields strong modeling capacity at modest computational cost (see the composition sketch below).
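
The following sketch illustrates how the two attentions might alternate within one stage, reusing the `ChannelGroupAttention` module (and imports) from the sketch above. The window attention here is plain multi-head attention applied to non-overlapping windows (Swin-style); the module names, the residual-only block structure, and the omission of layer norms, MLPs, and relative position biases are simplifications for clarity, not the paper's exact block design.

```python
class SpatialWindowAttention(nn.Module):
    """Sketch of local attention within non-overlapping ws x ws windows."""

    def __init__(self, dim: int, window_size: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, N, C) with N = H * W; H and W assumed divisible by window_size.
        B, N, C = x.shape
        ws = self.window_size
        # Partition the feature map into non-overlapping ws x ws windows.
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        x, _ = self.attn(x, x, x)                  # attention inside each window
        # Merge the windows back into a (B, N, C) sequence.
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return x


class DualAttentionStage(nn.Module):
    """Alternate local (spatial window) and global (channel group) attention."""

    def __init__(self, dim: int, window_size: int = 7, groups: int = 8):
        super().__init__()
        self.spatial = SpatialWindowAttention(dim, window_size)
        self.channel = ChannelGroupAttention(dim, groups)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        x = x + self.spatial(x, H, W)   # refine fine-grained local detail
        x = x + self.channel(x)         # aggregate image-wide context
        return x
```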

Empirical Evaluation

DaViT's effectiveness is demonstrated across multiple benchmarks:

  • Image Classification: On ImageNet-1K without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base reach 82.8%, 84.2%, and 84.6% top-1 accuracy with 28.3M, 49.7M, and 87.9M parameters, respectively; when scaled up and pre-trained on 1.5B weakly supervised image-text pairs, DaViT-Giant reaches 90.4% top-1 accuracy.
  • Object Detection and Segmentation: On COCO object detection and ADE20K semantic segmentation, DaViT consistently outperforms contemporary backbones at comparable computational cost, demonstrating strong scalability and accuracy.

Implications and Future Prospects

The introduction of channel group attention is a significant theoretical advancement, demonstrating how global contexts can be efficiently captured through dimension transposition and interactions within channel groups. This opens new avenues for exploring efficient self-attention mechanisms that decouple spatial and feature dimensions.

Practically, DaViT's favorable accuracy-compute trade-off positions it to influence the development of vision transformers intended for devices and scenarios where computational resources are constrained.

Speculation on AI Development

Given the computational efficiency and scalability of DaViT, it is poised to inspire further advances in transformer architectures beyond visual processing. Future research might explore similar orthogonal attention mechanisms in text and multi-modal contexts, where decoupling the token and feature dimensions could yield comparable efficiency gains.

DaViT represents a notable step toward more adaptable and efficient vision models, highlighting the value of architectures that integrate local and global feature interactions without excessive computational cost.
