Dual-Attention Vision Transformer (DaViT)
- Dual-Attention Vision Transformer (DaViT) is a model that combines spatial window and channel group self-attention for efficient local and global feature modeling.
- It leverages a dual-attention mechanism to alternate between fine-grained local feature refinement and global context aggregation while scaling linearly with image resolution and channels.
- DaViT achieves state-of-the-art trade-offs across image classification, object detection, and semantic segmentation benchmarks, outperforming comparable models like Swin Transformer.
Dual-Attention Vision Transformers (DaViT) constitute a vision transformer architecture designed to balance global context modeling and computational efficiency by leveraging two orthogonal types of self-attention: spatial window attention and channel group attention. This dual-attention scheme enables the architecture to alternate between fine-grained local feature refinement and global feature interactions, providing significant gains in image classification, object detection, and semantic segmentation benchmarks. DaViT achieves state-of-the-art trade-offs between accuracy, parameter count, and computational cost, while scaling linearly with both spatial resolution and channel dimension (Ding et al., 2022).
1. Dual-Attention Design and Mechanisms
DaViT alternates two distinct but complementary forms of self-attention within each block:
- Spatial window attention: Operates over “spatial tokens,” where each token corresponds to a spatial location (e.g., an image patch).
- Channel group attention: Operates over “channel tokens,” where each token corresponds to a channel (feature map dimension), enabling global spatial aggregation.
1.1 Spatial Tokens and Window Self-Attention
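Given a feature map $X \in \mathbb{R}^{P \times C}$, with $P$ spatial positions and $C$ channels, each spatial token is a row $x_p \in \mathbb{R}^{C}$. The tokens are partitioned into $N_w$ non-overlapping windows of size $P_w$ such that $P = N_w \cdot P_w$. Within each window $X_i \in \mathbb{R}^{P_w \times C}$, multi-head self-attention is computed independently:
For head $h$ (of $N_h$ heads, each of width $C_h = C / N_h$):
$$Q_i^{(h)} = X_i W_Q^{(h)}, \quad K_i^{(h)} = X_i W_K^{(h)}, \quad V_i^{(h)} = X_i W_V^{(h)}$$
$$\mathrm{head}_i^{(h)} = \mathrm{softmax}\!\left(\frac{Q_i^{(h)} \left(K_i^{(h)}\right)^{\top}}{\sqrt{C_h}}\right) V_i^{(h)}$$
Aggregated output per window:
$$Y_i = \mathrm{Concat}\!\left(\mathrm{head}_i^{(1)}, \ldots, \mathrm{head}_i^{(N_h)}\right) W_O$$
The attention complexity per block is $O(P \cdot P_w \cdot C)$, linear in $P$ as $P_w$ is a fixed window size.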
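The window attention described above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function name, the fused `w_qkv` projection, and the assumption that tokens are already ordered so that each consecutive run of `window_size` rows forms one window are all choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, w_qkv, w_out, window_size, num_heads):
    """Multi-head self-attention restricted to non-overlapping windows.

    x: (P, C) spatial tokens, assumed pre-ordered window by window.
    w_qkv: (C, 3C) fused query/key/value projection (hypothetical layout).
    w_out: (C, C) output projection.
    """
    P, C = x.shape
    assert P % window_size == 0 and C % num_heads == 0
    n_win, c_head = P // window_size, C // num_heads
    qkv = (x @ w_qkv).reshape(n_win, window_size, 3, num_heads, c_head)
    q, k, v = np.moveaxis(qkv, 2, 0)            # each (n_win, P_w, heads, c_head)
    q, k, v = (t.transpose(0, 2, 1, 3) for t in (q, k, v))  # (n_win, heads, P_w, c_head)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(c_head)  # (n_win, heads, P_w, P_w)
    out = softmax(scores) @ v                   # (n_win, heads, P_w, c_head)
    out = out.transpose(0, 2, 1, 3).reshape(P, C)  # concatenate heads per token
    return out @ w_out

# Illustrative sizes only: 16 tokens, 8 channels, windows of 4 tokens, 2 heads.
rng = np.random.default_rng(0)
P, C, P_w, H = 16, 8, 4, 2
x = rng.standard_normal((P, C))
w_qkv = rng.standard_normal((C, 3 * C))
w_out = rng.standard_normal((C, C))
y = window_attention(x, w_qkv, w_out, P_w, H)
```

Each attention matrix has shape `(P_w, P_w)` regardless of `P`, which is why the cost grows linearly with the number of tokens. For a real image the windows are two-dimensional, so the tokens would first have to be rearranged into window order.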
1.2 Channel Tokens and Grouped Self-Attention
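Transposing $X \in \mathbb{R}^{P \times C}$ yields $X^{\top} \in \mathbb{R}^{C \times P}$, where each “channel token” is the $c$th row. Instead of global attention (quadratic in $C$), DaViT divides the $C$ channels into $N_g$ groups of size $C_g$, so that $C = N_g \cdot C_g$.
For each group $g$, the query, key, and value projections are restricted to that group's channels:
$$Q_g,\; K_g,\; V_g \in \mathbb{R}^{P \times C_g}$$
Single-head global attention is then applied within each group:
$$A_g = \mathrm{softmax}\!\left(\frac{Q_g^{\top} K_g}{\sqrt{C_g}}\right) \in \mathbb{R}^{C_g \times C_g}$$
$$Y_g = \left(A_g V_g^{\top}\right)^{\top} \in \mathbb{R}^{P \times C_g}$$
The outputs are concatenated:
$$Y = \mathrm{Concat}\!\left(Y_1, \ldots, Y_{N_g}\right) \in \mathbb{R}^{P \times C}$$
Channel attention thus models global context by letting each token aggregate information from all spatial positions, while grouping reduces the computational complexity in $C$.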
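A minimal NumPy sketch of grouped channel attention follows. It is an illustration under stated assumptions, not the reference code: the function name, the fused `w_qkv` projection, and the $1/\sqrt{C_g}$ scaling are choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_group_attention(x, w_qkv, w_out, num_groups):
    """Single-head attention over channel tokens, computed per channel group.

    x: (P, C) spatial tokens; w_qkv: (C, 3C); w_out: (C, C).
    """
    P, C = x.shape
    assert C % num_groups == 0
    c_g = C // num_groups
    qkv = (x @ w_qkv).reshape(P, 3, num_groups, c_g)
    q, k, v = np.moveaxis(qkv, 1, 0)                      # each (P, groups, c_g)
    q, k, v = (t.transpose(1, 0, 2) for t in (q, k, v))   # (groups, P, c_g)
    # Channel-to-channel affinities: every entry sums over all P positions.
    attn = softmax(q.transpose(0, 2, 1) @ k / np.sqrt(c_g))  # (groups, c_g, c_g)
    out = (attn @ v.transpose(0, 2, 1)).transpose(0, 2, 1)   # (groups, P, c_g)
    out = out.transpose(1, 0, 2).reshape(P, C)            # concatenate groups
    return out @ w_out

# Illustrative sizes only: 16 tokens, 8 channels, 2 groups of 4 channels.
rng = np.random.default_rng(0)
P, C, G = 16, 8, 2
x = rng.standard_normal((P, C))
w_qkv = rng.standard_normal((C, 3 * C))
w_out = rng.standard_normal((C, C))
y = channel_group_attention(x, w_qkv, w_out, G)
```

The attention matrix is `(c_g, c_g)` per group, so its size does not depend on the number of spatial positions, yet each of its entries is a sum over all of them. Perturbing a single position therefore changes the output everywhere, which is the sense in which this attention is global.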
2. Token Grouping and Complexity Analysis
DaViT's efficiency arises from structured grouping in both attention types. Let $P$ be the number of spatial tokens and $C$ the channel dimension, with $P = N_w \cdot P_w$ ($N_w$ windows of size $P_w$) and $C = N_g \cdot C_g$ ($N_g$ groups of size $C_g$).
| Attention Type | Pre-Grouping Complexity | Post-Grouping Complexity |
|---|---|---|
| Spatial attention | $O(P^2 C)$ | $O(P \cdot P_w \cdot C)$ |
| Channel attention | $O(P C^2)$ | $O(P \cdot C_g \cdot C)$ |
With fixed $P_w$ and $C_g$, both attention mechanisms scale linearly in $P$ and $C$, supporting efficient processing of high-resolution images and wide networks (Ding et al., 2022).
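The scaling behaviour can be checked with a few lines of arithmetic. The sketch below assumes leading-order costs of the form tokens × attended-set size × channels for the attention products only, ignoring projection layers and constant factors. The 56×56 token grid, 7×7 window, and 32 channels per group are illustrative values, not figures from the source.

```python
def attention_cost(P, C, P_w, C_g):
    """Leading-order multiply counts for the attention products only.

    Assumed cost model (constants and projections dropped):
    global spatial P*P*C, windowed P*P_w*C,
    global channel P*C*C, grouped P*C_g*C.
    """
    return {
        "spatial_global": P * P * C,
        "spatial_window": P * P_w * C,
        "channel_global": P * C * C,
        "channel_group": P * C_g * C,
    }

base = attention_cost(P=56 * 56, C=96, P_w=49, C_g=32)
# Doubling image height and width quadruples the token count.
double_res = attention_cost(P=4 * 56 * 56, C=96, P_w=49, C_g=32)
# Doubling the channel width with the group size held fixed.
wide = attention_cost(P=56 * 56, C=192, P_w=49, C_g=32)

for name in base:
    print(name, double_res[name] / base[name], wide[name] / base[name])
```

Under this cost model, quadrupling the token count multiplies the windowed cost by 4 but the global spatial cost by 16, and doubling the width multiplies the grouped channel cost by 2 but the global channel cost by 4.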
3. Architectural Variants and Network Staging
DaViT’s architecture is arranged in four stages, each comprising patch embedding followed by several alternating dual-attention blocks. The model’s four principal variants are:
| Variant | Base channels C | Layers L (per stage) | Heads $N_h$ / groups $N_g$ per stage | Params (M) | FLOPs (G) | ImageNet Top-1 (%) |
|---|---|---|---|---|---|---|
| DaViT-Tiny | 96 | 1,1,3,1 | 3,6,12,24 | 28.3 | 4.5 | 82.8 |
| DaViT-Small | 96 | 1,1,9,1 | 3,6,12,24 | 49.7 | 8.8 | 84.2 |
| DaViT-Base | 128 | 1,1,9,1 | 4,8,16,32 | 87.9 | 15.5 | 84.6 |
| DaViT-Giant | 384 | 1,1,12,3 | 12,24,48,96 | ~1,440 | 1,038 | 90.4* |
*pre-trained on 1.5B image-text pairs; Top-1 is on ImageNet-1K.
The number of window-attention heads $N_h$ and the number of channel groups $N_g$ are matched in each stage. DaViT-Giant additionally relies on large-scale weakly supervised pretraining to reach its reported accuracy.
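The variant table can be captured as a small configuration structure. The class and field names below are hypothetical. The `stage_dims` helper assumes that the channel width doubles from one stage to the next, a common convention in hierarchical vision transformers that the table itself does not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DaViTConfig:
    embed_dim: int   # base channel width C from the table
    depths: tuple    # layers per stage
    heads: tuple     # window heads per stage (= channel groups per stage)

    def stage_dims(self):
        # Assumption: width doubles at every stage (C, 2C, 4C, 8C).
        return tuple(self.embed_dim * 2 ** i for i in range(len(self.depths)))

    def channels_per_head(self):
        # Channels handled by one head or one group under that assumption.
        return tuple(d // h for d, h in zip(self.stage_dims(), self.heads))

VARIANTS = {
    "tiny": DaViTConfig(96, (1, 1, 3, 1), (3, 6, 12, 24)),
    "small": DaViTConfig(96, (1, 1, 9, 1), (3, 6, 12, 24)),
    "base": DaViTConfig(128, (1, 1, 9, 1), (4, 8, 16, 32)),
    "giant": DaViTConfig(384, (1, 1, 12, 3), (12, 24, 48, 96)),
}

for name, cfg in VARIANTS.items():
    print(name, cfg.stage_dims(), cfg.channels_per_head())
```

Under the doubling assumption, every head and every channel group spans 32 channels in all four variants. This is consistent with keeping the group size fixed while the network widens.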
4. Training Protocols and Hyperparameters
4.1 ImageNet-1K Image Classification
- 300 epochs, batch size 2048
- AdamW optimizer, weight decay 0.05, gradient norm clip 1.0
- Learning rate: triangular schedule (linear warmup to a peak value, then linear decay)
- Data augmentation and regularization mirror DeiT (excluding repeated augmentation and EMA)
- Stochastic depth: rates of 0.1 (Tiny), 0.2 (Small), 0.4 (Base)
- Training uses random crops of $224 \times 224$; evaluation uses a center crop
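A triangular schedule of the kind listed above can be sketched as follows. The function name, the warmup length, and the unit peak value are placeholders chosen for illustration and are not the settings used to train DaViT.

```python
def triangular_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Placeholder values: 100 steps, 20 warmup steps, peak normalised to 1.0.
schedule = [triangular_lr(s, total_steps=100, warmup_steps=20, peak_lr=1.0)
            for s in range(101)]
```

The schedule rises for the first 20 steps, peaks at step 20, and reaches zero at the final step. In practice `peak_lr` would be replaced by the chosen peak learning rate and the step counts derived from the epoch budget and batch size.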
4.2 COCO 2017 Object Detection
- DaViT backbones are integrated into RetinaNet and Mask R-CNN
- Schedules: 1× (12 epochs) and 3× (36 epochs), with multi-scale training over a range of image side lengths
- AdamW optimizer, weight decay 0.05, stochastic depth as above
- FLOPs reported at a fixed input resolution
4.3 ADE20K Semantic Segmentation
- UPerNet framework, $512 \times 512$ input
- 160,000 iterations, batch size 16
- Other hyperparameters consistent with established segmentation approaches, weight decay 0.05
5. Empirical Evaluation and Comparisons
DaViT demonstrates state-of-the-art performance across image recognition, detection, and segmentation tasks, outperforming Swin Transformer models given matched model size and FLOPs.
5.1 ImageNet-1K Classification
| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| DaViT-Tiny | 28.3 | 4.5 | 82.8 |
| Swin-Tiny | 28.3 | 4.5 | 81.2 |
| DaViT-Small | 49.7 | 8.8 | 84.2 |
| Swin-Small | 49.6 | 8.7 | 83.1 |
| DaViT-Base | 87.9 | 15.5 | 84.6 |
| Swin-Base | 87.8 | 15.4 | 83.4 |
With ImageNet-22K pretraining, DaViT-Base reaches 86.9% (vs. 86.4% for Swin-Base). When pretrained on large-scale weakly supervised data, DaViT-Huge and DaViT-Giant reach 90.2% and 90.4%, respectively.
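The size of the ImageNet-1K gap at each scale follows directly from the table in this section. The short script below only restates those numbers; the dictionary layout is an arbitrary choice for this example.

```python
# (params in M, FLOPs in G, Top-1 in %) copied from the table above.
results = {
    "Tiny":  {"DaViT": (28.3, 4.5, 82.8), "Swin": (28.3, 4.5, 81.2)},
    "Small": {"DaViT": (49.7, 8.8, 84.2), "Swin": (49.6, 8.7, 83.1)},
    "Base":  {"DaViT": (87.9, 15.5, 84.6), "Swin": (87.8, 15.4, 83.4)},
}

gains = {}
for scale, pair in results.items():
    d_params, d_flops, d_top1 = pair["DaViT"]
    s_params, s_flops, s_top1 = pair["Swin"]
    gains[scale] = round(d_top1 - s_top1, 1)
    print(f"{scale}: +{gains[scale]} Top-1, "
          f"+{round(d_params - s_params, 1)} M params, "
          f"+{round(d_flops - s_flops, 1)} G FLOPs")
```

The Top-1 gains are 1.6, 1.1, and 1.2 points for the Tiny, Small, and Base scales, with parameter counts and FLOPs within 0.1 of the Swin counterparts in every case.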
5.2 COCO 2017 Detection, ADE20K Segmentation
| Task | Model | Box AP / mIoU | Improvement over Swin |
|---|---|---|---|
| COCO RetinaNet 3× | DaViT-Tiny | 46.5 | +1.5 |
| COCO RetinaNet 3× | DaViT-Small | 48.2 | +1.8 |
| COCO RetinaNet 3× | DaViT-Base | 48.7 | +2.9 |
| COCO Mask R-CNN 3× | DaViT-Tiny | 47.4 | +1.4 |
| COCO Mask R-CNN 3× | DaViT-Small | 49.5 | +1.0 |
| COCO Mask R-CNN 3× | DaViT-Base | 49.9 | +1.4 |
| ADE20K (mIoU) | DaViT-Tiny | 46.3 | +1.8 |
| ADE20K (mIoU) | DaViT-Small | 48.8 | +1.2 |
| ADE20K (mIoU) | DaViT-Base | 49.4 | +1.3 |
DaViT consistently surpasses Swin Transformers in accuracy on all evaluated tasks at comparable parameter counts and FLOPs (Ding et al., 2022).
6. Synthesis and Implications
DaViT’s combination of local spatial window attention and global channel group attention enables explicit modeling of both fine-grained local structures and holistic contextual dependencies. Because channel attention integrates spatial information globally within each token and spatial attention preserves local relationships, these mechanisms are complementary. The architectural strategy of alternating and grouping tokens ensures all attention remains linear in both spatial and channel dimensions, supporting scaling to larger images and models. This suggests extensibility to high-resolution vision tasks and large-scale pretrained regimes, as manifested in DaViT-Giant’s results on ImageNet-1K following weakly supervised training with 1.5B image-text pairs.
DaViT demonstrates that a dual-attention transformer design, grounded in efficient grouping and alternation, can achieve superior trade-offs among computational cost, parameter count, and task accuracy across a range of modern vision benchmarks (Ding et al., 2022).