Dual-Attention Vision Transformer (DaViT)

Updated 5 June 2026
  • Dual-Attention Vision Transformer (DaViT) is a model that combines spatial window and channel group self-attention for efficient local and global feature modeling.
  • It leverages a dual-attention mechanism to alternate between fine-grained local feature refinement and global context aggregation while scaling linearly with image resolution and channels.
  • DaViT achieves state-of-the-art trade-offs across image classification, object detection, and semantic segmentation benchmarks, outperforming comparable models like Swin Transformer.

Dual-Attention Vision Transformers (DaViT) constitute a vision transformer architecture designed to balance global context modeling and computational efficiency by leveraging two orthogonal types of self-attention: spatial window attention and channel group attention. This dual-attention scheme enables the architecture to alternate between fine-grained local feature refinement and global feature interactions, providing significant gains in image classification, object detection, and semantic segmentation benchmarks. DaViT achieves state-of-the-art trade-offs between accuracy, parameter count, and computational cost, while scaling linearly with both spatial resolution and channel dimension (Ding et al., 2022).

1. Dual-Attention Design and Mechanisms

DaViT alternates two distinct but complementary forms of self-attention within each block:

  • Spatial window attention: Operates over “spatial tokens,” where each token corresponds to a spatial location (e.g., an image patch).
  • Channel group attention: Operates over “channel tokens,” where each token corresponds to a channel (feature map dimension), enabling global spatial aggregation.

1.1 Spatial Tokens and Window Self-Attention

Given a feature map $X \in \mathbb{R}^{P \times C}$, with $P$ spatial positions and $C$ channels, each spatial token is a row $x_p \in \mathbb{R}^C$. The $P$ tokens are partitioned into $N_w$ non-overlapping windows of size $P_w$ such that $P = N_w \cdot P_w$. Within each window $i$, multi-head self-attention is computed independently:

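For head $h$, with $X_i \in \mathbb{R}^{P_w \times C}$ the tokens of window $i$ and per-head width $C_h = C / N_h$:

$$Q_i^h = X_i W_Q^h, \quad K_i^h = X_i W_K^h, \quad V_i^h = X_i W_V^h$$

$$A_i^h = \mathrm{softmax}\left(\frac{Q_i^h (K_i^h)^\top}{\sqrt{C_h}}\right) V_i^h$$

Aggregated output per window:

$$Y_i = \mathrm{Concat}\left(A_i^1, \dots, A_i^{N_h}\right) W_O$$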

The attention complexity per block is $O(P \cdot P_w \cdot C)$, linear in $P$ as $P_w$ is a fixed window size.
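A minimal NumPy sketch of the window self-attention described in this subsection may help make the shapes concrete. It assumes the $P$ tokens are already ordered so that each consecutive run of $P_w$ rows forms one window, and it uses standard scaled dot-product attention. The function name `window_attention`, the weight matrices `Wq`, `Wk`, `Wv`, `Wo`, and the toy sizes are illustrative assumptions, not taken from the DaViT code.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(X, Wq, Wk, Wv, Wo, window_size, num_heads):
    """Multi-head self-attention computed independently inside each window.

    X has shape (P, C); rows are assumed to be ordered window by window.
    """
    P, C = X.shape
    assert P % window_size == 0 and C % num_heads == 0
    Nw, Ch = P // window_size, C // num_heads

    def split(T):
        # (P, C) -> (Nw, heads, Pw, Ch)
        return T.reshape(Nw, window_size, num_heads, Ch).transpose(0, 2, 1, 3)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Ch)  # (Nw, heads, Pw, Pw)
    out = softmax(scores) @ V                           # (Nw, heads, Pw, Ch)
    out = out.transpose(0, 2, 1, 3).reshape(P, C)       # concatenate heads
    return out @ Wo

# Toy sizes (hypothetical): 16 tokens, 8 channels, windows of 4 tokens, 2 heads.
rng = np.random.default_rng(0)
P, C, Pw, Nh = 16, 8, 4, 2
X = rng.standard_normal((P, C))
Wq, Wk, Wv, Wo = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(4))
Y = window_attention(X, Wq, Wk, Wv, Wo, Pw, Nh)

# Perturb one token in the first window and recompute.
X2 = X.copy()
X2[0] += 1.0
Y2 = window_attention(X2, Wq, Wk, Wv, Wo, Pw, Nh)
print(np.allclose(Y[Pw:], Y2[Pw:]))  # outputs of the other windows are untouched
```

Each score matrix has size $P_w \times P_w$ regardless of $P$, which is why the cost grows only linearly with the number of tokens.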

1.2 Channel Tokens and Grouped Self-Attention

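Transposing $X$ yields $X^\top \in \mathbb{R}^{C \times P}$, where each “channel token” is the $c$-th row. Instead of global attention (quadratic in $C$), DaViT divides the $C$ channels into $N_g$ groups of size $C_g$.

For each group $j$:

$$Q_j, K_j, V_j \in \mathbb{R}^{P \times C_g}$$

Single-head global attention is then applied within each group:

$$A_j = \mathrm{softmax}\left(\frac{Q_j^\top K_j}{\sqrt{C_g}}\right) \in \mathbb{R}^{C_g \times C_g}$$

$$Y_j = \left(A_j V_j^\top\right)^\top \in \mathbb{R}^{P \times C_g}$$

The outputs are concatenated:

$$Y = \mathrm{Concat}\left(Y_1, \dots, Y_{N_g}\right) \in \mathbb{R}^{P \times C}$$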

Channel attention thus models global context by letting each token aggregate information from all spatial positions, while grouping reduces the computational complexity in $C$.
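The grouped channel attention described in this subsection can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the reference implementation: the projections `Wq`, `Wk`, `Wv`, the $\sqrt{C_g}$ scaling, the function name `channel_group_attention`, and the toy sizes are all chosen here for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def channel_group_attention(X, Wq, Wk, Wv, num_groups):
    """Single-head attention over channel tokens, applied per channel group.

    X has shape (P, C). Inside a group the attention matrix is (Cg, Cg),
    so its size does not depend on the number of spatial positions P.
    """
    P, C = X.shape
    assert C % num_groups == 0
    Cg = C // num_groups
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                      # each (P, C)
    outputs = []
    for j in range(num_groups):
        cols = slice(j * Cg, (j + 1) * Cg)
        Qj, Kj, Vj = Q[:, cols], K[:, cols], V[:, cols]   # each (P, Cg)
        A = softmax(Qj.T @ Kj / np.sqrt(Cg))              # (Cg, Cg), sums over all P positions
        outputs.append((A @ Vj.T).T)                      # back to (P, Cg)
    return np.concatenate(outputs, axis=1)                # (P, C)

# Toy sizes (hypothetical): 16 positions, 8 channels, 2 groups of 4 channels.
rng = np.random.default_rng(0)
P, C, Ng = 16, 8, 2
X = rng.standard_normal((P, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
Y = channel_group_attention(X, Wq, Wk, Wv, Ng)

# Perturbing a single spatial position changes the output at every position.
X2 = X.copy()
X2[0] += 1.0
Y2 = channel_group_attention(X2, Wq, Wk, Wv, Ng)
rows_changed = np.abs(Y2 - Y).sum(axis=1) > 1e-9
print(rows_changed.all())
```

Because the product $Q_j^\top K_j$ sums over all $P$ positions, every output row depends on the whole image, which is the sense in which channel attention is global.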

2. Token Grouping and Complexity Analysis

DaViT's efficiency arises from structured grouping in both attention types. Let $P$ denote the spatial length, $C$ the channel dimension, $P_w = P / N_w$ the window size, and $C_g = C / N_g$ the channel group size.

| Attention Type | Pre-Grouping Complexity | Post-Grouping Complexity |
|---|---|---|
| Spatial attention | $O(P^2 C)$ | $O(P \cdot P_w \cdot C)$ |
| Channel attention | $O(P C^2)$ | $O(P \cdot C \cdot C_g)$ |

With fixed $P_w$ and $C_g$, both attention mechanisms scale linearly in $P$ and $C$, supporting efficient processing of high-resolution images and wide networks (Ding et al., 2022).
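To make the scaling argument concrete, the short sketch below counts leading-order multiply-accumulates for the two attention products only (the score matrix and the value aggregation), ignoring the linear projections. The function name `attention_costs` and the example sizes (a 56×56 token grid, 7×7 windows, groups of 32 channels) are assumptions chosen for illustration.

```python
def attention_costs(P, C, Pw, Cg):
    """Leading-order multiply-accumulate counts for the attention products.

    Only the score computation and the value aggregation are counted;
    the Q/K/V/output projections are deliberately left out.
    """
    return {
        "spatial_global": 2 * P * P * C,   # (P x P) scores over C channels
        "spatial_window": 2 * P * Pw * C,  # P/Pw windows, each costing 2 * Pw^2 * C
        "channel_global": 2 * P * C * C,   # (C x C) scores over P positions
        "channel_group": 2 * P * C * Cg,   # C/Cg groups, each costing 2 * Cg^2 * P
    }

# Hypothetical sizes: 56 x 56 tokens, 96 channels, 7 x 7 windows, groups of 32.
base = attention_costs(P=56 * 56, C=96, Pw=49, Cg=32)
doubled = attention_costs(P=2 * 56 * 56, C=96, Pw=49, Cg=32)

for name in base:
    print(f"{name:15s} {base[name]:>14,d}  x{doubled[name] / base[name]:.0f} when P doubles")
```

Doubling $P$ doubles both grouped costs but quadruples the ungrouped spatial cost, matching the pre- and post-grouping comparison above.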

3. Architectural Variants and Network Staging

DaViT’s architecture is arranged in four stages, each comprising patch embedding followed by several alternating dual-attention blocks. The model’s four principal variants are:

| Variant | C | L (per stage) | $N_h$/$N_g$ per stage | Params (M) | FLOPs (G) | ImageNet Top-1 (%) |
|---|---|---|---|---|---|---|
| DaViT-Tiny | 96 | 1,1,3,1 | 3,6,12,24 | 28.3 | 4.5 | 82.8 |
| DaViT-Small | 96 | 1,1,9,1 | 3,6,12,24 | 49.7 | 8.8 | 84.2 |
| DaViT-Base | 128 | 1,1,9,1 | 4,8,16,32 | 87.9 | 15.5 | 84.6 |
| DaViT-Giant | 384 | 1,1,12,3 | 12,24,48,96 | ~1,440 | 1,038 | 90.4* |

*pre-trained on 1.5B image-text pairs; Top-1 is on ImageNet-1K.

The numbers of window heads $N_h$ and channel groups $N_g$ are matched per stage. DaViT-Giant leverages large-scale weakly supervised training for maximal performance.
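The variant table can be captured as a small configuration object, which makes one regularity easy to check. The sketch below assumes that the channel width doubles at every stage, which the table does not state; under that assumption every variant ends up with 32 channels per head (and per channel group) in all four stages. The class name `DaViTConfig` and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DaViTConfig:
    name: str
    base_channels: int  # C in the table (width of the first stage)
    depths: tuple       # L per stage
    heads: tuple        # window heads = channel groups per stage

    def stage_channels(self):
        # Assumption (not stated in the table): width doubles at each stage.
        return tuple(self.base_channels * 2 ** i for i in range(len(self.depths)))

    def channels_per_head(self):
        return tuple(c // h for c, h in zip(self.stage_channels(), self.heads))

VARIANTS = [
    DaViTConfig("DaViT-Tiny", 96, (1, 1, 3, 1), (3, 6, 12, 24)),
    DaViTConfig("DaViT-Small", 96, (1, 1, 9, 1), (3, 6, 12, 24)),
    DaViTConfig("DaViT-Base", 128, (1, 1, 9, 1), (4, 8, 16, 32)),
    DaViTConfig("DaViT-Giant", 384, (1, 1, 12, 3), (12, 24, 48, 96)),
]

for cfg in VARIANTS:
    print(cfg.name, cfg.stage_channels(), cfg.channels_per_head())
```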

4. Training Protocols and Hyperparameters

4.1 ImageNet-1K Image Classification

  • 300 epochs, batch size 2048
  • AdamW optimizer, weight decay 0.05, gradient norm clip 1.0
  • Learning rate: triangular schedule with linear warmup to a peak value, followed by linear decay
  • Data augmentation and regularization mirror DeiT (excluding repeated augmentation and EMA)
  • Stochastic depth: rates of 0.1 (Tiny), 0.2 (Small), 0.4 (Base)
  • Training uses random crops of $224 \times 224$; evaluation uses a center crop
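The triangular learning-rate schedule listed above (linear warmup followed by linear decay) can be sketched in a few lines. The peak value, warmup length, and step counts used below are hypothetical placeholders chosen to show the shape of the schedule; they are not the settings used to train DaViT.

```python
def triangular_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if not 0 < warmup_steps < total_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Hypothetical values, chosen only to make the shape of the schedule visible.
schedule = [
    triangular_lr(s, total_steps=100, warmup_steps=10, peak_lr=1.0)
    for s in range(101)
]
print(schedule[0], schedule[5], schedule[10], schedule[55], schedule[100])
```

In practice such a function would be evaluated once per optimizer step and its value written into the optimizer's learning-rate setting.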

4.2 COCO 2017 Object Detection

  • DaViT backbones are integrated into RetinaNet and Mask R-CNN
  • Schedules: 1× (12 epochs) and 3× (36 epochs), with multi-scale training
  • AdamW optimizer, weight decay 0.05, stochastic depth as above
  • FLOPs reported at $800 \times 1280$ resolution

4.3 ADE20K Semantic Segmentation

  • UPerNet framework, $512 \times 512$ input
  • 160,000 iterations, batch size 16
  • Other hyperparameters consistent with established segmentation approaches, weight decay 0.05

5. Empirical Evaluation and Comparisons

DaViT demonstrates state-of-the-art performance across image recognition, detection, and segmentation tasks, outperforming Swin Transformer models given matched model size and FLOPs.

5.1 ImageNet-1K Classification

| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| DaViT-Tiny | 28.3 | 4.5 | 82.8 |
| Swin-Tiny | 28.3 | 4.5 | 81.2 |
| DaViT-Small | 49.7 | 8.8 | 84.2 |
| Swin-Small | 49.6 | 8.7 | 83.1 |
| DaViT-Base | 87.9 | 15.5 | 84.6 |
| Swin-Base | 87.8 | 15.4 | 83.4 |

With ImageNet-22K pretraining, DaViT-Base achieves 86.9% (vs. 86.4% for Swin-Base). DaViT-Huge (90.2%) and DaViT-Giant (90.4%) obtain top-tier results when pretrained on large-scale weakly supervised data.

5.2 COCO 2017 Detection, ADE20K Segmentation

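| Task | Model | $AP^b$ / mIoU | Improvement over Swin |
|---|---|---|---|
| COCO RetinaNet 3× | DaViT-Tiny | 46.5 | +1.5 |
| COCO RetinaNet 3× | DaViT-Small | 48.2 | +1.8 |
| COCO RetinaNet 3× | DaViT-Base | 48.7 | +2.9 |
| Mask R-CNN | DaViT-Tiny | 47.4 | +1.4 |
| Mask R-CNN | DaViT-Small | 49.5 | +1.0 |
| Mask R-CNN | DaViT-Base | 49.9 | +1.4 |
| ADE20K (mIoU) | DaViT-Tiny | 46.3 | +1.8 |
| ADE20K (mIoU) | DaViT-Small | 48.8 | +1.2 |
| ADE20K (mIoU) | DaViT-Base | 49.4 | +1.3 |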

DaViT consistently surpasses Swin Transformers in accuracy at comparable parameter counts and FLOPs on all evaluated tasks (Ding et al., 2022).
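The Swin baselines are not listed explicitly in the detection and segmentation table, but they follow by subtracting the reported improvement from each DaViT score. The snippet below does this arithmetic; the values are copied from the table, and the dictionary layout and the helper name `implied_baseline` are illustrative choices.

```python
# (DaViT score, reported improvement over Swin), copied from the table above.
results = {
    "COCO RetinaNet 3x": {"Tiny": (46.5, 1.5), "Small": (48.2, 1.8), "Base": (48.7, 2.9)},
    "Mask R-CNN": {"Tiny": (47.4, 1.4), "Small": (49.5, 1.0), "Base": (49.9, 1.4)},
    "ADE20K mIoU": {"Tiny": (46.3, 1.8), "Small": (48.8, 1.2), "Base": (49.4, 1.3)},
}

def implied_baseline(score, gain):
    # Round to one decimal to avoid floating-point noise such as 46.400000000000006.
    return round(score - gain, 1)

implied = {
    task: {size: implied_baseline(*pair) for size, pair in rows.items()}
    for task, rows in results.items()
}

for task, rows in implied.items():
    print(task, rows)
```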

6. Synthesis and Implications

DaViT’s combination of local spatial window attention and global channel group attention enables explicit modeling of both fine-grained local structures and holistic contextual dependencies. Because channel attention integrates spatial information globally within each token and spatial attention preserves local relationships, these mechanisms are complementary. The architectural strategy of alternating and grouping tokens ensures all attention remains linear in both spatial and channel dimensions, supporting scaling to larger images and models. This suggests extensibility to high-resolution vision tasks and large-scale pretrained regimes, as manifested in DaViT-Giant’s results on ImageNet-1K following weakly supervised training with 1.5B image-text pairs.
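The complementarity argument can be checked numerically with a stripped-down dual block. The sketch below is deliberately simplified and is not the DaViT block itself: it omits learned projections, normalization, and feed-forward layers, keeps only the two attention patterns with residual connections, and uses hypothetical names (`window_mix`, `channel_mix`, `dual_block`) and toy sizes.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def window_mix(X, Pw):
    # Projection-free single-head attention inside each window of Pw tokens.
    P, C = X.shape
    W = X.reshape(P // Pw, Pw, C)
    A = softmax(W @ W.transpose(0, 2, 1) / np.sqrt(C))   # (Nw, Pw, Pw)
    return (A @ W).reshape(P, C)

def channel_mix(X, Cg):
    # Projection-free attention over channel tokens inside each group of Cg channels.
    P, C = X.shape
    G = X.reshape(P, C // Cg, Cg).transpose(1, 0, 2)     # (Ng, P, Cg)
    A = softmax(G.transpose(0, 2, 1) @ G / np.sqrt(Cg))  # (Ng, Cg, Cg)
    Y = G @ A.transpose(0, 2, 1)                         # (Ng, P, Cg)
    return Y.transpose(1, 0, 2).reshape(P, C)

def dual_block(X, Pw, Cg):
    X = X + window_mix(X, Pw)      # local refinement
    return X + channel_mix(X, Cg)  # global aggregation

rng = np.random.default_rng(0)
P, C, Pw, Cg = 16, 8, 4, 4
X = rng.standard_normal((P, C))
X2 = X.copy()
X2[0] += 1.0  # perturb a single token in the first window

local_diff = (X2 + window_mix(X2, Pw)) - (X + window_mix(X, Pw))
global_diff = dual_block(X2, Pw, Cg) - dual_block(X, Pw, Cg)
changed_local = np.abs(local_diff).sum(axis=1) > 1e-9
changed_global = np.abs(global_diff).sum(axis=1) > 1e-9
print(changed_local.astype(int))
print(changed_global.astype(int))
```

After the window step alone the perturbation stays inside the first window; once the channel step is applied, every position is affected, which illustrates how the two attention types divide local and global modeling between them.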

DaViT demonstrates that a dual-attention transformer design, grounded in efficient grouping and alternation, can achieve superior trade-offs among computational cost, parameter count, and task accuracy across a range of modern vision benchmarks (Ding et al., 2022).

References (1)