Dual-Attention Vision Transformer (DaViT)

Updated 5 June 2026

Dual-Attention Vision Transformer (DaViT) is a model that combines spatial window and channel group self-attention for efficient local and global feature modeling.
It leverages a dual-attention mechanism to alternate between fine-grained local feature refinement and global context aggregation while scaling linearly with image resolution and channels.
DaViT achieves state-of-the-art trade-offs across image classification, object detection, and semantic segmentation benchmarks, outperforming comparable models like Swin Transformer.

Dual-Attention Vision Transformers (DaViT) constitute a vision transformer architecture designed to balance global context modeling and computational efficiency by leveraging two orthogonal types of self-attention: spatial window attention and channel group attention. This dual-attention scheme enables the architecture to alternate between fine-grained local feature refinement and global feature interactions, providing significant gains in image classification, object detection, and semantic segmentation benchmarks. DaViT achieves state-of-the-art trade-offs between accuracy, parameter count, and computational cost, while scaling linearly with both spatial resolution and channel dimension (Ding et al., 2022).

1. Dual-Attention Design and Mechanisms

DaViT alternates two distinct but complementary forms of self-attention within each block:

Spatial window attention: Operates over “spatial tokens,” where each token corresponds to a spatial location (e.g., an image patch).
Channel group attention: Operates over “channel tokens,” where each token corresponds to a channel (feature map dimension), enabling global spatial aggregation.

1.1 Spatial Tokens and Window Self-Attention

Given a feature map $X \in \mathbb{R}^{P \times C}$ , with $P$ spatial positions and $C$ channels, each spatial token is a row $x_p \in \mathbb{R}^C$ . The $P$ tokens are partitioned into $N_w$ non-overlapping windows of size $P_w$ such that $P = N_w \cdot P_w$ . Within each window $i$ , multi-head self-attention is computed independently:

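For head $h$ of $N_h$ heads, with $X_i \in \mathbb{R}^{P_w \times C}$ the tokens of window $i$ and $C_h = C / N_h$ the per-head width:

$Q_i^h = X_i W_Q^h, \quad K_i^h = X_i W_K^h, \quad V_i^h = X_i W_V^h$

$A_i^h = \mathrm{softmax}\left( Q_i^h (K_i^h)^{\top} / \sqrt{C_h} \right) V_i^h$

Aggregated output per window:

$Y_i = \mathrm{Concat}(A_i^1, \dots, A_i^{N_h}) \in \mathbb{R}^{P_w \times C}$

The complexity per block is $O(2 P P_w C + 4 P C^2)$, linear in $P$ as $P_w$ is a fixed window size.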
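The window attention described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the fused projection `W_qkv`, the output projection `W_o`, and the function name are assumptions, and the tokens are assumed to be ordered so that each window occupies a contiguous run of `P_w` rows.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(X, W_qkv, W_o, P_w, N_h):
    """Multi-head self-attention computed independently inside each window.

    X: (P, C) spatial tokens, window-major order (assumption).
    W_qkv: (C, 3C) fused query/key/value projection (assumption).
    W_o: (C, C) output projection (assumption).
    """
    P, C = X.shape
    assert P % P_w == 0 and C % N_h == 0
    N_w, C_h = P // P_w, C // N_h
    qkv = (X @ W_qkv).reshape(N_w, P_w, 3, N_h, C_h)
    q, k, v = np.moveaxis(qkv, 2, 0)                         # each (N_w, P_w, N_h, C_h)
    q, k, v = (t.transpose(0, 2, 1, 3) for t in (q, k, v))   # each (N_w, N_h, P_w, C_h)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(C_h))  # (N_w, N_h, P_w, P_w)
    out = attn @ v                                           # (N_w, N_h, P_w, C_h)
    out = out.transpose(0, 2, 1, 3).reshape(P, C)            # concatenate heads, flatten windows
    return out @ W_o

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))        # P = 16 tokens, C = 8 channels
W_qkv = rng.standard_normal((8, 24))
W_o = rng.standard_normal((8, 8))
Y = window_attention(X, W_qkv, W_o, P_w=4, N_h=2)
```

Each attention matrix has shape `P_w × P_w` rather than `P × P`, which is where the linear dependence on `P` comes from.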

1.2 Channel Tokens and Grouped Self-Attention

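Transposing $X$ yields $X^{\top} \in \mathbb{R}^{C \times P}$, where each “channel token” is the $c$th row. Instead of global attention (quadratic in $C$), DaViT divides the $C$ channels into $N_g$ groups of size $C_g = C / N_g$.

For each group $j$, the channel-wise slices of the projected queries, keys, and values are:

$Q_j, K_j, V_j \in \mathbb{R}^{P \times C_g}$

Single-head global attention is then applied within each group:

$A_j = \mathrm{softmax}\left( Q_j^{\top} K_j / \sqrt{C_g} \right) \in \mathbb{R}^{C_g \times C_g}$

$Y_j = A_j V_j^{\top} \in \mathbb{R}^{C_g \times P}$

The outputs are concatenated:

$Y = \mathrm{Concat}(Y_1, \dots, Y_{N_g})^{\top} \in \mathbb{R}^{P \times C}$

Channel attention thus models global context by letting each token aggregate information from all spatial positions, while grouping reduces the computational complexity in $C$.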
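A minimal NumPy sketch of channel group attention, under the same caveats as any illustration: the fused projection `W_qkv`, the output projection `W_o`, and the function name are assumptions rather than the authors' code. The attention matrix is `C_g × C_g` per group, and every entry is a sum over all `P` spatial positions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def channel_group_attention(X, W_qkv, W_o, N_g):
    """Single-head attention over channel tokens, computed per channel group.

    X: (P, C) spatial tokens.
    W_qkv: (C, 3C) fused projection and W_o: (C, C) output projection (assumptions).
    """
    P, C = X.shape
    assert C % N_g == 0
    C_g = C // N_g
    qkv = (X @ W_qkv).reshape(P, 3, N_g, C_g)
    q, k, v = (qkv[:, i].transpose(1, 0, 2) for i in range(3))  # each (N_g, P, C_g)
    attn = softmax(q.transpose(0, 2, 1) @ k / np.sqrt(C_g))     # (N_g, C_g, C_g)
    out = attn @ v.transpose(0, 2, 1)                           # (N_g, C_g, P)
    out = out.reshape(C, P).T                                   # concatenate groups -> (P, C)
    return out @ W_o

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))        # P = 16 tokens, C = 8 channels
W_qkv = rng.standard_normal((8, 24))
W_o = rng.standard_normal((8, 8))
Y = channel_group_attention(X, W_qkv, W_o, N_g=2)
```

Because the product `q^T k` contracts over the spatial axis, changing a single spatial token changes the attention weights used for every position.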

2. Token Grouping and Complexity Analysis

DaViT's efficiency arises from structured grouping in both attention types. Let $P$ denote the spatial length, $C$ the channel dimension, $P_w = P / N_w$ the window size, and $C_g = C / N_g$ the channel group size.

Attention Type	Pre-Grouping Complexity	Post-Grouping Complexity
Spatial attention	$O(2 P^2 C + 4 P C^2)$	$O(2 P P_w C + 4 P C^2)$
Channel attention	$O(6 P C^2)$	$O(2 P C C_g + 4 P C^2)$

With fixed $P_w$ and $C_g$, the attention computation in both mechanisms scales linearly in $P$ and $C$ (the shared $4 P C^2$ term comes from the linear projections), supporting efficient processing of high-resolution images and wide networks (Ding et al., 2022).
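The effect of grouping on cost can be checked with a few lines of arithmetic. The sketch below counts multiply-accumulates using the standard attention cost model (two matrix products for the attention itself plus four `C × C` projections). The concrete sizes, `P_w = 49` and `C_g = 32`, are illustrative assumptions rather than values taken from the text.

```python
def spatial_cost(P, C, P_w=None):
    """Return (attention term, projection term); P_w=None means ungrouped attention."""
    span = P if P_w is None else P_w
    return 2 * P * span * C, 4 * P * C * C

def channel_cost(P, C, C_g=None):
    """Return (attention term, projection term); C_g=None means ungrouped attention."""
    span = C if C_g is None else C_g
    return 2 * P * C * span, 4 * P * C * C

P, C = 56 * 56, 96      # hypothetical feature-map size and width
P_w, C_g = 49, 32       # hypothetical window size and channel group size

# Doubling P doubles the windowed attention term but quadruples the ungrouped one.
windowed_ratio = spatial_cost(2 * P, C, P_w)[0] / spatial_cost(P, C, P_w)[0]
full_ratio = spatial_cost(2 * P, C)[0] / spatial_cost(P, C)[0]

# Doubling C doubles the grouped channel attention term but quadruples the ungrouped one.
grouped_ratio = channel_cost(P, 2 * C, C_g)[0] / channel_cost(P, C, C_g)[0]
ungrouped_ratio = channel_cost(P, 2 * C)[0] / channel_cost(P, C)[0]
```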

3. Architectural Variants and Network Staging

DaViT’s architecture is arranged in four stages, each comprising patch embedding followed by several alternating dual-attention blocks. The model’s four principal variants are:

Variant	C	L (per stage)	$N_h$ / $N_g$ per stage	Params (M)	FLOPs (G)	ImageNet Top-1 (%)
DaViT-Tiny	96	1,1,3,1	3,6,12,24	28.3	4.5	82.8
DaViT-Small	96	1,1,9,1	3,6,12,24	49.7	8.8	84.2
DaViT-Base	128	1,1,9,1	4,8,16,32	87.9	15.5	84.6
DaViT-Giant	384	1,1,12,3	12,24,48,96	~1,440	1,038	90.4*

*pre-trained on 1.5B image-text pairs; Top-1 is on ImageNet-1K.

The number of window heads $N_h$ and the number of channel groups $N_g$ are matched per stage. DaViT-Giant leverages large-scale weakly supervised training for maximal performance.
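The variant table can be written down as a small configuration structure. The sketch below is illustrative only: the class and field names are hypothetical, and the stage widths assume that the channel count doubles at every stage, which the table does not state. Under that assumption, every head and every channel group is 32 channels wide in all four variants.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DaViTVariant:           # hypothetical container, not an official API
    name: str
    base_channels: int        # column C
    depths: tuple             # column L (per stage)
    heads_groups: tuple       # column N_h / N_g per stage

VARIANTS = [
    DaViTVariant("DaViT-Tiny", 96, (1, 1, 3, 1), (3, 6, 12, 24)),
    DaViTVariant("DaViT-Small", 96, (1, 1, 9, 1), (3, 6, 12, 24)),
    DaViTVariant("DaViT-Base", 128, (1, 1, 9, 1), (4, 8, 16, 32)),
    DaViTVariant("DaViT-Giant", 384, (1, 1, 12, 3), (12, 24, 48, 96)),
]

def stage_channels(v):
    # ASSUMPTION: width doubles at each of the four stages.
    return tuple(v.base_channels * 2 ** i for i in range(4))

def channels_per_head(v):
    # Per-head (equivalently per-group) width at each stage.
    return tuple(c // n for c, n in zip(stage_channels(v), v.heads_groups))

for v in VARIANTS:
    print(v.name, stage_channels(v), channels_per_head(v))
```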

4. Training Protocols and Hyperparameters

4.1 ImageNet-1K Image Classification

300 epochs, batch size 2048
AdamW optimizer, weight decay 0.05, gradient norm clip 1.0
Learning rate: triangular schedule with linear warmup and decay, peak $P$ 5
Data augmentation and regularization mirror DeiT (excluding repeated augmentation and EMA)
Stochastic depth: rates of 0.1 (Tiny), 0.2 (Small), 0.4 (Base)
Training employs random crop to $224 \times 224$, evaluation uses center crop
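A triangular schedule with linear warmup and linear decay, as listed above, can be sketched as a pure function of the step index. The function name, the warmup length, and the peak value used in the example are placeholders chosen for illustration, not the settings used to train DaViT.

```python
def triangular_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear ramp from 0 to peak_lr, then linear decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    frac = (step - warmup_steps) / decay_steps   # 0 at the peak, 1 at the end
    return peak_lr + (min_lr - peak_lr) * frac

# Placeholder values for illustration only.
TOTAL, WARMUP, PEAK = 1000, 100, 1e-3
schedule = [triangular_lr(s, TOTAL, WARMUP, PEAK) for s in range(TOTAL + 1)]
```

In a training loop the returned value would be written into the optimizer's learning rate before each update.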

4.2 COCO 2017 Object Detection

DaViT backbones are integrated into RetinaNet and Mask R-CNN
Schedules: 1× (12 epochs), 3× (36 epochs), multi-scale training with image side $P$ 7
AdamW optimizer, initial LR $P$ 8, weight decay 0.05, stochastic depth as above
FLOPs reported at $800 \times 1280$ resolution

4.3 ADE20K Semantic Segmentation

UPerNet framework, $512 \times 512$ input
160,000 iterations, batch size 16
Other hyperparameters consistent with established segmentation approaches, weight decay 0.05

5. Empirical Evaluation and Comparisons

DaViT demonstrates state-of-the-art performance across image recognition, detection, and segmentation tasks, outperforming Swin Transformer models given matched model size and FLOPs.

5.1 ImageNet-1K Classification

Model	Params (M)	FLOPs (G)	Top-1 (%)
DaViT-Tiny	28.3	4.5	82.8
Swin-Tiny	28.3	4.5	81.2
DaViT-Small	49.7	8.8	84.2
Swin-Small	49.6	8.7	83.1
DaViT-Base	87.9	15.5	84.6
Swin-Base	87.8	15.4	83.4

With ImageNet-22K pretraining, DaViT-Base achieves 86.9% (vs. Swin-Large 86.4%). With pretraining on large-scale weakly supervised data, DaViT-Huge reaches 90.2% and DaViT-Giant reaches 90.4%.
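The margins implied by the ImageNet-1K table can be computed directly from its rows. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
# (params in M, FLOPs in G, top-1 in %) copied from the ImageNet-1K table.
RESULTS = {
    "Tiny":  {"davit": (28.3, 4.5, 82.8), "swin": (28.3, 4.5, 81.2)},
    "Small": {"davit": (49.7, 8.8, 84.2), "swin": (49.6, 8.7, 83.1)},
    "Base":  {"davit": (87.9, 15.5, 84.6), "swin": (87.8, 15.4, 83.4)},
}

top1_gain = {}
extra_flops = {}
for size, row in RESULTS.items():
    davit, swin = row["davit"], row["swin"]
    top1_gain[size] = round(davit[2] - swin[2], 1)     # percentage points
    extra_flops[size] = round(davit[1] - swin[1], 1)   # GFLOPs

print(top1_gain)
print(extra_flops)
```

The accuracy gains of 1.1 to 1.6 points come at a cost difference of at most 0.1 GFLOPs.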

5.2 COCO 2017 Detection, ADE20K Segmentation

Task	Model	AP$^b$ / mIoU	Improvement over Swin
COCO RetinaNet 3×	DaViT-Tiny	46.5	+1.5
	DaViT-Small	48.2	+1.8
	DaViT-Base	48.7	+2.9
Mask R-CNN 3×	DaViT-Tiny	47.4	+1.4
	DaViT-Small	49.5	+1.0
	DaViT-Base	49.9	+1.4
ADE20K (mIoU)	DaViT-Tiny	46.3	+1.8
	DaViT-Small	48.8	+1.2
	DaViT-Base	49.4	+1.3

DaViT consistently surpasses Swin Transformers in accuracy at comparable parameter counts and FLOPs on all evaluated tasks (Ding et al., 2022).

6. Synthesis and Implications

DaViT’s combination of local spatial window attention and global channel group attention enables explicit modeling of both fine-grained local structures and holistic contextual dependencies. Because channel attention integrates spatial information globally within each token and spatial attention preserves local relationships, these mechanisms are complementary. The architectural strategy of alternating and grouping tokens ensures all attention remains linear in both spatial and channel dimensions, supporting scaling to larger images and models. This suggests extensibility to high-resolution vision tasks and large-scale pretrained regimes, as manifested in DaViT-Giant’s results on ImageNet-1K following weakly supervised training with 1.5B image-text pairs.
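The complementarity argument can be checked numerically by perturbing one token and recording which output tokens change. The sketch below is deliberately stripped down and is not the DaViT block itself: it uses a single head, identity projections, and hypothetical function names, so that only the attention pattern matters.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def window_attn(X, P_w):
    # Identity projections (simplifying assumption): Q = K = V = X within each window.
    P, C = X.shape
    Xw = X.reshape(P // P_w, P_w, C)
    A = softmax(Xw @ Xw.transpose(0, 2, 1) / np.sqrt(C))       # (N_w, P_w, P_w)
    return (A @ Xw).reshape(P, C)

def channel_attn(X, C_g):
    # Identity projections (simplifying assumption), one attention map per channel group.
    P, C = X.shape
    Xg = X.reshape(P, C // C_g, C_g).transpose(1, 0, 2)        # (N_g, P, C_g)
    A = softmax(Xg.transpose(0, 2, 1) @ Xg / np.sqrt(C_g))     # (N_g, C_g, C_g)
    return (A @ Xg.transpose(0, 2, 1)).reshape(C, P).T         # (P, C)

def affected_tokens(fn, X, idx, **kw):
    """Indices of output tokens that change when input token idx is perturbed."""
    X2 = X.copy()
    X2[idx] += 1.0
    changed = np.abs(fn(X2, **kw) - fn(X, **kw)).max(axis=1) > 1e-9
    return np.flatnonzero(changed)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))
local = affected_tokens(window_attn, X, 0, P_w=4)      # confined to the first window
global_ = affected_tokens(channel_attn, X, 0, C_g=4)   # reaches every position
```

Window attention confines the effect of the perturbation to the window containing the token, while channel attention propagates it to all positions, which is the sense in which the two are complementary.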

DaViT demonstrates that a dual-attention transformer design, grounded in efficient grouping and alternation, can achieve superior trade-offs among computational cost, parameter count, and task accuracy across a range of modern vision benchmarks (Ding et al., 2022).

References (1)

Ding et al. (2022). DaViT: Dual Attention Vision Transformers.
