Multi-Scale Weight-Sharing Window Attention
- Multi-scale weight-sharing window attention is a novel technique that partitions transformer heads into groups with varying window sizes to capture both local and global features.
- It employs a dynamic weight-sharing module that fuses multi-scale representations adaptively, ensuring efficient computation and robust hierarchical context capture.
- Empirical benchmarks on ImageNet, ADE20K, and COCO confirm consistent performance improvements over standard window attention methods with minimal additional resource cost.
Multi-scale weight-sharing window attention is a mechanism within vision transformers that generalizes local windowed self-attention by partitioning attention heads into groups operating on windows of multiple spatial scales, then dynamically fusing these multi-scale representations via learnable weight-sharing modules. This approach, notably implemented in the Dynamic Window Vision Transformer (DW-ViT) and closely related to the large-window attention of Lawin Transformer, captures hierarchical context efficiently and yields consistent empirical improvements on image classification, segmentation, and detection benchmarks (Ren et al., 2022; Yan et al., 2022).
1. Mathematical Formulation of Multi-Scale Window Attention
Let $X \in \mathbb{R}^{H \times W \times C}$ be the input feature map. In DW-ViT, the $h$ attention heads are split into $S$ groups; group $i$ operates with window size $w_i$. Each group contains $h/S$ heads, and the per-head hidden dimension is $d_k = C/h$. The feature map is reshaped to expose the head dimension and divided along it into $S$ disjoint chunks $X_1, \dots, X_S$ with $X_i \in \mathbb{R}^{H \times W \times C/S}$. Each $X_i$ is partitioned into non-overlapping windows of size $w_i \times w_i$, yielding $HW/w_i^2$ windows per scale.
Within window $j$ of head group $i$, the projections are computed as
$$Q_{ij} = X_{ij} W_i^{Q}, \qquad K_{ij} = X_{ij} W_i^{K}, \qquad V_{ij} = X_{ij} W_i^{V},$$
where $X_{ij} \in \mathbb{R}^{w_i^2 \times C/S}$ collects the tokens of window $j$.
Attention is computed for each scale using a relative position bias:
$$\mathrm{Attn}(Q_{ij}, K_{ij}, V_{ij}) = \mathrm{Softmax}\!\left(\frac{Q_{ij} K_{ij}^{\top}}{\sqrt{d_k}} + B_i\right) V_{ij},$$
where $B_i \in \mathbb{R}^{w_i^2 \times w_i^2}$ indexes relative spatial displacements within the window from the learnable parameter tensor $\hat{B}_i \in \mathbb{R}^{(2w_i-1) \times (2w_i-1)}$.
Post-attention, the per-scale outputs $Y_i$ (un-windowed back to $\mathbb{R}^{H \times W \times C/S}$) are concatenated along the channel dimension:
$$Y_{\mathrm{MSA}} = \mathrm{Concat}(Y_1, \dots, Y_S) \in \mathbb{R}^{H \times W \times C}.$$
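The per-scale computation maps directly onto a standard windowed-attention implementation. The PyTorch sketch below covers one head group (one window size); it is illustrative rather than the official DW-ViT code, the helper and class names (`window_partition`, `ScaleGroupWindowAttention`) are assumptions, and it assumes $H$ and $W$ are divisible by the window size.

```python
# Minimal sketch (not the official DW-ViT code) of one head group's windowed
# attention with a relative position bias, following the formulation above.
import torch
import torch.nn as nn


def window_partition(x, w):
    """Split (B, H, W, C) into non-overlapping w x w windows -> (B*nW, w*w, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)


def window_reverse(windows, w, H, W):
    """Inverse of window_partition: (B*nW, w*w, C) -> (B, H, W, C)."""
    B = windows.shape[0] // ((H // w) * (W // w))
    x = windows.view(B, H // w, W // w, w, w, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class ScaleGroupWindowAttention(nn.Module):
    """Windowed multi-head self-attention for one scale group (window size w_i)."""

    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.heads, self.w = num_heads, window_size
        self.scale = (dim // num_heads) ** -0.5          # 1 / sqrt(d_k)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per relative displacement, shared by all windows.
        self.rel_bias = nn.Parameter(
            torch.zeros((2 * window_size - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"))
        rel = coords.flatten(1)[:, :, None] - coords.flatten(1)[:, None, :]
        rel = (rel + window_size - 1).permute(1, 2, 0)
        self.register_buffer(
            "bias_index", rel[..., 0] * (2 * window_size - 1) + rel[..., 1])

    def forward(self, x):                                 # x: (B, H, W, C_group)
        B, H, W, C = x.shape
        win = window_partition(x, self.w)                 # (B*nW, w*w, C)
        qkv = self.qkv(win).reshape(win.shape[0], -1, 3, self.heads,
                                    C // self.heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q * self.scale) @ k.transpose(-2, -1)
        attn = attn + self.rel_bias[self.bias_index].permute(2, 0, 1)
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(win.shape[0], -1, C)
        return window_reverse(self.proj(out), self.w, H, W)
```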
2. Dynamic Weight-Sharing Fusion Mechanism
After per-scale attention, dynamic multi-scale fusion (“DMSW” module) selects and mixes scale features adaptively:
- Fuse: global average pooling $y_{gp} = \mathrm{AvgPool}(Y_{\mathrm{MSA}}) \in \mathbb{R}^{C}$, followed by non-linear projection layers $y_{\mathrm{fuse}} = \mathrm{GELU}(W_2\,\mathrm{GELU}(W_1\, y_{gp}))$.
- Select: soft-attention scores over scales, $\alpha_i = \dfrac{\exp(v_i^{\top} y_{\mathrm{fuse}})}{\sum_{j=1}^{S}\exp(v_j^{\top} y_{\mathrm{fuse}})}$, $i = 1,\dots,S$.
- Weighted sum & residual: $Y_{\mathrm{sel}} = \sum_{i=1}^{S} \alpha_i Y_i$, plus a residual branch $\mathrm{res} = W_4\,\mathrm{GELU}(W_3\, y_{\mathrm{fuse}})$ broadcast over the spatial dimensions, giving $Y_2 = Y_{\mathrm{sel}} + \mathrm{res}$.
Parameters ($W_1, W_2, W_3, W_4$ and the selection vectors $v_i$) are shared over spatial positions but not across scales. By construction, $\sum_{i=1}^{S} \alpha_i = 1$.
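A minimal sketch of the fuse/select/weighted-sum steps is given below. It assumes each per-scale output has already been brought to the full channel width $C$ (rather than $C/S$); the class name `DynamicScaleFusion`, the reduction ratio, and the pooled-summary choice are assumptions, and this is not the released DMSW implementation.

```python
# Simplified sketch of DMSW-style dynamic scale fusion (illustrative only).
import torch
import torch.nn as nn


class DynamicScaleFusion(nn.Module):
    """Pool a summary of the multi-scale features, score each scale, and return
    a convex combination of the per-scale features plus a residual branch."""

    def __init__(self, dim, num_scales, reduction=4):
        super().__init__()
        hidden = max(dim // reduction, 8)
        self.fuse = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),   # W1, W2
                                  nn.Linear(hidden, hidden), nn.GELU())
        self.select = nn.Linear(hidden, num_scales)                    # rows act as v_i
        self.residual = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                      nn.Linear(hidden, dim))          # W3, W4

    def forward(self, scale_feats):
        # scale_feats: list of S tensors, each (B, H, W, C); this width-C
        # assumption keeps the sketch's shapes consistent.
        stacked = torch.stack(scale_feats, dim=1)        # (B, S, H, W, C)
        y_gp = stacked.sum(dim=1).mean(dim=(1, 2))       # pooled summary (B, C)
        y_fuse = self.fuse(y_gp)                         # (B, hidden)
        alpha = self.select(y_fuse).softmax(dim=-1)      # (B, S), sums to 1
        y_sel = (alpha[:, :, None, None, None] * stacked).sum(dim=1)   # (B, H, W, C)
        res = self.residual(y_fuse)[:, None, None, :]    # broadcast over H, W
        return y_sel + res
```

Because the scale scores pass through a softmax, the fusion is a convex combination, so the module can smoothly interpolate between small-window and large-window behaviour on a per-image basis.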
3. Integration into Transformer Architectures
Each block of DW-ViT replaces standard MSA with the MSW-MSA and DMSW fusion. The assignment of window sizes to head groups is resolution dependent, with empirically chosen lists (e.g., [7,14,21]) adapting at each stage (Ren et al., 2022). DMSW combines multi-scale tokens and outputs them after residual fusion.
Example block pseudocode:
```
Input: X ∈ ℝ^{H×W×C}

1) X1 = LayerNorm(X)

2) # Multi-scale window MSA
   for i in 1…S:
       Xi         = project_heads(X1, heads = ((i−1)·h/S)…(i·h/S))
       Wi_windows = window_partition(Xi, window_size = w_i)
       for each window j:
           Q, K, V = Wi_windows[j] · (W^Q_i, W^K_i, W^V_i)
           A_ij    = Softmax(Q·Kᵀ / √d_k + B_i) · V
       Yi = window_reverse({A_ij}, H, W)
   Y_MSA = Concat_i(Yi)                      # ℝ^{H×W×C}

3) # Dynamic fusion (DMSW)
   y_gp   = avg_pool(Y_MSA)
   y_fuse = GELU(W₂ · GELU(W₁ · y_gp))
   for i in 1…S:
       α_i = exp(v_iᵀ y_fuse) / ∑_j exp(v_jᵀ y_fuse)
   Y_sel = ∑_i α_i · Y_i
   res   = W₄ · GELU(W₃ · y_fuse)            # reshaped/broadcast to H×W×C
   Y2    = Y_sel + res

4) X2    = X + Y2
5) X_out = X2 + MLP(LayerNorm(X2))

return X_out
```
In Lawin Transformer, large-window attention is employed within a spatial pyramid pooling framework (LawinASPP), using branches with progressively enlarged receptive fields together with a global pooling branch on top of hierarchical ViT encoders (Yan et al., 2022).
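For contrast with MSW-MSA, a simplified sketch of large-window attention is shown below. It assumes queries come from a local patch window while keys and values come from a larger surrounding window that is average-pooled back to the patch resolution and passed through a position-mixing MLP; the class name, `patch_size`, and `context_ratio` defaults are illustrative, and this is not the released Lawin code.

```python
# Simplified sketch of large-window attention: local query window, pooled
# large-window context as keys/values, plus a position-mixing MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LargeWindowAttention(nn.Module):
    def __init__(self, dim, num_heads, patch_size=8, context_ratio=2):
        super().__init__()
        self.p, self.r = patch_size, context_ratio
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Token-mixing MLP over pooled context positions (restores spatial detail).
        self.pos_mix = nn.Linear(patch_size * patch_size, patch_size * patch_size)

    def forward(self, x):                    # x: (B, C, H, W); H, W divisible by p
        B, C, H, W = x.shape
        p, ctx = self.p, self.p * self.r
        # Local query windows: (B*nW, p*p, C)
        q = F.unfold(x, kernel_size=p, stride=p)
        q = q.transpose(1, 2).reshape(-1, C, p * p).transpose(1, 2)
        # Large context windows centred on each patch, pooled back to p x p tokens.
        pad = (ctx - p) // 2
        kv = F.unfold(F.pad(x, [pad] * 4), kernel_size=ctx, stride=p)
        kv = kv.transpose(1, 2).reshape(-1, C, ctx, ctx)
        kv = F.adaptive_avg_pool2d(kv, p).flatten(2)      # (B*nW, C, p*p)
        kv = self.pos_mix(kv).transpose(1, 2)             # (B*nW, p*p, C)
        out, _ = self.attn(q, kv, kv)                     # (B*nW, p*p, C)
        # Fold the per-window outputs back into a (B, C, H, W) feature map.
        out = out.transpose(1, 2).reshape(B, -1, C * p * p).transpose(1, 2)
        return F.fold(out, output_size=(H, W), kernel_size=p, stride=p)
```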
4. Computational Complexity and Parameterization
For DW-ViT MSW-MSA, the computational cost of the attention at each block is
$$\Omega(\mathrm{MSW\text{-}MSA}) = 4HWC^2 + \frac{2HWC}{S}\sum_{i=1}^{S} w_i^2,$$
compared with $\Omega(\mathrm{W\text{-}MSA}) = 4HWC^2 + 2w^2HWC$ for a single window size $w$.
Relative to single-window attention, the overhead lies entirely in the attention term, scaling as $\frac{1}{S}\sum_i w_i^2$ versus $w^2$, while parameters grow only by the small DMSW projection layers and the per-scale relative bias tensors $B_i$. In Lawin Transformer, each large-window branch enlarges the attended context, but pooling that context back to the query-window resolution keeps the per-branch cost close to that of local-window attention, so the total cost remains comparable to stacked local windows for practical window sizes (Yan et al., 2022).
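The snippet below simply evaluates the two cost expressions for a hypothetical stage configuration; the dimensions and window list are illustrative, not taken from either paper.

```python
# Illustrative evaluation of the per-block cost formulas above.

def wmsa_flops(H, W, C, w):
    """Single-window W-MSA: 4*H*W*C^2 (projections) + 2*w^2*H*W*C (attention)."""
    return 4 * H * W * C**2 + 2 * w**2 * H * W * C

def msw_msa_flops(H, W, C, windows):
    """Multi-scale MSW-MSA: 4*H*W*C^2 + (2*H*W*C/S) * sum_i w_i^2."""
    S = len(windows)
    return 4 * H * W * C**2 + 2 * H * W * C * sum(w**2 for w in windows) // S

H, W, C = 28, 28, 192                        # hypothetical mid-network stage
single = wmsa_flops(H, W, C, w=7)
multi = msw_msa_flops(H, W, C, windows=[7, 14])
print(f"W-MSA:   {single / 1e6:.1f} MFLOPs")
print(f"MSW-MSA: {multi / 1e6:.1f} MFLOPs ({multi / single:.2f}x)")
```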
5. Empirical Benchmark Performance
Multi-scale weight-sharing window attention consistently outperforms single-scale window transformers at the cost of a small increase in parameters and FLOPs.
| Model | Params (M) | FLOPs (G) | ImageNet Top-1 (%) | ADE20K mIoU (%) | COCO AP_box / AP_mask |
|---|---|---|---|---|---|
| Swin-T [std] | 28 | 4.5 | 81.3 | 44.5 | 46.0 / 41.6 |
| DW-T | 30 | 5.2 | 82.0 (+0.7) | 45.7 (+1.2) | 46.7 / 42.4 (+0.7/+0.8) |
| Swin-B [std] | 88 | 15.4 | 83.3 | 48.1 | 48.5 / 43.4 |
| DW-B | 91 | 17.0 | 83.8 (+0.5) | 48.7 (+0.6) | 49.2 / 44.0 (+0.7/+0.6) |
Ablations of Lawin Transformer with large-window attention and multi-scale fusion indicate that each multi-scale branch and the global pooling branch contribute on the order of 0.4 absolute mIoU (Yan et al., 2022). Removing the position-mixing MLPs causes clear performance drops, underscoring their role in restoring fine spatial detail.
6. Relationship to Large-Window Attention and Multi-scale Pooling
Large-window attention within Lawin Transformer (Yan et al., 2022) lets each local query window attend to a larger surrounding context window, with average pooling and a token-mixing MLP restoring spatial fidelity, yielding effective multi-scale context representations. Weights are shared across all windows at a given scale, while distinct trainable branches handle the different scales. The LawinASPP decoder generalizes atrous spatial pyramid pooling to multi-scale transformer attention, achieving efficient semantic segmentation.
In Lawin's architecture, multi-level features from the hierarchical transformer encoder are aggregated at several strides and fed through branches whose receptive fields match multiple scales plus a global-context branch; the branch outputs are fused by simple concatenation and a linear projection.
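A skeleton of such a decoder head is sketched below; for brevity, plain pooling branches stand in for the large-window attention branches, and the class name `MultiScaleContextHead` and pooling ratios are illustrative rather than Lawin's actual configuration.

```python
# Skeleton sketch of a LawinASPP-style head: several receptive-field branches
# plus a global-context branch, fused by concatenation and a 1x1 projection.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleContextHead(nn.Module):
    def __init__(self, dim, out_dim, pool_sizes=(2, 4, 8)):
        super().__init__()
        self.identity = nn.Conv2d(dim, out_dim, 1)
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AvgPool2d(p, p), nn.Conv2d(dim, out_dim, 1))
            for p in pool_sizes)
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, out_dim, 1))
        self.fuse = nn.Conv2d(out_dim * (len(pool_sizes) + 2), out_dim, 1)

    def forward(self, x):                 # x: (B, dim, H, W) aggregated encoder features
        H, W = x.shape[-2:]
        feats = [self.identity(x)]
        for branch in self.branches:      # multi-scale context branches
            feats.append(F.interpolate(branch(x), size=(H, W),
                                       mode="bilinear", align_corners=False))
        feats.append(F.interpolate(self.global_branch(x), size=(H, W),
                                   mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))   # concatenation + projection
```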
7. Context and Implications for Vision Transformers
Multi-scale weight-sharing window attention enables transformer backbones to capture both local and global dependencies with flexible windowing and efficient computation. The construction supports seamless integration into Swin-style architectures, scalable adaptation to input resolutions, and consistent empirical gains across image understanding tasks. This suggests that dynamic multi-scale modeling with adaptive fusion offers a generic replacement for fixed window transformers and convolutional pooling layers, while maintaining resource efficiency and deployment compatibility (Ren et al., 2022, Yan et al., 2022).