
Multi-Scale Weight-Sharing Window Attention

Updated 16 December 2025
  • Multi-scale weight-sharing window attention is a novel technique that partitions transformer heads into groups with varying window sizes to capture both local and global features.
  • It employs a dynamic weight-sharing module that fuses multi-scale representations adaptively, ensuring efficient computation and robust hierarchical context capture.
  • Empirical benchmarks on ImageNet, ADE20K, and COCO confirm consistent performance improvements over standard window attention methods with minimal additional resource cost.

Multi-scale weight-sharing window attention is a mechanism within vision transformers that generalizes local windowed self-attention by partitioning attention heads into groups operating on windows of multiple spatial scales, then dynamically fusing these multi-scale representations via learnable weight-sharing modules. This approach, notably implemented in the Dynamic Window Vision Transformer (DW-ViT) and closely related to Lawin Transformer's large-window attention, provides enhanced capability for capturing hierarchical context with computational efficiency, yielding consistent empirical improvements on image classification, segmentation, and detection benchmarks (Ren et al., 2022, Yan et al., 2022).

1. Mathematical Formulation of Multi-Scale Window Attention

Let $X\in\mathbb{R}^{H\times W\times C}$ be the input feature map. In DW-ViT, the $h$ attention heads are split into $S$ groups; each group $i$ operates with window size $w_i$. Each group contains $H_i = h/S$ heads, and the per-head hidden dimension is $d_k = C/h$. The feature map is reshaped as $\hat{X}\in\mathbb{R}^{h\times H\times W\times d_k}$ and divided along the head dimension into $S$ disjoint chunks $\hat{X}_i\in\mathbb{R}^{H\times W\times (H_i d_k)}$. Each $\hat{X}_i$ is partitioned into non-overlapping windows of size $w_i\times w_i$ ($m_i = w_i^2$ tokens per window, $N_i \approx \lceil H/w_i \rceil\cdot\lceil W/w_i \rceil$ windows).
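A shape-level sketch of this head-group split and window partition in PyTorch (helper names and the specific sizes are illustrative; the window sizes here are chosen to divide the feature map evenly, whereas DW-ViT's lists such as [7, 14, 21] generally require padding):

```python
import torch

def window_partition(x, w):
    """(B, H, W, D) -> (B * num_windows, w*w, D) of non-overlapping w x w windows."""
    B, H, W, D = x.shape
    x = x.view(B, H // w, w, W // w, w, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, D)

# S = 3 head groups over C = 96 channels, i.e. H_i * d_k = 32 channels per group.
x = torch.randn(2, 56, 56, 96)                       # (B, H, W, C)
window_sizes = [7, 14, 28]
groups = torch.chunk(x, len(window_sizes), dim=-1)   # S disjoint channel chunks \hat{X}_i
for w_i, g in zip(window_sizes, groups):
    windows = window_partition(g, w_i)               # (B * N_i, m_i, H_i * d_k)
    print(w_i, windows.shape)   # 7 -> (128, 49, 32); 14 -> (32, 196, 32); 28 -> (8, 784, 32)
```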

Within each window $j$ of head group $i$, query, key, and value projections are computed as
$$Q_{i,j} = \hat{X}_{i,j} W^Q_i,\quad K_{i,j} = \hat{X}_{i,j} W^K_i,\quad V_{i,j} = \hat{X}_{i,j} W^V_i,\quad W^{Q,K,V}_i \in \mathbb{R}^{(H_i d_k)\times(H_i d_k)}.$$

Attention is computed for each scale with a learned relative position bias:
$$A_i = \mathrm{Softmax}\!\Big( \frac{Q_i K_i^\top}{\sqrt{d_k}} + B_i \Big)\, V_i,$$
where $B_i\in\mathbb{R}^{m_i\times m_i}$ indexes relative spatial displacements within the window from the parameter tensor $\hat{B}_i\in\mathbb{R}^{(2w_i-1)\times(2w_i-1)}$.

Post-attention, each $A_i$ is un-windowed back to $Y_i\in\mathbb{R}^{H\times W\times(H_i d_k)}$, and all groups are concatenated along the channel dimension:
$$Y_{\mathrm{MSA}} = \mathrm{Concat}_i(Y_i)\in\mathbb{R}^{H\times W\times C}.$$
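The per-scale attention with relative position bias can be sketched as follows (single head per group for brevity; `relative_bias_index` follows the standard Swin-style construction, and all names are illustrative rather than taken from DW-ViT's released code):

```python
import torch
import torch.nn.functional as F

def relative_bias_index(w):
    """Map each (query, key) position pair in a w x w window to an index
    into the (2w-1)^2 relative-bias parameter table (Swin-style)."""
    coords = torch.stack(torch.meshgrid(torch.arange(w), torch.arange(w),
                                        indexing="ij"), dim=0).flatten(1)   # (2, w*w)
    rel = coords[:, :, None] - coords[:, None, :]                           # (2, m, m)
    rel = rel.permute(1, 2, 0) + (w - 1)                                    # shift to be >= 0
    return rel[..., 0] * (2 * w - 1) + rel[..., 1]                          # (m, m) indices

def window_attention(tokens, wq, wk, wv, bias_table, w):
    """tokens: (num_windows, m, D) of one head group; bias_table: ((2w-1)^2,)."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    bias = bias_table[relative_bias_index(w)]                               # (m, m)
    d_k = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / d_k**0.5 + bias, dim=-1)
    return attn @ v                                                         # (num_windows, m, D)

# Toy usage for one scale: w_i = 7, group width D = 32 (one head per group here).
w_i, D = 7, 32
tokens = torch.randn(128, w_i * w_i, D)
wq, wk, wv = (torch.randn(D, D) * D**-0.5 for _ in range(3))
bias_table = torch.zeros((2 * w_i - 1) ** 2)
print(window_attention(tokens, wq, wk, wv, bias_table, w_i).shape)          # (128, 49, 32)
```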

2. Dynamic Weight-Sharing Fusion Mechanism

After per-scale attention, dynamic multi-scale fusion (“DMSW” module) selects and mixes scale features adaptively:

  • Fuse: global average pooling $y_{gp} = F_{gp}(Y)$, followed by non-linear projections $y_{fuse} = \mathrm{GELU}(W_2\,\mathrm{GELU}(W_1\, y_{gp}))$.
  • Select: soft-attention scores over scales, $s_i = v_i^\top y_{fuse}$ and $\alpha_i = \exp(s_i)/\sum_{j=1}^S \exp(s_j)$.
  • Weighted sum & residual: $Y_{sel} = \sum_{i=1}^S \alpha_i Y_i$, with a residual term $Y_{out} = Y_{sel} + \mathrm{Expand}(W_4\,\mathrm{GELU}(W_3\, y_{fuse}))$ that restores the spatial dimensions.

The parameters $(v_i, W_1, W_2, W_3, W_4)$ are shared over spatial positions but not across scales. By construction, $\sum_i \alpha_i = 1$.
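A minimal PyTorch sketch of this fusion, interpreting the equations above (module and layer names are illustrative, the reduction ratio and the pooling target are assumptions, and the per-scale outputs are assumed to share one channel width, as the weighted sum requires):

```python
import torch
import torch.nn as nn

class DynamicMultiScaleFusion(nn.Module):
    """Fuse S per-scale outputs Y_i with softmax scale weights plus a channel residual."""
    def __init__(self, dim, num_scales, reduction=4):
        super().__init__()
        hidden = dim // reduction
        # Fuse: two GELU projections of the pooled descriptor (W1, W2 in the text)
        self.fuse = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                  nn.Linear(hidden, hidden), nn.GELU())
        # Select: one score vector v_i per scale
        self.score = nn.Linear(hidden, num_scales)
        # Residual branch (W3, W4 in the text), broadcast back over spatial positions
        self.residual = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                      nn.Linear(hidden, dim))

    def forward(self, ys):                      # ys: list of S tensors (B, H, W, dim)
        y = torch.stack(ys, dim=1)              # (B, S, H, W, dim)
        y_gp = y.mean(dim=(1, 2, 3))            # global average pool -> (B, dim)
        y_fuse = self.fuse(y_gp)                # (B, hidden)
        alpha = self.score(y_fuse).softmax(-1)  # (B, S), sums to 1 over scales
        y_sel = (alpha[:, :, None, None, None] * y).sum(dim=1)   # (B, H, W, dim)
        res = self.residual(y_fuse)[:, None, None, :]            # expand over H, W
        return y_sel + res

ys = [torch.randn(2, 56, 56, 96) for _ in range(3)]
print(DynamicMultiScaleFusion(96, 3)(ys).shape)   # torch.Size([2, 56, 56, 96])
```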

3. Integration into Transformer Architectures

Each block of DW-ViT replaces the standard MSA with MSW-MSA followed by DMSW fusion. The assignment of window sizes to head groups is resolution-dependent, with empirically chosen lists (e.g., [7, 14, 21]) adapted at each stage (Ren et al., 2022). DMSW combines the multi-scale tokens and outputs them after residual fusion.

Example block pseudocode:

Input: X ∈ ℝ^{H×W×C}
1) X1 = LayerNorm(X)
2) # Multi-scale window MSA (MSW-MSA)
   for i in 1..S:
       Xi = project_heads(X1, heads = (i-1)·h/S + 1 .. i·h/S)
       Wi_windows = window_partition(Xi, window_size = w_i)
       for each window j:
          Q, K, V = Wi_windows[j] · (W^Q_i, W^K_i, W^V_i)
          A_ij = Softmax(Q Kᵀ / √d_k + B_i) · V
       Yi = window_reverse({A_ij}, H, W)
   Y_MSA = Concat_i(Yi)                       # ∈ ℝ^{H×W×C}
3) # Dynamic fusion (DMSW)
   y_gp   = avg_pool(Y_MSA)
   y_fuse = GELU(W2 · GELU(W1 · y_gp))
   for i in 1..S:  α_i = exp(v_iᵀ y_fuse) / Σ_j exp(v_jᵀ y_fuse)
   Y_sel = Σ_i α_i · Y_i
   res    = Expand(W4 · GELU(W3 · y_fuse))    # broadcast back to H×W×C
   Y2     = Y_sel + res
4) X2 = X + Y2
5) X_out = X2 + MLP(LayerNorm(X2))
return X_out

In Lawin Transformer, large-window attention is employed within a spatial pyramid pooling framework (LawinASPP), using multi-scale receptive fields (e.g., R{2,4,8}R\in\{2,4,8\}) on top of hierarchical ViT encoders (Yan et al., 2022).

4. Computational Complexity and Parameterization

For DW-ViT, with $N$ the number of tokens in the feature map and $M$ the window size of a standard single-scale W-MSA block, the per-block computational costs are approximately:

$$\Omega_{\text{W-MSA}} \simeq 4NC^2 + 2NCM^2$$

$$\Omega_{\text{MSW-MSA}} = 4NC^2 + \frac{2NC}{S}\sum_{i=1}^S w_i^2$$

$$\Omega_{\text{DMSW}} = \big(1 + N(1 + 1/S)\big)\,\frac{C^2}{S}$$

The overhead relative to single-scale window attention is $O(1)$ in $N$, with parameters increasing by $O(C^2/S)$ for the DMSW layers plus $S$ relative-bias tables of size $(2w_i-1)^2$. In Lawin Transformer, each large-window attention branch adds a cost of $+(HW)P^2C$, but the total cost is comparable to stacked local windows and remains efficient for practical $P\ll C$ (Yan et al., 2022).
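For concreteness, a toy evaluation of the DW-ViT cost formulas in Python (the feature-map size, channel width, and window list are assumptions for illustration, not configurations reported in the papers):

```python
# Toy evaluation of the complexity formulas above (values are illustrative only).
N, C, S = 56 * 56, 96, 3
M = 7                      # single-scale window size for the W-MSA baseline
w = [7, 14, 21]            # multi-scale window sizes

omega_wmsa = 4 * N * C**2 + 2 * N * C * M**2
omega_msw  = 4 * N * C**2 + (2 * N * C / S) * sum(wi**2 for wi in w)
omega_dmsw = (1 + N * (1 + 1 / S)) * C**2 / S

print(f"W-MSA   : {omega_wmsa / 1e9:.3f} GFLOPs")
print(f"MSW-MSA : {omega_msw / 1e9:.3f} GFLOPs")   # larger windows raise only the second term
print(f"DMSW    : {omega_dmsw / 1e9:.4f} GFLOPs")  # small relative to the attention terms
```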

5. Empirical Benchmark Performance

Multi-scale weight-sharing window attention consistently outperforms single-scale window transformers at a minor additional resource cost:

| Model | Params (M) | FLOPs (G) | ImageNet Top-1 (%) | ADE20K mIoU (%) | COCO AP_box / AP_mask |
|---|---|---|---|---|---|
| Swin-T [std] | 28 | 4.5 | 81.3 | 44.5 | 46.0 / 41.6 |
| DW-T | 30 | 5.2 | 82.0 (+0.7) | 45.7 (+1.2) | 46.7 / 42.4 (+0.7 / +0.8) |
| Swin-B [std] | 88 | 15.4 | 83.3 | 48.1 | 48.5 / 43.4 |
| DW-B | 91 | 17.0 | 83.8 (+0.5) | 48.7 (+0.6) | 49.2 / 44.0 (+0.7 / +0.6) |

Ablations of Lawin Transformer with large-window attention and multi-scale fusion indicate that each multi-scale branch (e.g., $R = 2, 4, 8$) and the global pooling path contribute roughly 0.4-1.0% absolute mIoU (Yan et al., 2022). Removing the position-mixing MLPs causes drops of up to 2.0%, underscoring their role in restoring fine spatial detail.

6. Relationship to Large-Window Attention and Multi-scale Pooling

Large-window attention in Lawin Transformer (Yan et al., 2022) lets each local query window attend to a much larger context region; average pooling keeps the enlarged context at a fixed token count, and a token-mixing (position-mixing) MLP restores spatial fidelity, yielding effective multi-scale context representations. Weight-sharing is realized across all windows at a given scale, and distinct trainable branches handle multi-scale aggregation. The LawinASPP decoder generalizes atrous spatial pyramid pooling to multi-scale transformer attention, enabling efficient semantic segmentation.
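A conceptual sketch of such pooled large-window attention (single head, no projections or position-mixing MLP; helper names, padding scheme, and sizes are illustrative assumptions, not Lawin's released implementation): a $P\times P$ query patch attends to a context $R$ times larger that is average-pooled back to $P\times P$ tokens, so per-patch cost stays roughly independent of $R$.

```python
import torch
import torch.nn.functional as F

def large_window_attention(x, patch=8, ratio=2):
    """x: (B, C, H, W). Each non-overlapping patch x patch query window attends to a
    context window `ratio` times larger, average-pooled back to patch x patch tokens."""
    B, C, H, W = x.shape
    P = patch
    q = F.unfold(x, kernel_size=P, stride=P)                               # (B, C*P*P, nWin)
    ctx = F.unfold(x, kernel_size=P * ratio, stride=P,
                   padding=P * (ratio - 1) // 2)                           # enlarged context
    nwin = q.shape[-1]
    q = q.reshape(B, C, P * P, nwin).permute(0, 3, 2, 1)                   # (B, nWin, P*P, C)
    ctx = ctx.reshape(B, C, P * ratio, P * ratio, nwin).permute(0, 4, 1, 2, 3)
    ctx = F.avg_pool2d(ctx.reshape(-1, C, P * ratio, P * ratio), ratio)    # pool to P x P
    kv = ctx.flatten(2).transpose(1, 2).reshape(B, nwin, P * P, C)         # (B, nWin, P*P, C)
    attn = F.softmax(q @ kv.transpose(-2, -1) / C**0.5, dim=-1)
    return attn @ kv                                                       # (B, nWin, P*P, C)

out = large_window_attention(torch.randn(1, 64, 32, 32), patch=8, ratio=2)
print(out.shape)   # torch.Size([1, 16, 64, 64])
```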

In Lawin's architecture, multi-level features from the hierarchical transformer encoder are aggregated at several strides and fed through branches whose receptive fields match multiple scales plus a global-context path; the branch outputs are fused by simple concatenation and linear projection.

7. Context and Implications for Vision Transformers

Multi-scale weight-sharing window attention enables transformer backbones to capture both local and global dependencies with flexible windowing and efficient computation. The construction supports seamless integration into Swin-style architectures, scalable adaptation to input resolutions, and consistent empirical gains across image understanding tasks. This suggests that dynamic multi-scale modeling with adaptive fusion offers a generic replacement for fixed window transformers and convolutional pooling layers, while maintaining resource efficiency and deployment compatibility (Ren et al., 2022, Yan et al., 2022).
