Multi-Scale Weight-Sharing Window Attention
- Multi-scale weight-sharing window attention is a novel technique that partitions transformer heads into groups with varying window sizes to capture both local and global features.
- It employs a dynamic weight-sharing module that fuses multi-scale representations adaptively, ensuring efficient computation and robust hierarchical context capture.
- Empirical benchmarks on ImageNet, ADE20K, and COCO confirm consistent performance improvements over standard window attention methods with minimal additional resource cost.
Multi-scale weight-sharing window attention is a mechanism within vision transformers that generalizes local windowed self-attention by partitioning attention heads into groups operating on windows of multiple spatial scales, then dynamically fusing these multi-scale representations via learnable weight-sharing modules. This approach, notably implemented in the Dynamic Window Vision Transformer (DW-ViT) and closely related to the large-window attention of Lawin Transformer, captures hierarchical context efficiently and yields consistent empirical improvements on image classification, segmentation, and detection benchmarks (Ren et al., 2022; Yan et al., 2022).
1. Mathematical Formulation of Multi-Scale Window Attention
Let $X \in \mathbb{R}^{H \times W \times C}$ be the input feature map. In DW-ViT, the $h$ attention heads are split into $S$ groups; group $i$ operates with window size $w_i$. Each group contains $h/S$ heads, and the per-head hidden dimension is $d_k = C/h$. The feature map is reshaped to expose the head dimension and divided along it into $S$ disjoint chunks $X_1, \dots, X_S$ with $X_i \in \mathbb{R}^{H \times W \times C/S}$. Each $X_i$ is partitioned into non-overlapping windows of size $w_i \times w_i$, yielding $HW/w_i^2$ windows per scale.
Within window $j$ of head group $i$, the projections are computed as
$$Q_{ij} = X_{ij} W_i^{Q}, \qquad K_{ij} = X_{ij} W_i^{K}, \qquad V_{ij} = X_{ij} W_i^{V},$$
where $X_{ij} \in \mathbb{R}^{w_i^2 \times C/S}$ collects the tokens of window $j$.
Attention is computed for each scale using a relative position bias:
$$\mathrm{Attn}(Q_{ij}, K_{ij}, V_{ij}) = \mathrm{Softmax}\!\left(\frac{Q_{ij} K_{ij}^{\top}}{\sqrt{d_k}} + B_i\right) V_{ij},$$
where $B_i \in \mathbb{R}^{w_i^2 \times w_i^2}$ indexes relative spatial displacements within the window from the learnable parameter tensor $\hat{B}_i \in \mathbb{R}^{(2w_i-1) \times (2w_i-1)}$.
Post-attention, the per-scale outputs $Y_i$ (un-windowed back to $\mathbb{R}^{H \times W \times C/S}$) are concatenated along the channel dimension:
$$Y_{\mathrm{MSA}} = \mathrm{Concat}(Y_1, \dots, Y_S) \in \mathbb{R}^{H \times W \times C}.$$
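The per-scale computation maps directly onto a standard windowed-attention implementation. The PyTorch sketch below covers one head group (one window size); it is illustrative rather than the official DW-ViT code, the helper and class names (`window_partition`, `ScaleGroupWindowAttention`) are assumptions, and it assumes $H$ and $W$ are divisible by the window size.

```python
# Minimal sketch (not the official DW-ViT code) of one head group's windowed
# attention with a relative position bias, following the formulation above.
import torch
import torch.nn as nn


def window_partition(x, w):
    """Split (B, H, W, C) into non-overlapping w x w windows -> (B*nW, w*w, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)


def window_reverse(windows, w, H, W):
    """Inverse of window_partition: (B*nW, w*w, C) -> (B, H, W, C)."""
    B = windows.shape[0] // ((H // w) * (W // w))
    x = windows.view(B, H // w, W // w, w, w, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class ScaleGroupWindowAttention(nn.Module):
    """Windowed multi-head self-attention for one scale group (window size w_i)."""

    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.heads, self.w = num_heads, window_size
        self.scale = (dim // num_heads) ** -0.5          # 1 / sqrt(d_k)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per relative displacement, shared by all windows.
        self.rel_bias = nn.Parameter(
            torch.zeros((2 * window_size - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"))
        rel = coords.flatten(1)[:, :, None] - coords.flatten(1)[:, None, :]
        rel = (rel + window_size - 1).permute(1, 2, 0)
        self.register_buffer(
            "bias_index", rel[..., 0] * (2 * window_size - 1) + rel[..., 1])

    def forward(self, x):                                 # x: (B, H, W, C_group)
        B, H, W, C = x.shape
        win = window_partition(x, self.w)                 # (B*nW, w*w, C)
        qkv = self.qkv(win).reshape(win.shape[0], -1, 3, self.heads,
                                    C // self.heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q * self.scale) @ k.transpose(-2, -1)
        attn = attn + self.rel_bias[self.bias_index].permute(2, 0, 1)
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(win.shape[0], -1, C)
        return window_reverse(self.proj(out), self.w, H, W)
```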
2. Dynamic Weight-Sharing Fusion Mechanism
After per-scale attention, dynamic multi-scale fusion (“DMSW” module) selects and mixes scale features adaptively:
- Fuse: global average pooling $y_{gp} = \mathrm{AvgPool}(Y_{\mathrm{MSA}}) \in \mathbb{R}^{C}$, followed by non-linear projection layers $y_{\mathrm{fuse}} = \mathrm{GELU}(W_2\,\mathrm{GELU}(W_1\, y_{gp}))$.
- Select: soft-attention scores over scales, $\alpha_i = \dfrac{\exp(v_i^{\top} y_{\mathrm{fuse}})}{\sum_{j=1}^{S}\exp(v_j^{\top} y_{\mathrm{fuse}})}$, $i = 1,\dots,S$.
- Weighted sum & residual: $Y_{\mathrm{sel}} = \sum_{i=1}^{S} \alpha_i Y_i$, plus a residual branch $\mathrm{res} = W_4\,\mathrm{GELU}(W_3\, y_{\mathrm{fuse}})$ broadcast over the spatial dimensions, giving $Y_2 = Y_{\mathrm{sel}} + \mathrm{res}$.
Parameters ($W_1, W_2, W_3, W_4$ and the selection vectors $v_i$) are shared over spatial positions but not across scales. By construction, $\sum_{i=1}^{S} \alpha_i = 1$.
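A minimal sketch of the fuse/select/weighted-sum steps is given below. It assumes each per-scale output has already been brought to the full channel width $C$ (rather than $C/S$); the class name `DynamicScaleFusion`, the reduction ratio, and the pooled-summary choice are assumptions, and this is not the released DMSW implementation.

```python
# Simplified sketch of DMSW-style dynamic scale fusion (illustrative only).
import torch
import torch.nn as nn


class DynamicScaleFusion(nn.Module):
    """Pool a summary of the multi-scale features, score each scale, and return
    a convex combination of the per-scale features plus a residual branch."""

    def __init__(self, dim, num_scales, reduction=4):
        super().__init__()
        hidden = max(dim // reduction, 8)
        self.fuse = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),   # W1, W2
                                  nn.Linear(hidden, hidden), nn.GELU())
        self.select = nn.Linear(hidden, num_scales)                    # rows act as v_i
        self.residual = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                      nn.Linear(hidden, dim))          # W3, W4

    def forward(self, scale_feats):
        # scale_feats: list of S tensors, each (B, H, W, C); this width-C
        # assumption keeps the sketch's shapes consistent.
        stacked = torch.stack(scale_feats, dim=1)        # (B, S, H, W, C)
        y_gp = stacked.sum(dim=1).mean(dim=(1, 2))       # pooled summary (B, C)
        y_fuse = self.fuse(y_gp)                         # (B, hidden)
        alpha = self.select(y_fuse).softmax(dim=-1)      # (B, S), sums to 1
        y_sel = (alpha[:, :, None, None, None] * stacked).sum(dim=1)   # (B, H, W, C)
        res = self.residual(y_fuse)[:, None, None, :]    # broadcast over H, W
        return y_sel + res
```

Because the scale scores pass through a softmax, the fusion is a convex combination, so the module can smoothly interpolate between small-window and large-window behaviour on a per-image basis.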
3. Integration into Transformer Architectures
Each block of DW-ViT replaces standard MSA with the MSW-MSA and DMSW fusion. The assignment of window sizes to head groups is resolution dependent, with empirically chosen lists (e.g., [7,14,21]) adapting at each stage (Ren et al., 2022). DMSW combines multi-scale tokens and outputs them after residual fusion.
Example block pseudocode:
```
Input: X ∈ ℝ^{H×W×C}

1) X1 = LayerNorm(X)

2) # Multi-scale window MSA
   for i in 1…S:
       Xi         = project_heads(X1, heads = ((i−1)·h/S)…(i·h/S))
       Wi_windows = window_partition(Xi, window_size = w_i)
       for each window j:
           Q, K, V = Wi_windows[j] · (W^Q_i, W^K_i, W^V_i)
           A_ij    = Softmax(Q·Kᵀ / √d_k + B_i) · V
       Yi = window_reverse({A_ij}, H, W)
   Y_MSA = Concat_i(Yi)                      # ℝ^{H×W×C}

3) # Dynamic fusion (DMSW)
   y_gp   = avg_pool(Y_MSA)
   y_fuse = GELU(W₂ · GELU(W₁ · y_gp))
   for i in 1…S:
       α_i = exp(v_iᵀ y_fuse) / ∑_j exp(v_jᵀ y_fuse)
   Y_sel = ∑_i α_i · Y_i
   res   = W₄ · GELU(W₃ · y_fuse)            # reshaped/broadcast to H×W×C
   Y2    = Y_sel + res

4) X2    = X + Y2
5) X_out = X2 + MLP(LayerNorm(X2))

return X_out
```
In Lawin Transformer, large-window attention is employed within a spatial pyramid pooling framework (LawinASPP), using branches with progressively enlarged receptive fields together with a global pooling branch on top of hierarchical ViT encoders (Yan et al., 2022).
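For contrast with MSW-MSA, a simplified sketch of large-window attention is shown below. It assumes queries come from a local patch window while keys and values come from a larger surrounding window that is average-pooled back to the patch resolution and passed through a position-mixing MLP; the class name, `patch_size`, and `context_ratio` defaults are illustrative, and this is not the released Lawin code.

```python
# Simplified sketch of large-window attention: local query window, pooled
# large-window context as keys/values, plus a position-mixing MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LargeWindowAttention(nn.Module):
    def __init__(self, dim, num_heads, patch_size=8, context_ratio=2):
        super().__init__()
        self.p, self.r = patch_size, context_ratio
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Token-mixing MLP over pooled context positions (restores spatial detail).
        self.pos_mix = nn.Linear(patch_size * patch_size, patch_size * patch_size)

    def forward(self, x):                    # x: (B, C, H, W); H, W divisible by p
        B, C, H, W = x.shape
        p, ctx = self.p, self.p * self.r
        # Local query windows: (B*nW, p*p, C)
        q = F.unfold(x, kernel_size=p, stride=p)
        q = q.transpose(1, 2).reshape(-1, C, p * p).transpose(1, 2)
        # Large context windows centred on each patch, pooled back to p x p tokens.
        pad = (ctx - p) // 2
        kv = F.unfold(F.pad(x, [pad] * 4), kernel_size=ctx, stride=p)
        kv = kv.transpose(1, 2).reshape(-1, C, ctx, ctx)
        kv = F.adaptive_avg_pool2d(kv, p).flatten(2)      # (B*nW, C, p*p)
        kv = self.pos_mix(kv).transpose(1, 2)             # (B*nW, p*p, C)
        out, _ = self.attn(q, kv, kv)                     # (B*nW, p*p, C)
        # Fold the per-window outputs back into a (B, C, H, W) feature map.
        out = out.transpose(1, 2).reshape(B, -1, C * p * p).transpose(1, 2)
        return F.fold(out, output_size=(H, W), kernel_size=p, stride=p)
```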
4. Computational Complexity and Parameterization
For DW-ViT MSW-MSA, the computational cost of the attention at each block is
$$\Omega(\mathrm{MSW\text{-}MSA}) = 4HWC^2 + \frac{2HWC}{S}\sum_{i=1}^{S} w_i^2,$$
compared with $\Omega(\mathrm{W\text{-}MSA}) = 4HWC^2 + 2w^2HWC$ for a single window size $w$.
Relative to single-window attention, the overhead lies entirely in the attention term, scaling as $\frac{1}{S}\sum_i w_i^2$ versus $w^2$, while parameters grow only by the small DMSW projection layers and the per-scale relative bias tensors $B_i$. In Lawin Transformer, each large-window branch enlarges the attended context, but pooling that context back to the query-window resolution keeps the per-branch cost close to that of local-window attention, so the total cost remains comparable to stacked local windows for practical window sizes (Yan et al., 2022).
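The snippet below simply evaluates the two cost expressions for a hypothetical stage configuration; the dimensions and window list are illustrative, not taken from either paper.

```python
# Illustrative evaluation of the per-block cost formulas above.

def wmsa_flops(H, W, C, w):
    """Single-window W-MSA: 4*H*W*C^2 (projections) + 2*w^2*H*W*C (attention)."""
    return 4 * H * W * C**2 + 2 * w**2 * H * W * C

def msw_msa_flops(H, W, C, windows):
    """Multi-scale MSW-MSA: 4*H*W*C^2 + (2*H*W*C/S) * sum_i w_i^2."""
    S = len(windows)
    return 4 * H * W * C**2 + 2 * H * W * C * sum(w**2 for w in windows) // S

H, W, C = 28, 28, 192                        # hypothetical mid-network stage
single = wmsa_flops(H, W, C, w=7)
multi = msw_msa_flops(H, W, C, windows=[7, 14])
print(f"W-MSA:   {single / 1e6:.1f} MFLOPs")
print(f"MSW-MSA: {multi / 1e6:.1f} MFLOPs ({multi / single:.2f}x)")
```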
5. Empirical Benchmark Performance
Multi-scale weight-sharing window attention consistently outperforms single-scale window transformers at the cost of a small increase in parameters and FLOPs.
| Model | Params (M) | FLOPs (G) | ImageNet Top-1 (%) | ADE20K mIoU (%) | COCO AP_box / AP_mask |
|---|---|---|---|---|---|
| Swin-T [std] | 28 | 4.5 | 81.3 | 44.5 | 46.0 / 41.6 |
| DW-T | 30 | 5.2 | 82.0 (+0.7) | 45.7 (+1.2) | 46.7 / 42.4 (+0.7/+0.8) |
| Swin-B [std] | 88 | 15.4 | 83.3 | 48.1 | 48.5 / 43.4 |
| DW-B | 91 | 17.0 | 83.8 (+0.5) | 48.7 (+0.6) | 49.2 / 44.0 (+0.7/+0.6) |
Ablations of Lawin Transformer with large-window attention and multi-scale fusion indicate that each multi-scale branch and the global pooling branch contribute on the order of 0.4 absolute mIoU (Yan et al., 2022). Removing the position-mixing MLPs causes clear performance drops, underscoring their role in restoring fine spatial detail.
6. Relationship to Large-Window Attention and Multi-scale Pooling
Large-window attention within Lawin Transformer (Yan et al., 2022) lets each local query window attend to a larger surrounding context window, with average pooling and a token-mixing MLP restoring spatial fidelity, yielding effective multi-scale context representations. Weights are shared across all windows at a given scale, while distinct trainable branches handle the different scales. The LawinASPP decoder generalizes atrous spatial pyramid pooling to multi-scale transformer attention, achieving efficient semantic segmentation.
In Lawin's architecture, multi-level features from the hierarchical transformer encoder are aggregated at several strides and fed through branches whose receptive fields match multiple scales plus a global-context branch; the branch outputs are fused by simple concatenation and a linear projection.
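A skeleton of such a decoder head is sketched below; for brevity, plain pooling branches stand in for the large-window attention branches, and the class name `MultiScaleContextHead` and pooling ratios are illustrative rather than Lawin's actual configuration.

```python
# Skeleton sketch of a LawinASPP-style head: several receptive-field branches
# plus a global-context branch, fused by concatenation and a 1x1 projection.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleContextHead(nn.Module):
    def __init__(self, dim, out_dim, pool_sizes=(2, 4, 8)):
        super().__init__()
        self.identity = nn.Conv2d(dim, out_dim, 1)
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AvgPool2d(p, p), nn.Conv2d(dim, out_dim, 1))
            for p in pool_sizes)
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, out_dim, 1))
        self.fuse = nn.Conv2d(out_dim * (len(pool_sizes) + 2), out_dim, 1)

    def forward(self, x):                 # x: (B, dim, H, W) aggregated encoder features
        H, W = x.shape[-2:]
        feats = [self.identity(x)]
        for branch in self.branches:      # multi-scale context branches
            feats.append(F.interpolate(branch(x), size=(H, W),
                                       mode="bilinear", align_corners=False))
        feats.append(F.interpolate(self.global_branch(x), size=(H, W),
                                   mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))   # concatenation + projection
```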
7. Context and Implications for Vision Transformers
Multi-scale weight-sharing window attention enables transformer backbones to capture both local and global dependencies with flexible windowing and efficient computation. The construction supports seamless integration into Swin-style architectures, scalable adaptation to input resolutions, and consistent empirical gains across image understanding tasks. This suggests that dynamic multi-scale modeling with adaptive fusion offers a generic replacement for fixed window transformers and convolutional pooling layers, while maintaining resource efficiency and deployment compatibility (Ren et al., 2022, Yan et al., 2022).