
Window-Based Supertokens in Transformers

Updated 31 January 2026
  • Window-based supertokens are learnable CLS tokens prepended to each window that summarize local patch information for efficient global interactions.
  • The CLS attention mechanism employs local window encoding followed by cross-window aggregation to capture long-range dependencies via multi-head self-attention.
  • The Feature Inheritance Module and Spatial-Channel FFN ensure multi-scale context integration and computational efficiency, leading to robust performance in vision tasks.

Window-based supertokens refer to the learnable “CLS” tokens introduced into each window of a window-based transformer architecture, specifically in the Token Transformer (TT) model. These CLS tokens, termed supertokens, serve as window-level summaries and enable efficient long-range interactions within a hierarchical vision transformer framework. This mechanism enhances the vanilla window-based transformer by injecting a minimal-cost global pathway, thereby improving modeling of relationships across distant regions, while preserving efficiency and scalability (Mao et al., 2022).

1. Supertokens: Window-local Summary and Interaction

In the TT architecture, feature maps are decomposed into non-overlapping windows, each grouping $M$ patch tokens $z\in\mathbb{R}^{M\times C}$. To each window, a learnable CLS token $\mathrm{cls}\in\mathbb{R}^{1\times C}$ is prepended, resulting in $\hat z = [\mathrm{cls},\,z]\in\mathbb{R}^{(M+1)\times C}$. These CLS tokens are trained end-to-end and are uniquely assigned to window positions within every stage. Following one round of multi-head window-based self-attention (W-MSA), each CLS token summarizes its local window. These supertokens are then responsible for facilitating cross-window interaction in subsequent processing stages.

This organizational scheme retains the core computational benefits of window-based attention while providing each window with a condensed representation for downstream aggregation.
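The window decomposition and CLS prepending described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names `partition_windows` and `prepend_cls`, the toy sizes, and the random values standing in for learned CLS parameters are all assumptions.

```python
import numpy as np

def partition_windows(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows.

    Returns an array of shape (T, M, C) with T = (H // ws) * (W // ws)
    windows and M = ws * ws patch tokens per window.
    """
    H, W, C = x.shape
    assert H % ws == 0 and W % ws == 0, "map must tile evenly into windows"
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/ws, W/ws, ws, ws, C)
    return x.reshape(-1, ws * ws, C)          # (T, M, C)

def prepend_cls(windows, cls):
    """Prepend one CLS token per window: (T, M, C) and (T, 1, C) -> (T, M+1, C)."""
    return np.concatenate([cls, windows], axis=1)

rng = np.random.default_rng(0)
H, W, C, ws = 8, 8, 16, 4                     # hypothetical toy sizes
feature_map = rng.standard_normal((H, W, C))
windows = partition_windows(feature_map, ws)  # T = 4 windows, M = 16 tokens each
# One CLS token per window position; random values stand in for learned parameters.
cls_tokens = rng.standard_normal((windows.shape[0], 1, C))
z_hat = prepend_cls(windows, cls_tokens)
print(windows.shape, z_hat.shape)
```

Each window keeps its own row of `z_hat`, so window-local attention can run independently per window with the CLS token at index 0.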

2. CLS Attention: Mechanism for Long-range Dependencies

CLS Attention in TT constitutes the principal component for long-range token interaction. The process involves two main steps:

a) Local Window Encoding:

Each window’s tokens, including the prepended CLS token, undergo W-MSA. Specifically, $\hat z = [\mathrm{cls},\,z]$, $\hat z = \mathrm{W\!-\!MSA}(\mathrm{LN}(\hat z)) + \hat z$, where LN denotes LayerNorm. The updated CLS token encodes local spatial information.

b) Cross-window Aggregation:

The CLS tokens collected from all $T$ windows are concatenated into $z_q = [\,\mathrm{cls}^{(1)},\ldots,\mathrm{cls}^{(T)}\,]\in\mathbb{R}^{T\times C}$ and serve as queries in multi-head cross-attention against the set of patch tokens $z\in\mathbb{R}^{(T\cdot M)\times C}$. The cross-attention follows: $q = z_q W_q$, $k = z W_k$, $v = z W_v$, with $o^{(n)}=\mathrm{Softmax}\left(\frac{q^{(n)}k^{(n)T}}{\sqrt d} + B\right)v^{(n)}$ for each head $n = 1,\ldots,N$, where $N$ is the number of heads and $B$ is a shared bias. This step enables each CLS token to attend globally, capturing dependencies across all windows. Algorithmic implementations are detailed in Algorithm 1 of the referenced paper.

CLS Attention thus establishes a low-cost, high-capacity pathway for modeling global interactions in hierarchical window-based architectures.
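The two steps can be sketched in NumPy under some simplifying assumptions. This is not Algorithm 1 of the paper: the helper names are hypothetical, the shared bias $B$ defaults to zero, LayerNorm has no learned scale or shift, output projections are omitted, and no residual or normalization is applied around the cross-attention because the text above does not specify one.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)     # subtract row max for stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def split_heads(x, n_heads):
    # (..., L, C) -> (..., n_heads, L, C // n_heads)
    *lead, L, C = x.shape
    return np.swapaxes(x.reshape(*lead, L, n_heads, C // n_heads), -3, -2)

def merge_heads(x):
    # (..., n_heads, L, d) -> (..., L, n_heads * d)
    x = np.swapaxes(x, -3, -2)
    *lead, L, n, d = x.shape
    return x.reshape(*lead, L, n * d)

def attention(q_in, kv_in, Wq, Wk, Wv, n_heads, B=0.0):
    """Multi-head attention: Softmax(q k^T / sqrt(d) + B) v for every head."""
    q = split_heads(q_in @ Wq, n_heads)
    k = split_heads(kv_in @ Wk, n_heads)
    v = split_heads(kv_in @ Wv, n_heads)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1]) + B
    return merge_heads(softmax(scores) @ v)

def cls_attention(z_hat, local_w, cross_w, n_heads):
    # a) Local window encoding: z_hat = W-MSA(LN(z_hat)) + z_hat, per window.
    h = layer_norm(z_hat)
    z_hat = attention(h, h, *local_w, n_heads) + z_hat
    cls, patches = z_hat[:, 0], z_hat[:, 1:]          # (T, C), (T, M, C)
    # b) Cross-window aggregation: T CLS queries attend to all T*M patch tokens.
    z_q = cls
    z = patches.reshape(-1, patches.shape[-1])        # (T*M, C)
    return attention(z_q, z, *cross_w, n_heads), patches

rng = np.random.default_rng(0)
T, M, C, n_heads = 4, 16, 32, 4                       # hypothetical toy sizes
z_hat = rng.standard_normal((T, M + 1, C))
local_w = [rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3)]
cross_w = [rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3)]
cls_out, patches = cls_attention(z_hat, local_w, cross_w, n_heads)
print(cls_out.shape, patches.shape)
```

The score matrix in step b) has shape $T \times (T\cdot M)$ per head rather than $(T\cdot M)\times(T\cdot M)$, which is where the cost saving relative to full global attention comes from.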

3. Feature Inheritance Module (FIM): Multi-scale Continuity

To preserve hierarchical modeling, where resolutions and window counts change across stages, the Feature Inheritance Module (FIM) merges the previous stage’s CLS tokens into the new ones. Let $T_{\text{old}}=H\times W$, $T_{\text{new}}=(H/2)\times(W/2)$, with $\mathrm{ori\_cls}$ and $\mathrm{new\_cls}$ as old and new supertokens, respectively:

  1. Downsample: $\mathrm{ori\_cls}$ is reshaped and max-pooled (or convolved) to match the spatial shape of the next stage.
  2. Merge: concatenate the pooled $\mathrm{ori\_cls}$ with $\mathrm{new\_cls}$ along the channel dimension: $\tilde{\mathrm{cls}} = \mathrm{concat}[\mathrm{pooled\_ori\_cls},\,\mathrm{new\_cls}]$.
  3. Projection: apply a linear projection $W_c$ to form the updated new CLS tokens: $\mathrm{cls\_out} = \tilde{\mathrm{cls}} W_c$.

This mechanism ensures newly introduced CLS supertokens inherit multi-scale context, yielding robustness in hierarchical feature aggregation.
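The three FIM steps can be sketched as below. This is an illustrative sketch only: it assumes the old CLS tokens are laid out on an even-sized $H\times W$ grid, uses 2×2 max-pooling for the downsampling step (the text also allows a convolution), and lets the old and new channel widths differ, which the source does not specify. The names `max_pool_2x2` and `feature_inheritance` are hypothetical.

```python
import numpy as np

def max_pool_2x2(grid):
    """2x2 max-pool over the spatial axes of an (H, W, C) grid of CLS tokens."""
    H, W, C = grid.shape
    assert H % 2 == 0 and W % 2 == 0, "sketch assumes even grid sizes"
    return grid.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def feature_inheritance(ori_cls, new_cls, W_c):
    """Merge the previous stage's CLS tokens into the new stage's CLS tokens.

    ori_cls: (H, W, C_old), new_cls: (H/2, W/2, C_new),
    W_c: (C_old + C_new, C_new) projection matrix.
    """
    pooled = max_pool_2x2(ori_cls)                        # 1. downsample
    merged = np.concatenate([pooled, new_cls], axis=-1)   # 2. merge on channels
    return merged @ W_c                                   # 3. project

rng = np.random.default_rng(0)
ori_cls = rng.standard_normal((4, 4, 8))      # T_old = 4 x 4 windows (toy sizes)
new_cls = rng.standard_normal((2, 2, 16))     # T_new = 2 x 2 windows
W_c = rng.standard_normal((8 + 16, 16))
cls_out = feature_inheritance(ori_cls, new_cls, W_c)
print(cls_out.shape)
```

The output has the same shape as `new_cls`, so it can replace the freshly initialized CLS tokens of the next stage without changing any downstream shapes.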

4. Spatial-Channel Feedforward Network (SCFFN)

The conventional two-layer MLP feed-forward network (FFN) in transformers is replaced in TT with a lightweight Spatial-Channel FFN:

a) Spatial Mixing:

Apply an MLP along the sequence (token) dimension: $x' = \mathrm{MLP}(\mathrm{LN}(x)) + x$.

b) Channel Mixing:

Transpose to $(B,C,N)$ and apply a $1\times1$ convolution with residual: $x'' = \mathrm{Conv}_{1\times1}(\mathrm{LN}(\mathrm{trans}(x'))) + \mathrm{trans}(x')$.

c) Restore Output Shape:

Transpose back and concatenate with CLS token output.

The SCFFN incorporates both spatial and channel mixing with no additional parameters compared to a standard FFN, as the $1\times1$ convolution replaces the second MLP layer.
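A rough NumPy sketch of steps a) to c) follows. Several details are assumptions rather than facts from the source: the spatial "MLP" is reduced to a single linear map over the token axis followed by ReLU, LayerNorm normalizes over channels and has no learned parameters, the $1\times1$ convolution is written as a per-token channel mixing via `einsum`, and the CLS output is passed in ready-made because its own processing path is not described here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def scffn(patches, cls_out, W_spatial, W_channel):
    """Spatial-channel feed-forward sketch.

    patches: (B, N, C), cls_out: (B, 1, C),
    W_spatial: (N, N) token-mixing weights,
    W_channel: (C, C) weights of a 1x1 convolution (kernel axes squeezed).
    """
    # a) Spatial mixing: mix along the token axis, then add the residual.
    h = np.swapaxes(layer_norm(patches), 1, 2) @ W_spatial     # (B, C, N)
    x1 = np.swapaxes(np.maximum(h, 0.0), 1, 2) + patches       # (B, N, C)
    # b) Channel mixing: in (B, C, N) layout a 1x1 convolution is a linear
    #    map over channels applied independently at every token position.
    t = np.swapaxes(x1, 1, 2)                                  # (B, C, N)
    h = np.swapaxes(layer_norm(x1), 1, 2)                      # (B, C, N)
    x2 = np.einsum("oc,bcn->bon", W_channel, h) + t            # (B, C, N)
    # c) Restore (B, N, C) and concatenate with the CLS token output.
    out = np.swapaxes(x2, 1, 2)
    return np.concatenate([cls_out, out], axis=1)              # (B, N+1, C)

rng = np.random.default_rng(0)
B, N, C = 2, 16, 8                            # hypothetical toy sizes
patches = rng.standard_normal((B, N, C))
cls_out = rng.standard_normal((B, 1, C))
y = scffn(patches, cls_out,
          rng.standard_normal((N, N)) / np.sqrt(N),
          rng.standard_normal((C, C)) / np.sqrt(C))
print(y.shape)
```

With both weight matrices set to zero the block reduces to the identity on the patch tokens, which is a quick way to confirm that the two residual paths are wired as in the formulas above.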

5. Computational Trade-offs and Complexity

Relative to Swin Transformer, TT introduces minimal computational overhead:

  • The addition of a CLS token increases each window's token count from $M$ to $M+1$, incurring an extra $O(N_w M C)$ cost.
  • Cross-attention between the supertokens and all patch tokens is $O(N_w^2 M C)$, with $N_w \approx \frac{HW}{M}$; this is a factor of $M$ below the $O(N_w^2 M^2 C)$ cost of full global attention over all tokens.
  • Empirically, TT-Tiny records 3.9 GFLOPs versus 4.5 GFLOPs for Swin-T, with a small parameter increase of 1–2 M.

This efficiency enables TT to scale effectively on standard vision tasks without prohibitive resource demands.
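To make the asymptotic terms concrete, the following sketch counts multiply-accumulates for the attention score computation only (the $QK^T$ product), ignoring projections, heads, and constants. The function name and the example sizes (a 56×56 token grid, $M=49$, $C=96$) are illustrative assumptions, not figures taken from the paper.

```python
def attention_score_costs(H, W, M, C):
    """Leading-order multiply-accumulate counts for the QK^T product only."""
    n_tokens = H * W
    assert n_tokens % M == 0, "sketch assumes the map tiles evenly into windows"
    n_w = n_tokens // M                        # number of windows N_w
    window = n_w * M * M * C                   # W-MSA with M tokens per window
    window_cls = n_w * (M + 1) ** 2 * C        # W-MSA with M + 1 tokens per window
    return {
        "n_w": n_w,
        "cls_overhead": window_cls - window,   # N_w * (2M + 1) * C = O(N_w M C)
        "cross": n_w * n_tokens * C,           # N_w queries x N_w*M keys = O(N_w^2 M C)
        "full": n_tokens * n_tokens * C,       # global attention over all tokens
    }

costs = attention_score_costs(H=56, W=56, M=49, C=96)
for name, value in costs.items():
    print(f"{name:>12}: {value:,}")
```

In this count the cross-attention term is exactly $M$ times smaller than full global attention, and the overhead of the extra CLS token per window is smaller again by roughly a factor of $N_w/2$.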

6. Experimental Benchmarks and Empirical Findings

On ImageNet-1k at $224\times224$ resolution:

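| Model | Params (M) | FLOPs (G) | Top-1 Acc (%) | $\Delta$ vs Swin |
|-------|------------|-----------|---------------|------------------|
| TT-T  | 25         | 3.9       | 82.5          | +1.2             |
| TT-S  | 47         | 7.7       | 83.7          | +0.7             |
| TT-B  | 86         | 14.6      | 84.2          | +0.7             |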

Further results:

  • COCO (Mask R-CNN 3×):
    • Swin-T: AP$^b$ = 46.0, AP$^m$ = 41.6
    • TT-T: AP$^b$ = 47.2, AP$^m$ = 42.6
  • ADE20K (UperNet):
    • Swin-T: mIoU = 44.5
    • TT-T: mIoU = 46.3

Ablations reveal:

  • Replacing CLS cross-attention with SW-MSA reduces Top-1 by 0.5%.
  • SCFFN outperforms standard FFN on both CLS and patch tokens by 0.3%.
  • Removing FIM between stages lowers Top-1 by 0.3%.

These findings indicate that window-based supertokens are critical for the improved performance and efficiency of TT (Mao et al., 2022).

7. Visualization and Interpretability

Architectural visualizations (see Fig. 2 of Mao et al., 2022) demonstrate the hierarchical structure: PatchEmbed $\to$ TokenTransformer Blocks (with CLS & SCFFN) $\to$ FIM at stage boundaries. CLS Attention is depicted (Fig. 3a) as local window-based self-attention followed by multi-head cross-attention from supertokens, affirming their role in mediating window interactions. FIM’s flow (Fig. 4) highlights reshaping, downsampling, and merging of old and new CLS tokens. Attention maps (Fig. 7) show CLS tokens attending to semantically distant image regions, corroborating their effectiveness in modeling global dependencies.

A plausible implication is that window-based supertokens offer a general solution for extending local attention mechanisms with global modeling capabilities, maintaining efficiency and scalability within hierarchical transformer pipelines.
