Window-Based Supertokens in Transformers

Updated 31 January 2026

Window-based supertokens are learnable CLS tokens prepended to each window that summarize local patch information for efficient global interactions.
The CLS attention mechanism employs local window encoding followed by cross-window aggregation to capture long-range dependencies via multi-head self-attention.
The Feature Inheritance Module and Spatial-Channel FFN ensure multi-scale context integration and computational efficiency, leading to robust performance in vision tasks.

Window-based supertokens refer to the learnable “CLS” tokens introduced into each window of a window-based transformer architecture, specifically in the Token Transformer (TT) model. These CLS tokens, termed supertokens, serve as window-level summaries and enable efficient long-range interactions within a hierarchical vision transformer framework. This mechanism enhances the vanilla window-based transformer by injecting a minimal-cost global pathway, thereby improving modeling of relationships across distant regions, while preserving efficiency and scalability (Mao et al., 2022).

1. Supertokens: Window-local Summary and Interaction

In the TT architecture, feature maps are decomposed into non-overlapping windows, each grouping $M$ patch tokens $z\in\mathbb{R}^{M\times C}$. To each window, a learnable CLS token $\mathrm{cls}\in\mathbb{R}^{1\times C}$ is prepended, resulting in $\hat z = [\mathrm{cls},\,z]\in\mathbb{R}^{(M+1)\times C}$. These CLS tokens are trained end-to-end and are uniquely assigned to window positions within every stage. Following one round of multi-head window-based self-attention (W-MSA), each CLS token summarizes its local window. These supertokens are then responsible for facilitating cross-window interaction in subsequent processing stages.

This organizational scheme retains the core computational benefits of window-based attention while providing each window with a condensed representation for downstream aggregation.
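The window partition and CLS prepending described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the helper names `partition_windows` and `prepend_cls`, the toy sizes, and the zero-initialized CLS array (standing in for learnable parameters) are all assumptions made for the example.

```python
import numpy as np

def partition_windows(feat, win):
    """Split an (H, W, C) feature map into non-overlapping win x win windows.

    Returns an array of shape (T, M, C) with T = (H // win) * (W // win)
    windows and M = win * win patch tokens per window.
    """
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0, "H and W must be divisible by win"
    x = feat.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/win, W/win, win, win, C)
    return x.reshape(-1, win * win, C)        # (T, M, C)

def prepend_cls(windows, cls_tokens):
    """Prepend one CLS token per window: (T, M, C) and (T, 1, C) -> (T, M+1, C)."""
    return np.concatenate([cls_tokens, windows], axis=1)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))        # toy feature map: H = W = 8, C = 16
windows = partition_windows(feat, win=4)      # T = 4 windows, M = 16 tokens each
cls_tokens = np.zeros((windows.shape[0], 1, 16))  # stand-in for learnable CLS parameters
z_hat = prepend_cls(windows, cls_tokens)
print(windows.shape, z_hat.shape)             # (4, 16, 16) (4, 17, 16)
```

Each row of `z_hat` corresponds to one window's $\hat z$, so window attention can then be applied independently along the first axis.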

2. CLS Attention: Mechanism for Long-range Dependencies

CLS Attention in TT constitutes the principal component for long-range token interaction. The process involves two main steps:

a) Local Window Encoding:

Each window’s tokens, including the prepended CLS token, undergo W-MSA. Specifically, with $\hat z = [\mathrm{cls},\,z]$, the update is $\hat z' = \mathrm{W\!-\!MSA}(\mathrm{LN}(\hat z)) + \hat z$, where LN denotes LayerNorm. The CLS row of $\hat z'$ then encodes local spatial information.

b) Cross-window Aggregation:

The CLS tokens collected from all $T$ windows are concatenated into $z_q = [\,\mathrm{cls}^{(1)},\ldots,\mathrm{cls}^{(T)}\,]\in\mathbb{R}^{T\times C}$ and serve as queries in multi-head cross-attention against the set of patch tokens $z\in\mathbb{R}^{(T\cdot M)\times C}$. The cross-attention follows: $q = z_q W_q$, $k = z W_k$, $v = z W_v$, with $o^{(n)}=\mathrm{Softmax}\left(\frac{q^{(n)}k^{(n)\top}}{\sqrt d} + B\right)v^{(n)}$ for each head $n = 1,\ldots,N$, where $N$ is the number of heads, $d$ is the per-head dimension, and $B$ is a shared bias. This step enables each CLS token to attend globally, capturing dependencies across all windows. Algorithmic implementations are detailed in Algorithm 1 of the referenced paper.

CLS Attention thus establishes a low-cost, high-capacity pathway for modeling global interactions in hierarchical window-based architectures.
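The cross-window aggregation step can be sketched as follows. This is a hedged NumPy illustration of the formula above rather than a reproduction of the paper's Algorithm 1: the function name `cls_cross_attention`, the random weights, the scalar default for the bias $B$, and the absence of an output projection are assumptions of this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cls_cross_attention(z_q, z, W_q, W_k, W_v, num_heads, bias=0.0):
    """Cross-attention with CLS tokens as queries and patch tokens as keys/values.

    z_q: (T, C) CLS tokens, one per window.
    z:   (T*M, C) patch tokens gathered from all windows.
    Returns the (T, C) attended output and the (num_heads, T, T*M) weights.
    """
    T, C = z_q.shape
    d = C // num_heads                                   # per-head dimension

    def split_heads(x):                                  # (L, C) -> (num_heads, L, d)
        return x.reshape(x.shape[0], num_heads, d).transpose(1, 0, 2)

    q, k, v = split_heads(z_q @ W_q), split_heads(z @ W_k), split_heads(z @ W_v)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d) + bias    # (num_heads, T, T*M)
    attn = softmax(logits, axis=-1)
    out = attn @ v                                       # (num_heads, T, d)
    return out.transpose(1, 0, 2).reshape(T, C), attn

rng = np.random.default_rng(0)
T, M, C, heads = 4, 16, 32, 4                            # illustrative sizes only
z_q = rng.standard_normal((T, C))
z = rng.standard_normal((T * M, C))
W_q, W_k, W_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
out, attn = cls_cross_attention(z_q, z, W_q, W_k, W_v, num_heads=heads)
print(out.shape, attn.shape)                             # (4, 32) (4, 4, 64)
```

Because there are only $T$ queries, the attention matrix has $T \times TM$ entries per head, which is what keeps this global pathway cheaper than attending from every patch token.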

3. Feature Inheritance Module (FIM): Multi-scale Continuity

To preserve hierarchical modeling, where resolutions and window counts change across stages, the Feature Inheritance Module (FIM) merges the previous stage’s CLS tokens into the new ones. Let $T_{\text{old}}=H\times W$ and $T_{\text{new}}=(H/2)\times(W/2)$ be the numbers of supertokens before and after the stage transition, with $\mathrm{ori\_cls}$ and $\mathrm{new\_cls}$ as old and new supertokens, respectively:

a) Downsample:

$\mathrm{ori\_cls}$ is reshaped and max-pooled (or convolved) to match the spatial shape of the next stage.

b) Merge:

Concatenate pooled $\mathrm{ori\_cls}$ with $\mathrm{new\_cls}$ along the channel dimension: $\tilde{\mathrm{cls}} = \mathrm{concat}[\mathrm{pooled\_ori\_cls},\,\mathrm{new\_cls}]$.

c) Projection:

Apply a linear projection $W_c$ to form the updated new CLS tokens: $\mathrm{cls\_out} = \tilde{\mathrm{cls}} W_c$.

This mechanism ensures newly introduced CLS supertokens inherit multi-scale context, yielding robustness in hierarchical feature aggregation.
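The downsample, merge, and projection steps can be sketched as below. This is only an illustration under stated assumptions: it uses the max-pooling variant with a 2×2 kernel and stride 2, assumes CLS tokens are stored row-major over the window grid, and uses hypothetical helper names (`pool_cls`, `feature_inheritance`) and random weights.

```python
import numpy as np

def pool_cls(ori_cls, grid_hw):
    """2x2 max-pool CLS tokens laid out row-major on an (H, W) window grid.

    ori_cls: (H*W, C_old)  ->  returns (H/2 * W/2, C_old).
    """
    H, W = grid_hw
    C_old = ori_cls.shape[1]
    assert ori_cls.shape[0] == H * W and H % 2 == 0 and W % 2 == 0
    x = ori_cls.reshape(H // 2, 2, W // 2, 2, C_old)     # expose each 2x2 block
    return x.max(axis=(1, 3)).reshape(-1, C_old)

def feature_inheritance(ori_cls, new_cls, grid_hw, W_c):
    """Merge pooled old CLS tokens into the new ones and project with W_c."""
    pooled = pool_cls(ori_cls, grid_hw)                  # (T_new, C_old)
    merged = np.concatenate([pooled, new_cls], axis=1)   # (T_new, C_old + C_new)
    return merged @ W_c                                  # (T_new, C_new)

rng = np.random.default_rng(0)
H, W, C_old, C_new = 4, 4, 8, 16                         # illustrative sizes only
ori_cls = rng.standard_normal((H * W, C_old))
new_cls = rng.standard_normal(((H // 2) * (W // 2), C_new))
W_c = rng.standard_normal((C_old + C_new, C_new))
cls_out = feature_inheritance(ori_cls, new_cls, (H, W), W_c)
print(cls_out.shape)                                     # (4, 16)
```

The output has one token per window of the new stage, with the channel width of that stage, so it can replace `new_cls` directly.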

4. Spatial-Channel Feedforward Network (SCFFN)

The conventional two-layer MLP feed-forward network (FFN) in transformers is replaced in TT with a lightweight Spatial-Channel FFN:

a) Spatial Mixing:

Apply an MLP along the sequence (token) dimension: $x' = \mathrm{MLP}(\mathrm{LN}(x)) + x$.

b) Channel Mixing:

Transpose to $(B,C,N)$, where $B$ here denotes the batch size and $N$ the number of tokens, and apply a $1\times1$ convolution with residual: $x'' = \mathrm{Conv}_{1\times1}(\mathrm{LN}(\mathrm{trans}(x'))) + \mathrm{trans}(x')$.

c) Restore Output Shape:

Transpose back and concatenate with CLS token output.

The SCFFN incorporates both spatial and channel mixing with no additional parameters compared to a standard FFN, as the $1\times1$ convolution replaces the second MLP layer.
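A rough single-sample sketch of the two mixing steps follows. Several details are not specified in the text above and are assumptions of this example: a ReLU activation, bias-free layers, LayerNorm without learnable affine parameters applied over the channel axis, and no special handling of the CLS token. The $1\times1$ convolution over a $(C, N)$ layout is written as the equivalent per-token matrix product on the $(N, C)$ layout, so the explicit transposes for the channel step are not needed.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without learnable scale or shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def scffn(x, W1, W2, W_conv):
    """Spatial-channel FFN sketch for one sample.

    x:      (N, C) tokens.
    W1, W2: (N, hidden) and (hidden, N) weights of the token-axis MLP.
    W_conv: (C, C) weights of the 1x1 convolution (per-token channel mixing).
    """
    # a) Spatial mixing: the MLP acts along the token axis, then a residual.
    h = layer_norm(x).T                       # (C, N)
    h = np.maximum(h @ W1, 0.0) @ W2          # (C, N), ReLU assumed
    x1 = h.T + x                              # (N, C)
    # b) Channel mixing: a 1x1 conv applies the same linear map at every token.
    x2 = layer_norm(x1) @ W_conv + x1         # (N, C)
    return x2

rng = np.random.default_rng(0)
N, C, hidden = 16, 8, 32                      # illustrative sizes only
x = rng.standard_normal((N, C))
W1 = rng.standard_normal((N, hidden)) * 0.1
W2 = rng.standard_normal((hidden, N)) * 0.1
W_conv = rng.standard_normal((C, C)) * 0.1
y = scffn(x, W1, W2, W_conv)
print(y.shape)                                # (16, 8)
```

With all weights set to zero, both residual branches vanish and the block reduces to the identity, which is a convenient sanity check for the residual wiring.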

5. Computational Trade-offs and Complexity

Relative to Swin Transformer, TT introduces minimal computational overhead:

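- The addition of a CLS token increases each window's token count from $M$ to $M+1$, incurring an extra $O(N_w M C)$ cost, where $N_w$ is the number of windows (the $T$ of Section 2).
- Cross-attention between the supertokens and all patch tokens costs $O(N_w^2 M C)$, with $N_w \approx \frac{HW}{M}$. Like global attention it grows quadratically in the number of windows, but it is a factor of $M$ below the $O(N_w^2 M^2 C)$ cost of full self-attention over all $N_w M$ tokens.
- Empirically, TT-Tiny records 3.9 GFLOPs versus 4.5 GFLOPs for Swin-T, with a small parameter increase of about 1 to 2 M.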

This efficiency enables TT to scale effectively on standard vision tasks without prohibitive resource demands.
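To make the asymptotic terms concrete, the short calculation below counts multiply-accumulates for the attention score matrix $QK^\top$ only. The configuration (a $56\times56$ token map, $7\times7$ windows, $C=96$) is an illustrative assumption, not a setting quoted from the TT paper, and constant factors such as the value aggregation and projections are ignored.

```python
def attention_score_macs(num_queries, num_keys, channels):
    """Multiply-accumulates needed to form the QK^T score matrix alone."""
    return num_queries * num_keys * channels

# Illustrative configuration (assumed for this example).
H = W = 56                      # feature map size in tokens
win = 7                         # window side length
C = 96                          # channel width
M = win * win                   # tokens per window
N_w = (H * W) // M              # number of windows

# Window attention with and without the extra CLS token in each window.
window_plain = N_w * attention_score_macs(M, M, C)
window_cls = N_w * attention_score_macs(M + 1, M + 1, C)
cls_overhead = window_cls - window_plain          # equals N_w * (2M + 1) * C

# CLS cross-attention: N_w queries against all N_w * M patch tokens.
cross = attention_score_macs(N_w, N_w * M, C)     # equals N_w^2 * M * C

# Full global self-attention over every token, for comparison.
global_full = attention_score_macs(N_w * M, N_w * M, C)

print(f"M = {M}, N_w = {N_w}")
print(f"CLS overhead in window attention: {cls_overhead:,} MACs")
print(f"CLS cross-attention:              {cross:,} MACs")
print(f"Full global attention:            {global_full:,} MACs")
print(f"global / cross ratio:             {global_full // cross}")
```

In this count the ratio between full global attention and CLS cross-attention is exactly $M$, matching the difference between $O(N_w^2 M^2 C)$ and $O(N_w^2 M C)$.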

6. Experimental Benchmarks and Empirical Findings

On ImageNet-1k at $224\times224$ resolution:

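| Model | Params (M) | FLOPs (G) | Top-1 Acc (%) | $\Delta$ Top-1 vs Swin (points) |
|-------|------------|-----------|---------------|---------------------------------|
| TT-T  | 25         | 3.9       | 82.5          | +1.2                            |
| TT-S  | 47         | 7.7       | 83.7          | +0.7                            |
| TT-B  | 86         | 14.6      | 84.2          | +0.7                            |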

Further results:

COCO (Mask R-CNN 3×):
- Swin-T: $\mathrm{AP}^b = 46.0$, $\mathrm{AP}^m = 41.6$
- TT-T: $\mathrm{AP}^b = 47.2$, $\mathrm{AP}^m = 42.6$

ADE20K (UperNet):
- Swin-T: mIoU = 44.5
- TT-T: mIoU = 46.3

Ablations reveal:

- Replacing CLS cross-attention with shifted-window attention (SW-MSA) reduces Top-1 accuracy by 0.5 points.
- SCFFN outperforms the standard FFN on both CLS and patch tokens by 0.3 points.
- Removing FIM between stages lowers Top-1 accuracy by 0.3 points.

These findings indicate that window-based supertokens are critical for the improved performance and efficiency of TT (Mao et al., 2022).

7. Visualization and Interpretability

Architectural visualizations (see Fig. 2 of Mao et al., 2022) show the hierarchical structure: PatchEmbed $\to$ TokenTransformer Blocks (with CLS & SCFFN) $\to$ FIM at stage boundaries. CLS Attention is depicted (Fig. 3a) as local window-based self-attention followed by multi-head cross-attention from supertokens, illustrating their role in mediating window interactions. FIM’s flow (Fig. 4) highlights reshaping, downsampling, and merging of old and new CLS tokens. Attention maps (Fig. 7) show CLS tokens attending to semantically distant image regions, supporting their effectiveness in modeling global dependencies.

A plausible implication is that window-based supertokens offer a general solution for extending local attention mechanisms with global modeling capabilities, maintaining efficiency and scalability within hierarchical transformer pipelines.


References

Mao et al. (2022). Token Transformer: Can class token help window-based transformer build better long-range interactions?
