Window-Based Supertokens in Transformers
- Window-based supertokens are learnable CLS tokens prepended to each window that summarize local patch information for efficient global interactions.
- The CLS attention mechanism employs local window encoding followed by cross-window aggregation to capture long-range dependencies via multi-head self-attention.
- The Feature Inheritance Module and Spatial-Channel FFN ensure multi-scale context integration and computational efficiency, leading to robust performance in vision tasks.
Window-based supertokens refer to the learnable “CLS” tokens introduced into each window of a window-based transformer architecture, specifically in the Token Transformer (TT) model. These CLS tokens, termed supertokens, serve as window-level summaries and enable efficient long-range interactions within a hierarchical vision transformer framework. This mechanism enhances the vanilla window-based transformer by injecting a minimal-cost global pathway, thereby improving modeling of relationships across distant regions, while preserving efficiency and scalability (Mao et al., 2022).
1. Supertokens: Window-local Summary and Interaction
In the TT architecture, feature maps are decomposed into non-overlapping windows, each grouping a set of patch tokens. A learnable CLS token is prepended to each window, so every window sequence consists of one CLS token followed by that window's patch tokens. These CLS tokens are trained end-to-end and are uniquely assigned to window positions within every stage. After one round of multi-head window-based self-attention (W-MSA), each CLS token summarizes its local window. These supertokens then carry the cross-window interaction in subsequent processing stages.
This organizational scheme retains the core computational benefits of window-based attention while providing each window with a condensed representation for downstream aggregation.
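The window layout described above can be sketched in NumPy. This is only an illustration under stated assumptions: the function name `partition_with_cls`, the square window of side `window`, and the zero-initialized stand-in for the learned CLS parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def partition_with_cls(feat, window, cls_tokens):
    """Split an (H, W, C) feature map into non-overlapping windows and
    prepend one CLS token to each window.

    feat:       (H, W, C) patch-token feature map
    window:     side length M of each square window (assumed to divide H and W)
    cls_tokens: (num_windows, C) tokens, one per window position
    returns:    (num_windows, M*M + 1, C), CLS token first in each window
    """
    H, W, C = feat.shape
    M = window
    assert H % M == 0 and W % M == 0, "sketch assumes no padding is needed"
    # (H/M, M, W/M, M, C) -> (H/M, W/M, M, M, C) -> (num_windows, M*M, C)
    x = feat.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape(-1, M * M, C)
    assert cls_tokens.shape == (x.shape[0], C)
    # Put the CLS token at position 0 of every window sequence.
    return np.concatenate([cls_tokens[:, None, :], x], axis=1)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))
cls = np.zeros((4, 16))  # stand-in for learned parameters
tokens = partition_with_cls(feat, window=4, cls_tokens=cls)
print(tokens.shape)  # (4, 17, 16)
```

Each of the four windows holds 16 patch tokens plus its CLS token, so W-MSA would run on sequences of length 17 in this toy configuration.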
2. CLS Attention: Mechanism for Long-range Dependencies
CLS Attention in TT constitutes the principal component for long-range token interaction. The process involves two main steps:
a) Local Window Encoding:
Each window’s tokens, including the prepended CLS token, undergo W-MSA, with LayerNorm (LN) applied to the window sequence. The updated CLS token encodes local spatial information.
b) Cross-window Aggregation:
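The CLS tokens collected from all windows are concatenated into a single sequence $C$ and serve as queries in multi-head cross-attention against the set of patch tokens $X$. In standard scaled dot-product form (symbol names chosen here for readability), the cross-attention follows: $Q = C W_Q$, $K = X W_K$, $V = X W_V$, with $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(QK^{\top}/\sqrt{d/h} + B\right)V$, where $d$ is the channel dimension, $h$ is the number of heads, and $B$ is a shared bias. This step enables each CLS token to attend globally, capturing dependencies across all windows. Algorithmic implementations are detailed in Algorithm 1 of the referenced paper.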
CLS Attention thus establishes a low-cost, high-capacity pathway for modeling global interactions in hierarchical window-based architectures.
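The cross-window aggregation step (b) can be illustrated with a small NumPy sketch. It assumes standard scaled dot-product multi-head attention; the function name `cls_cross_attention`, the projection matrices, the omission of an output projection, and the shape chosen for the shared bias are assumptions made for illustration, not details confirmed by the paper. Step (a), the local W-MSA, is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cls_cross_attention(cls, patches, Wq, Wk, Wv, num_heads, bias=None):
    """Supertokens attend to every patch token of the feature map.

    cls:     (Nw, d) one supertoken per window, used as queries
    patches: (L, d)  all patch tokens, used as keys and values
    Wq, Wk, Wv: (d, d) projection matrices (hypothetical parameters)
    bias:    optional (Nw, L) additive bias shared across heads (assumed shape)
    returns: (updated supertokens (Nw, d), attention weights (heads, Nw, L))
    """
    Nw, d = cls.shape
    L = patches.shape[0]
    assert d % num_heads == 0
    dh = d // num_heads
    # Project, then split channels into heads: (heads, tokens, dh).
    q = (cls @ Wq).reshape(Nw, num_heads, dh).transpose(1, 0, 2)
    k = (patches @ Wk).reshape(L, num_heads, dh).transpose(1, 0, 2)
    v = (patches @ Wv).reshape(L, num_heads, dh).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)  # (heads, Nw, L)
    if bias is not None:
        scores = scores + bias  # broadcasts over the head axis
    attn = softmax(scores, axis=-1)
    out = attn @ v  # (heads, Nw, dh)
    # Merge heads back into the channel axis.
    return out.transpose(1, 0, 2).reshape(Nw, d), attn

rng = np.random.default_rng(0)
d, Nw, L, h = 16, 4, 64, 4
cls = rng.standard_normal((Nw, d))
patches = rng.standard_normal((L, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out, attn = cls_cross_attention(cls, patches, Wq, Wk, Wv, num_heads=h)
print(out.shape, attn.shape)  # (4, 16) (4, 4, 64)
```

Because only the supertokens act as queries, the score tensor has one row per window rather than one row per patch token, which is where the cost saving over full self-attention comes from.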
3. Feature Inheritance Module (FIM): Multi-scale Continuity
To preserve hierarchical modeling, where resolutions and window counts change across stages, the Feature Inheritance Module (FIM) merges the previous stage’s CLS tokens into the new ones. Let $C_{\text{old}}$ and $C_{\text{new}}$ denote the old and new supertokens, respectively:
- Downsample: $C_{\text{old}}$ is reshaped and max-pooled (or convolved) to match the spatial shape of the next stage.
- Merge: Concatenate the pooled $C_{\text{old}}$ with $C_{\text{new}}$ along the channel dimension: $\tilde{C} = \mathrm{Concat}(\mathrm{Pool}(C_{\text{old}}), C_{\text{new}})$.
- Projection: Apply a linear projection to form the updated new CLS tokens: $C_{\text{new}}' = \mathrm{Linear}(\tilde{C})$.
This mechanism ensures newly introduced CLS supertokens inherit multi-scale context, yielding robustness in hierarchical feature aggregation.
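The three FIM steps can be sketched as follows. The sketch assumes that the window grid halves along each side between stages, so a 2×2 max-pool aligns old and new supertokens; that pooling factor, the function name `feature_inheritance`, and the projection parameters `W_proj` and `b_proj` are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def feature_inheritance(cls_old, cls_new, grid_old, W_proj, b_proj):
    """Merge previous-stage supertokens into the new stage's supertokens.

    cls_old:  (Hw*Ww, C_old) old supertokens, row-major over an (Hw, Ww) window grid
    cls_new:  ((Hw//2)*(Ww//2), C_new) freshly introduced supertokens
    grid_old: (Hw, Ww), both assumed even
    W_proj:   (C_old + C_new, C_new), b_proj: (C_new,)
    returns:  ((Hw//2)*(Ww//2), C_new) updated new supertokens
    """
    Hw, Ww = grid_old
    C_old = cls_old.shape[1]
    # Downsample: reshape to the window grid and take a 2x2 max-pool.
    blocks = cls_old.reshape(Hw // 2, 2, Ww // 2, 2, C_old)
    pooled = blocks.max(axis=(1, 3)).reshape(-1, C_old)
    # Merge: concatenate along the channel dimension.
    merged = np.concatenate([pooled, cls_new], axis=1)
    # Projection: map back to the new stage's channel width.
    return merged @ W_proj + b_proj

rng = np.random.default_rng(0)
cls_old = rng.standard_normal((16, 8))   # 4x4 window grid, 8 channels
cls_new = rng.standard_normal((4, 16))   # 2x2 window grid, 16 channels
W_proj = rng.standard_normal((24, 16)) / np.sqrt(24)
updated = feature_inheritance(cls_old, cls_new, (4, 4), W_proj, np.zeros(16))
print(updated.shape)  # (4, 16)
```

The text also allows a convolution in place of max-pooling for the downsampling step; only the pooling variant is shown here.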
4. Spatial-Channel Feedforward Network (SCFFN)
The conventional two-layer MLP feed-forward network (FFN) in transformers is replaced in TT with a lightweight Spatial-Channel FFN:
a) Spatial Mixing:
Apply an MLP along the sequence (token) dimension.
b) Channel Mixing:
Transpose the token and channel axes, then apply a convolution with a residual connection.
c) Restore Output Shape:
Transpose back and concatenate with the CLS token output.
The SCFFN incorporates both spatial and channel mixing with no additional parameters compared to a standard FFN, as the convolution replaces the second MLP layer.
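A rough NumPy sketch of the three SCFFN steps for a single window is given below. Several details are assumptions, since the text does not fix them: the spatial MLP is a single linear layer with GELU, the convolution is taken to have kernel size 1 (so it reduces to a matrix product over channels), and the CLS token is passed through unchanged. The function name `scffn` and all weight names are hypothetical, and the sketch does not try to reproduce the parameter count of the original design.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def scffn(tokens, W_spatial, b_spatial, W_channel, b_channel):
    """Spatial-then-channel mixing for one window.

    tokens:    (N + 1, C) window sequence with the CLS token first
    W_spatial: (N, N), b_spatial: (N,)  mixing across the N patch positions
    W_channel: (C, C), b_channel: (C,)  kernel-size-1 convolution over channels (assumed)
    returns:   (N + 1, C)
    """
    cls, x = tokens[:1], tokens[1:]  # (1, C), (N, C)
    # a) Spatial mixing: linear layer + activation along the token axis.
    x = gelu(W_spatial @ x + b_spatial[:, None])  # (N, C)
    # b) Channel mixing: transpose to (C, N), pointwise convolution, residual.
    xt = x.T
    xt = xt + (W_channel @ xt + b_channel[:, None])
    # c) Restore (N, C) and put the CLS token back in front
    #    (the CLS token is left untouched in this sketch).
    return np.concatenate([cls, xt.T], axis=0)

rng = np.random.default_rng(0)
N, C = 16, 8
tokens = rng.standard_normal((N + 1, C))
out = scffn(
    tokens,
    W_spatial=rng.standard_normal((N, N)) / np.sqrt(N),
    b_spatial=np.zeros(N),
    W_channel=rng.standard_normal((C, C)) / np.sqrt(C),
    b_channel=np.zeros(C),
)
print(out.shape)  # (17, 8)
```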
5. Computational Trade-offs and Complexity
Relative to Swin Transformer, TT introduces minimal computational overhead:
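- Adding a CLS token increases each window's token count from $N$ to $N + 1$, where $N$ is the number of patch tokens per window, so each window's attention map grows from $N^2$ to $(N + 1)^2$ entries.
- Cross-attention between the supertokens and all patch tokens scales with the number of windows times the total number of patch tokens. With one supertoken per window, this is a factor of $N$ fewer query-key pairs than global attention over all tokens, while still giving each supertoken a global view.
- Empirically, TT-Tiny records 3.9 GFLOPs versus 4.5 GFLOPs for Swin-T, with a small parameter increase of at most about 2 M.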
This efficiency enables TT to scale effectively on standard vision tasks without prohibitive resource demands.
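To make the trade-off concrete, the following sketch counts query-key pairs for the attention patterns discussed above. It ignores the channel dimension, heads, and projections, and the example configuration (64 windows of 49 tokens, as in a 56×56 feature map with 7×7 windows) is an assumed Swin-like setting, not a figure quoted from the paper.

```python
def attention_pairs(num_windows, tokens_per_window):
    """Count query-key pairs for several attention patterns."""
    Nw, N = num_windows, tokens_per_window
    L = Nw * N  # total number of patch tokens
    return {
        "window": Nw * N * N,                  # plain W-MSA
        "window_with_cls": Nw * (N + 1) ** 2,  # W-MSA with one CLS token per window
        "cls_cross": Nw * L,                   # supertoken queries vs. all patch tokens
        "global": L * L,                       # full self-attention over all patch tokens
    }

counts = attention_pairs(num_windows=64, tokens_per_window=49)
for name, value in counts.items():
    print(f"{name:>16}: {value:,}")
```

In this configuration the CLS cross-attention adds 200,704 query-key pairs, compared with 9,834,496 for full global attention, and the extra CLS token raises the windowed count from 153,664 to 160,000.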
6. Experimental Benchmarks and Empirical Findings
On ImageNet-1k at $224 \times 224$ resolution:
| Model | Params (M) | FLOPs (G) | Top-1 Acc. (%) | Δ Top-1 vs. Swin (pts) |
|---|---|---|---|---|
| TT-T | 25 | 3.9 | 82.5 | +1.2 |
| TT-S | 47 | 7.7 | 83.7 | +0.7 |
| TT-B | 86 | 14.6 | 84.2 | +0.7 |
Further results:
- COCO (Mask R-CNN 3×):
  - Swin-T: $AP^{box}$ = 46.0, $AP^{mask}$ = 41.6
  - TT-T: $AP^{box}$ = 47.2, $AP^{mask}$ = 42.6
- ADE20K (UperNet):
  - Swin-T: mIoU = 44.5
  - TT-T: mIoU = 46.3
Ablations reveal:
- Replacing CLS cross-attention with SW-MSA reduces Top-1 accuracy by 0.5 percentage points.
- SCFFN outperforms the standard FFN on both CLS and patch tokens by 0.3 percentage points.
- Excluding FIM at the stage boundaries lowers Top-1 accuracy by 0.3 percentage points.
These findings indicate that window-based supertokens are critical for the improved performance and efficiency of TT (Mao et al., 2022).
7. Visualization and Interpretability
Architectural visualizations (see Fig. 2 of Mao et al., 2022) demonstrate the hierarchical structure: PatchEmbed → TokenTransformer Blocks (with CLS & SCFFN) → FIM at stage boundaries. CLS Attention is depicted (Fig. 3a) as local window-based self-attention followed by multi-head cross-attention from supertokens, affirming their role in mediating window interactions. FIM’s flow (Fig. 4) highlights reshaping, downsampling, and merging of old and new CLS tokens. Attention maps (Fig. 7) show CLS tokens attending to semantically distant image regions, corroborating their effectiveness in modeling global dependencies.
A plausible implication is that window-based supertokens offer a general solution for extending local attention mechanisms with global modeling capabilities, maintaining efficiency and scalability within hierarchical transformer pipelines.