Interaction-Centric Tokens (ICT) in Vision Transformers

Updated 28 May 2026

Interaction-Centric Tokens (ICT) are architectural components that concentrate local patch interactions into Super tokens for efficient global context exchange.
The Super Token Transformer (STT) framework employs window-based self-attention for local detail aggregation and a separable-convolution mixer that keeps the cost of global interaction low.
Empirical analysis on ImageNet shows that the ICT paradigm improves throughput and parameter efficiency compared to full-attention models.

Interaction-Centric Tokens (ICT) are a class of architectural components designed to efficiently centralize and mediate inter-token interactions within Transformer-based models for computer vision. The Super Token Transformer (STT), introduced by Farooq et al., formalizes the ICT concept through "Super tokens," which summarize local patch information and concentrate all global dependencies before redistributing them back to the patch level. This approach enables strict separation between local and global modeling, significantly reducing computational complexity while preserving isotropy and fine-grained representational capacity (Farooq et al., 2021).

1. Definition of Super Tokens as Interaction-Centric Tokens

Let an input image $x \in \mathbb{R}^{H \times W \times C}$ pass through a convolutional stem to yield a feature map $X^f \in \mathbb{R}^{h \times w \times d}$, which is then flattened into $N = hw$ patch tokens $P = \{p_1, ..., p_N\}$ with $p_i \in \mathbb{R}^d$. Patch tokens are partitioned into $W = N/M^2$ non-overlapping spatial windows, each of size $M \times M$ (from here on, $W$ denotes the number of windows rather than the image width). Every window $i$ is assigned a unique learnable vector $s_i \in \mathbb{R}^d$, the "Super token." At each layer $\ell$ of the encoder, the Super token and its associated patch tokens form the window input $X_i^\ell = [s_i^\ell; p_{i,1}^\ell; \dots; p_{i,M^2}^\ell] \in \mathbb{R}^{(M^2+1) \times d}$.
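The window construction described above can be sketched in NumPy. This is only an illustration under stated assumptions: the sizes (`h = w = 28`, `M = 7`, `d = 8`), the row-major ordering of windows, and the random values standing in for learned Super tokens are not taken from the source, and `partition_with_super_tokens` is a hypothetical helper name.

```python
import numpy as np

def partition_with_super_tokens(feature_map, super_tokens, M):
    """Split an (h, w, d) feature map into M x M windows and prepend one
    Super token to each window.

    Returns an array of shape (W, M*M + 1, d) with W = (h // M) * (w // M).
    """
    h, w, d = feature_map.shape
    if h % M or w % M:
        raise ValueError("h and w must be divisible by the window size M")
    # (h/M, M, w/M, M, d) -> (h/M, w/M, M, M, d) -> (W, M*M, d)
    windows = feature_map.reshape(h // M, M, w // M, M, d).transpose(0, 2, 1, 3, 4)
    windows = windows.reshape(-1, M * M, d)
    if super_tokens.shape != (windows.shape[0], d):
        raise ValueError("expected one d-dimensional Super token per window")
    # Row 0 of every window input is that window's Super token.
    return np.concatenate([super_tokens[:, None, :], windows], axis=1)

# Illustrative sizes only (assumptions, not values from the source).
rng = np.random.default_rng(0)
h = w = 28
d, M = 8, 7
X_f = rng.standard_normal((h, w, d))
S = rng.standard_normal(((h // M) * (w // M), d))  # learnable in a real model
X = partition_with_super_tokens(X_f, S, M)
print(X.shape)  # (16, 50, 8)
```

Each of the 16 windows holds its Super token in row 0, followed by its 49 patch tokens.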

The Super token $s_i$ is both a local aggregator (summarizing patch information within its window) and the central mediator for global, cross-window interactions. After every local update, all Super tokens are collected for global mixing, with the results fed back in subsequent steps. This token-centric design ensures that dependencies between patches in different windows, and with them the global interactions that would otherwise be most expensive, are concentrated in $W$ entities, making Super tokens strict instantiations of the ICT paradigm (Farooq et al., 2021).

2. Local Aggregation via Window-based Self-Attention

Within each window, local modeling is achieved through window-based multi-head self-attention (WMSA). Each window input $X_i^\ell$ (the Super token stacked on its $M^2$ patch tokens) undergoes standard QKV projections, attention computation, and an MLP-based feed-forward network (FFN), including per-channel LayerScale parameters and LayerNorm. In the usual pre-norm residual form, with LayerScale vectors $\lambda_1, \lambda_2 \in \mathbb{R}^d$, this reads:

$\hat{X}_i^\ell = X_i^\ell + \lambda_1 \odot \mathrm{WMSA}(\mathrm{LN}(X_i^\ell))$

$\tilde{X}_i^\ell = \hat{X}_i^\ell + \lambda_2 \odot \mathrm{FFN}(\mathrm{LN}(\hat{X}_i^\ell))$

The updated Super token for window $i$ is the first row of the block output $\tilde{X}_i^\ell$, i.e., $\tilde{s}_i^\ell = \tilde{X}_i^\ell[0]$. Intra-window dependencies are thus fully integrated into the Super token at each layer.

This structure ensures retention of "fine-grained local detail" without patch merging, enabling per-window full-attention while minimizing global computational overhead (Farooq et al., 2021).
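A minimal NumPy sketch of such a local block follows. It makes several simplifying assumptions that are not from the source: a single attention head instead of multiple heads, LayerNorm without learned scale and shift, ReLU in place of whatever activation the FFN actually uses, and random weights. The function name `window_block` and all parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def window_block(X, Wq, Wk, Wv, Wo, W1, W2, lam1, lam2):
    """Pre-norm attention + FFN applied independently to every window.

    X: (W, T, d) with T = M*M + 1; row 0 of each window is the Super token.
    lam1, lam2: per-channel LayerScale vectors of shape (d,).
    """
    d = X.shape[-1]
    Xn = layer_norm(X)
    Q, K, V = Xn @ Wq, Xn @ Wk, Xn @ Wv
    # Scores have shape (W, T, T): tokens attend only within their own window.
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d))
    X = X + lam1 * ((A @ V) @ Wo)
    H = np.maximum(layer_norm(X) @ W1, 0.0)  # ReLU for brevity (assumption)
    return X + lam2 * (H @ W2)

rng = np.random.default_rng(0)
n_win, T, d = 16, 50, 8  # illustrative sizes
X = rng.standard_normal((n_win, T, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1
lam1 = lam2 = np.full(d, 0.1)

out = window_block(X, Wq, Wk, Wv, Wo, W1, W2, lam1, lam2)
updated_super_tokens = out[:, 0, :]  # first row of every window
print(out.shape, updated_super_tokens.shape)  # (16, 50, 8) (16, 8)
```

Because the attention scores are computed per window, changing the tokens of one window leaves the outputs of every other window untouched at this stage.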

3. Global Dependency Modeling with Super Token Mixer

Following local aggregation, all Super tokens are assembled into a matrix $S^\ell \in \mathbb{R}^{W \times d}$, one row per window. The global interaction step applies a separable-convolution "Super Token Mixer" (STM), combining depth-wise convolutions over tokens and point-wise convolutions over channels. In schematic form:

$\hat{S}^\ell = \mathrm{DWConv}(S^\ell; K_{dw})$

$S^{\ell+1} = \mathrm{PWConv}(\hat{S}^\ell; K_{pw})$

where $K_{dw}$ and $K_{pw}$ are learnable convolutional kernels. This global mixing is applied only to the $W$ Super tokens, decoupling the cost of global interaction from $N$, the potentially large number of patch tokens.

Super tokens thus serve as the minimal "interaction carriers" (in the ICT sense), concentrating all inter-patch dependencies and relational information necessary for downstream global representation learning (Farooq et al., 2021).
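A rough NumPy sketch of a separable mixer over Super tokens is given below. It is an illustration, not the authors' implementation: it assumes the Super tokens are treated as a 1D sequence, uses zero padding and an odd kernel length, and omits any normalization, activation, or residual connection. The names `super_token_mixer`, `k_tok`, and `k_ch` are hypothetical.

```python
import numpy as np

def super_token_mixer(S, k_tok, k_ch):
    """Separable mixing over the Super tokens only.

    S:     (W, d) matrix of Super tokens, one row per window.
    k_tok: (k, d) depth-wise kernel: one length-k filter per channel,
           slid along the token axis (k must be odd).
    k_ch:  (d, d) point-wise kernel mixing channels at each token.
    """
    W, d = S.shape
    k = k_tok.shape[0]
    if k % 2 == 0:
        raise ValueError("kernel length k must be odd")
    pad = k // 2
    Sp = np.pad(S, ((pad, pad), (0, 0)))  # zero padding keeps W outputs
    mixed = np.empty_like(S)
    for t in range(W):
        # Depth-wise step: each channel is filtered independently.
        mixed[t] = (Sp[t:t + k] * k_tok).sum(axis=0)
    # Point-wise step: channels are mixed at every token position.
    return mixed @ k_ch

rng = np.random.default_rng(0)
n_super, d, k = 16, 8, 3  # illustrative sizes
S = rng.standard_normal((n_super, d))
k_tok = rng.standard_normal((k, d))
k_ch = rng.standard_normal((d, d))
S_mixed = super_token_mixer(S, k_tok, k_ch)
print(S_mixed.shape)  # (16, 8)
```

The work done here depends only on the number of Super tokens, the kernel length, and the embedding dimension; the number of patch tokens never appears.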

4. Architectural Details: Isotropy, Layer Budget, and Token Flow

The Super Token Transformer (STT) maintains isotropy (a constant embedding dimension $d$) across all layers. After the initial convolutional stem, which has stride 8 and therefore produces $N = hw$ patch tokens with $h$ and $w$ one eighth of the input height and width, windows of size $M \times M$ yield $W = N/M^2$ Super tokens. Each layer consists of two sub-blocks: a local WMSA+FFN operating on $M^2 + 1$ tokens per window, followed by STM for all $W$ Super tokens. Memory usage maintains $N + W$ tokens after each layer, with $W \ll N$.
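The token bookkeeping can be checked with a few lines of Python. The stride of 8 comes from the text; the 224-pixel input and the window size `M = 7` are assumptions chosen only to make the arithmetic concrete, and `token_budget` is a hypothetical helper.

```python
def token_budget(image_size, stride, M):
    """Count tokens for a square input, a strided stem, and M x M windows."""
    h = w = image_size // stride
    if h % M:
        raise ValueError("feature map side must be divisible by M")
    N = h * w            # patch tokens
    W = N // (M * M)     # windows, hence Super tokens
    return {
        "h": h,
        "w": w,
        "N": N,
        "W": W,
        "tokens_per_window": M * M + 1,  # patches plus one Super token
        "total_tokens": N + W,           # kept after every layer
    }

budget = token_budget(image_size=224, stride=8, M=7)
print(budget)
# {'h': 28, 'w': 28, 'N': 784, 'W': 16, 'tokens_per_window': 50, 'total_tokens': 800}
```

Under these assumed settings the Super tokens add only 16 tokens to the 784 patch tokens.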

In the final classification block, all $N$ patch tokens are discarded: global reasoning and the class embedding are learned with the $W$ Super tokens and one CLS token using two standard transformer layers (Farooq et al., 2021).
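The hand-off to the classification stage amounts to a slicing step, sketched here in NumPy. The layout (Super token in row 0 of each window) and the helper name `classifier_input` are assumptions made for illustration, and the two transformer layers themselves are not shown.

```python
import numpy as np

def classifier_input(window_tokens, cls_token):
    """Build the token sequence seen by the classification layers.

    window_tokens: (W, M*M + 1, d); row 0 of each window is its Super token.
    cls_token:     (d,) class token.
    Returns (W + 1, d): the CLS token followed by the W Super tokens.
    """
    super_tokens = window_tokens[:, 0, :]  # patch tokens (rows 1 onward) are dropped
    return np.concatenate([cls_token[None, :], super_tokens], axis=0)

rng = np.random.default_rng(0)
window_tokens = rng.standard_normal((16, 50, 8))  # illustrative sizes
cls_token = rng.standard_normal(8)
seq = classifier_input(window_tokens, cls_token)
print(seq.shape)  # (17, 8)
```

With these illustrative sizes, full self-attention in the classifier runs over 17 tokens instead of several hundred.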

5. Computational Complexity and Throughput Analysis

STT's local module (WMSA) incurs an attention cost of $O(N M^2 d)$ per layer, as does windowed self-attention in other frameworks. The STM's cost scales with the number of Super tokens $W$ rather than with $N$, and the classification MSA costs $O(W^2 d)$. With representative settings, in which $M^2 \ll N$ and $W \ll N$, the cost is dominated by the $O(N M^2 d)$ local term, which is significantly less than the $O(N^2 d)$ of full-image MSA.
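A back-of-the-envelope count makes the gap concrete. The sketch counts only the multiply-adds of the two attention matrix products (scores and weighted sum) and ignores projections and the FFN. The settings `N = 784`, `M = 7`, and `d = 384` are assumed for illustration and are not the source's values.

```python
def attention_mult_adds(num_tokens, d):
    """Multiply-adds for Q K^T plus A V over one group of tokens."""
    return 2 * num_tokens ** 2 * d

N, M, d = 784, 7, 384          # assumed illustrative settings
W = N // M ** 2                # 16 windows

full_msa = attention_mult_adds(N, d)               # every patch attends to every patch
local_wmsa = W * attention_mult_adds(M * M + 1, d) # each window attends internally
cls_msa = attention_mult_adds(W + 1, d)            # Super tokens plus CLS

print(full_msa, local_wmsa, cls_msa)
print(round(full_msa / local_wmsa, 1))  # 15.4
```

The ratio does not depend on `d`; under these assumptions the windowed attention is about 15 times cheaper than full-image attention, and the attention over Super tokens is negligible next to either.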

Comparisons are as follows:

Model	Params	Throughput (img/s)	Top-1 Acc. (%)
STT-S25	49M	735	83.5
Swin-B	88M	278	83.3
DeiT-B	87M	291	83.4

This configuration removes most of the quadratic global attention cost. In the table, STT-S25 reaches roughly 2.5 to 2.6 times the throughput of DeiT-B and Swin-B with about 56% of their parameters, at comparable accuracy (Farooq et al., 2021).
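The relative gains can be read directly off the table with a short calculation. The numbers below are copied from the table; nothing else is assumed.

```python
# (parameters in millions, throughput in img/s, top-1 accuracy in %)
models = {
    "STT-S25": (49, 735, 83.5),
    "Swin-B": (88, 278, 83.3),
    "DeiT-B": (87, 291, 83.4),
}

stt_params, stt_tput, stt_acc = models["STT-S25"]
ratios = {}
for name in ("Swin-B", "DeiT-B"):
    params, tput, acc = models[name]
    ratios[name] = {
        "throughput_ratio": round(stt_tput / tput, 2),
        "param_ratio": round(stt_params / params, 2),
        "acc_gain": round(stt_acc - acc, 1),
    }
    print(name, ratios[name])
# Swin-B {'throughput_ratio': 2.64, 'param_ratio': 0.56, 'acc_gain': 0.2}
# DeiT-B {'throughput_ratio': 2.53, 'param_ratio': 0.56, 'acc_gain': 0.1}
```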

6. ICT Properties: Isolation, Efficiency, and Interpretability

Super tokens embody three defining properties of the ICT framework:

Aggregation of local patch interaction: Each Super token compresses all patch-based interactions in its spatial window via WMSA.
Mediation of global cross-window dependencies: Global mixing is limited to the Super tokens in STM, replacing the quadratic $O(N^2)$ cost of full global attention with a much smaller cost that depends only on the $W$ Super tokens.
Contextual redistribution: Updated Super tokens inject global context back into their respective windows in each subsequent layer, completing the interaction loop.
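The three properties above form a loop that a toy example can trace. In the sketch below, uniform averaging stands in for both learned window attention and the learned mixer, so it shows only how information travels, not how STT computes it; `toy_layer` and all sizes are hypothetical.

```python
import numpy as np

def toy_layer(X):
    """X: (W, M*M + 1, d); row 0 of each window is its Super token."""
    # Local step: every token becomes the mean of its own window
    # (uniform attention, a stand-in for WMSA).
    X = np.repeat(X.mean(axis=1, keepdims=True), X.shape[1], axis=1)
    # Global step: Super tokens are averaged with each other
    # (a stand-in for the STM); patch tokens are left untouched.
    X[:, 0, :] = X[:, 0, :].mean(axis=0, keepdims=True)
    return X

n_win, T, d = 4, 5, 1        # 4 windows of 2 x 2 patches plus one Super token each
X = np.zeros((n_win, T, d))
X[0, 3, 0] = 1.0             # a signal placed in one patch of window 0

after_one = toy_layer(X)
after_two = toy_layer(after_one)

# After one layer the Super token of window 3 carries the signal,
# but the patches of window 3 have not seen it yet.
print(after_one[3, 0, 0], after_one[3, 1:, 0])
# After a second layer the patches of window 3 carry it as well.
print(after_two[3, 1:, 0])
```

The signal reaches a distant window's Super token through the global step and its patches only in the following layer, which is the redistribution described in the third property.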

A key distinction is the retention of all patch tokens until the classifier stage, which avoids the loss of spatial resolution found in hierarchical or pyramidal designs. The isotropic nature (constant $d$) further simplifies analysis and model visualization (Farooq et al., 2021).

7. Significance, Performance, and Theoretical Implications

Super tokens operationalize the ICT paradigm by localizing all inter-patch and global dependencies within a minimal set of tokens, granting explicit architectural control over computational costs and global context exchange. Empirical results on ImageNet-1K show accuracy on par with Swin-B and DeiT-B at higher throughput, with fewer parameters than these hierarchical and isotropic full-attention baselines.

This suggests that strict token-centric mediation, as implemented in STT, can achieve favorable trade-offs between efficiency, representational fidelity, and architectural simplicity. The resulting framework is amenable to downstream analysis, interpretability, and potentially broader applications in visual recognition and sequence modeling (Farooq et al., 2021).

References (1)

Farooq et al. (2021). Global Interaction Modelling in Vision Transformer via Super Tokens.
