
Interaction-Centric Tokens (ICT) in Vision Transformers

Updated 28 May 2026
  • Interaction-Centric Tokens (ICT) are architectural components that concentrate local patch interactions into Super tokens for efficient global context exchange.
  • The Super Token Transformer (STT) uses window-based self-attention to aggregate local detail and a separable-convolution mixer to keep the cost of global mixing low.
  • Empirical analysis on ImageNet shows that the ICT paradigm improves throughput and parameter efficiency compared to full-attention models.

Interaction-Centric Tokens (ICT) are a class of architectural components designed to efficiently centralize and mediate inter-token interactions within Transformer-based models for computer vision. The Super Token Transformer (STT), introduced by Farooq et al., formalizes the ICT concept through "Super tokens," which summarize local patch information and concentrate all global dependencies before redistributing them back to the patch level. This approach strictly separates local from global modeling, significantly reducing computational complexity while preserving isotropy and fine-grained representational capacity (Farooq et al., 2021).

1. Definition of Super Tokens as Interaction-Centric Tokens

Let an input image $x \in \mathbb{R}^{H \times W \times C}$ pass through a convolutional stem to yield a feature map $X^f \in \mathbb{R}^{h \times w \times d}$, which is then flattened into $N = hw$ patch tokens $P = \{p_1, \dots, p_N\}$ with $p_i \in \mathbb{R}^d$. Patch tokens are partitioned into $W = N/M^2$ non-overlapping spatial windows (from here on, $W$ counts windows rather than image width), each of size $M \times M$, where every window $i$ is assigned a unique learnable vector $s_i \in \mathbb{R}^d$, the "Super token." At each layer $\ell$ of the encoder, the Super token and its associated patch tokens form the window input $X_i^{\ell} = [s_i^{\ell}; p_{i,1}^{\ell}; \dots; p_{i,M^2}^{\ell}] \in \mathbb{R}^{(M^2+1) \times d}$.

The Super token $s_i$ is both a local aggregator (summarizing patch information within its window) and the central mediator for global, cross-window interactions. After every local update, all Super tokens are collected for global mixing, with the results fed back in subsequent steps. This token-centric design ensures that inter-patch dependencies, and thus all computationally intensive interactions, are concentrated in $W$ entities, making Super tokens strict instantiations of the ICT paradigm (Farooq et al., 2021).
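The window construction described above can be sketched in NumPy. This is only an illustration under assumed conventions: the function name `partition_with_super_tokens`, the row-major window ordering, and the toy sizes are hypothetical and not taken from the source.

```python
import numpy as np

def partition_with_super_tokens(feature_map, super_tokens, M):
    """Split an (h, w, d) feature map into M x M windows and prepend one
    Super token per window.

    Returns an array of shape (num_windows, M*M + 1, d) in which row 0 of
    each window is that window's Super token.
    """
    h, w, d = feature_map.shape
    assert h % M == 0 and w % M == 0, "h and w must be divisible by M"
    # (h/M, M, w/M, M, d) -> (h/M, w/M, M, M, d) -> (num_windows, M*M, d)
    windows = feature_map.reshape(h // M, M, w // M, M, d).transpose(0, 2, 1, 3, 4)
    windows = windows.reshape(-1, M * M, d)
    assert super_tokens.shape == (windows.shape[0], d)
    return np.concatenate([super_tokens[:, None, :], windows], axis=1)

# Toy sizes chosen only for illustration (not the paper's configuration).
h, w, d, M = 8, 8, 4, 4
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((h, w, d))
super_tokens = np.zeros(((h // M) * (w // M), d))  # stand-in for learnable vectors
X = partition_with_super_tokens(feature_map, super_tokens, M)
print(X.shape)  # (4, 17, 4)
```

With an 8×8 feature map and 4×4 windows, this gives four windows of 16 patch tokens each, plus one Super token per window.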

2. Local Aggregation via Window-based Self-Attention

Within each window, local modeling is achieved through window-based multi-head self-attention (WMSA). Each window input $X_i^{\ell} \in \mathbb{R}^{(M^2+1) \times d}$, with the Super token as its first row, undergoes standard QKV projections, attention computation, and an MLP-based feed-forward network (FFN), including per-channel LayerScale parameters $\lambda_1, \lambda_2 \in \mathbb{R}^d$ and LayerNorm (LN), which in the usual pre-norm residual form reads:

$$Z_i^{\ell} = X_i^{\ell} + \lambda_1 \odot \mathrm{WMSA}(\mathrm{LN}(X_i^{\ell}))$$

$$\hat{X}_i^{\ell} = Z_i^{\ell} + \lambda_2 \odot \mathrm{FFN}(\mathrm{LN}(Z_i^{\ell}))$$

The updated Super token for window $i$ is the first row of $\hat{X}_i^{\ell}$, i.e., $\hat{s}_i^{\ell} = \hat{X}_i^{\ell}[0]$. Intra-window dependencies are thus fully integrated into the Super token at each layer.

This structure ensures retention of "fine-grained local detail" without patch merging, enabling per-window full-attention while minimizing global computational overhead (Farooq et al., 2021).
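A minimal NumPy sketch of such a local block follows. It assumes a pre-norm residual layout, a LayerNorm without learnable affine parameters, and a ReLU in the FFN for brevity. The parameter names (`Wq`, `gamma1`, and so on) and all sizes are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm over the channel axis, without learnable scale/shift (simplification).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention applied independently inside each window.
    X has shape (num_windows, T, d); tokens never attend across windows."""
    nw, T, d = X.shape
    dh = d // num_heads

    def split(t):  # (nw, T, d) -> (nw, heads, T, dh)
        return t.reshape(nw, T, num_heads, dh).transpose(0, 2, 1, 3)

    q, k, v = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(dh))  # (nw, heads, T, T)
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(nw, T, d)
    return out @ Wo

def local_block(X, p, num_heads):
    # Residual attention and FFN, each scaled per channel by a LayerScale vector.
    attn_out = window_attention(layer_norm(X), p["Wq"], p["Wk"], p["Wv"], p["Wo"], num_heads)
    Z = X + p["gamma1"] * attn_out
    hidden = np.maximum(layer_norm(Z) @ p["W1"], 0.0)  # ReLU stands in for the FFN activation
    return Z + p["gamma2"] * (hidden @ p["W2"])

rng = np.random.default_rng(0)
num_windows, T, d, heads = 4, 17, 8, 2  # toy sizes, T = M*M + 1 with M = 4
params = {name: rng.standard_normal((d, d)) * 0.1 for name in ("Wq", "Wk", "Wv", "Wo")}
params["W1"] = rng.standard_normal((d, 4 * d)) * 0.1
params["W2"] = rng.standard_normal((4 * d, d)) * 0.1
params["gamma1"] = np.full(d, 0.1)
params["gamma2"] = np.full(d, 0.1)

X = rng.standard_normal((num_windows, T, d))
X_hat = local_block(X, params, heads)
super_tokens = X_hat[:, 0, :]  # first row of each window is the updated Super token
print(X_hat.shape, super_tokens.shape)  # (4, 17, 8) (4, 8)
```

Because attention is computed per window, changing the tokens of one window leaves the outputs of every other window untouched. This is the isolation property that the global step later relaxes.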

3. Global Dependency Modeling with Super Token Mixer

Following local aggregation, all updated Super tokens $\hat{s}_i^{\ell}$ are assembled into a matrix $S^{\ell} = [\hat{s}_1^{\ell}; \dots; \hat{s}_W^{\ell}] \in \mathbb{R}^{W \times d}$. The global interaction step applies a separable-convolution "Super Token Mixer" (STM), combining depth-wise convolutions over tokens and point-wise convolutions over channels. Schematically:

$$\tilde{S}^{\ell} = \mathrm{DWConv}(S^{\ell}; K_{dw})$$

$$S^{\ell+1} = \mathrm{PWConv}(\tilde{S}^{\ell}; K_{pw})$$

where $K_{dw}$ and $K_{pw}$ are learnable convolutional kernels. This global mixing is applied only to the $W$ Super tokens, decoupling the cost of global attention from $N$, the potentially large number of patch tokens.

Super tokens thus serve as the minimal "interaction carriers" (in the ICT sense), concentrating all inter-patch dependencies and relational information necessary for downstream global representation learning (Farooq et al., 2021).
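One way to picture a separable mixer over Super tokens is the sketch below. It makes several assumptions that the text does not confirm: the Super tokens are treated as a 1-D sequence, the depth-wise step is a zero-padded 1-D filter per channel along the token axis, and normalization, activations, and residual connections are omitted. The function name and kernel shapes are hypothetical.

```python
import numpy as np

def super_token_mixer(S, k_dw, k_pw):
    """Separable mixing over Super tokens.

    S:    (W, d) matrix of Super tokens.
    k_dw: (K, d) depth-wise kernel, one length-K filter per channel,
          slid along the token axis with zero padding (K assumed odd).
    k_pw: (d, d) point-wise kernel that mixes channels at each token.
    """
    W, d = S.shape
    K = k_dw.shape[0]
    pad = K // 2
    padded = np.pad(S, ((pad, pad), (0, 0)))
    mixed = np.zeros_like(S)
    for offset in range(K):
        # Each channel is filtered independently across neighbouring tokens.
        mixed += padded[offset:offset + W] * k_dw[offset]
    return mixed @ k_pw  # 1x1 (point-wise) convolution over channels

rng = np.random.default_rng(0)
W, d, K = 16, 8, 3  # toy sizes
S = rng.standard_normal((W, d))
k_dw = rng.standard_normal((K, d)) * 0.1
k_pw = rng.standard_normal((d, d)) * 0.1
S_next = super_token_mixer(S, k_dw, k_pw)
print(S_next.shape)  # (16, 8)
```

The function takes only the `(W, d)` Super token matrix as input, so its cost does not depend on the number of patch tokens.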

4. Architectural Details: Isotropy, Layer Budget, and Token Flow

The Super Token Transformer (STT) maintains isotropy, that is, a constant embedding dimension $d$, across all layers. The initial convolutional stem has stride 8, so the $h \times w$ feature map has one eighth of the input resolution along each side and contains $N = hw$ patch tokens. Windows of size $M \times M$ yield $W = N/M^2$ Super tokens. Each layer consists of two sub-blocks: a local WMSA+FFN operating on the $M^2 + 1$ tokens of each window, followed by STM over all $W$ Super tokens. Each layer therefore keeps $N + W$ tokens in memory.

In the final classification block, all $N$ patch tokens are discarded: global reasoning and the class embedding are learned with the $W$ Super tokens and one CLS token using two standard transformer layers (Farooq et al., 2021).
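The token bookkeeping can be made concrete with a little arithmetic. In the sketch below only the stride of 8 comes from the text. The 224-pixel input and the window size of 7 are assumed values chosen for illustration, and `token_budget` is a hypothetical helper.

```python
def token_budget(image_size, stride, M):
    """Count tokens at each stage for a square input, a strided stem,
    and non-overlapping M x M windows."""
    assert image_size % stride == 0, "stem stride must divide the image size"
    side = image_size // stride
    assert side % M == 0, "window size must divide the feature-map side"
    N = side * side        # patch tokens
    W = N // (M * M)       # windows, hence Super tokens
    return {
        "side": side,
        "N": N,
        "W": W,
        "tokens_per_window": M * M + 1,  # patches plus one Super token
        "tokens_per_layer": N + W,       # all patches and Super tokens are kept
        "classifier_tokens": W + 1,      # Super tokens plus one CLS token
    }

# image_size=224 and M=7 are assumptions for illustration; stride=8 is from the text.
budget = token_budget(image_size=224, stride=8, M=7)
print(budget)
```

Under these assumed settings the encoder carries 800 tokens per layer, while the classification block sees only 17.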

5. Computational Complexity and Throughput Analysis

STT's local module (WMSA) incurs an attention cost of $\mathcal{O}(N M^2 d)$ per layer, as does windowed self-attention in other frameworks. The STM and the classification MSA operate only on the $W$ Super tokens (plus one CLS token in the classifier), so their costs scale with $W$ rather than $N$. For representative settings, where $W \ll N$, the total cost is dominated by the local WMSA term, which is significantly less than the $\mathcal{O}(N^2 d)$ of full-image MSA.
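A rough count of attention-map multiply-adds shows where the savings come from. The sketch counts only the two token-by-token matrix products inside attention and ignores projections, the FFN, and the STM. The values `N=784`, `M=7`, and `d=384` are assumptions for illustration, not settings reported in the text.

```python
def attention_map_cost(num_tokens, d):
    """Multiply-adds for QK^T plus attention-times-V over one group of tokens."""
    return 2 * num_tokens * num_tokens * d

def stt_vs_full(N, M, d):
    W = N // (M * M)
    local = W * attention_map_cost(M * M + 1, d)  # W windows of M*M + 1 tokens each
    classifier = attention_map_cost(W + 1, d)     # Super tokens plus CLS
    full = attention_map_cost(N, d)               # every patch attends to every patch
    return {
        "local_wmsa": local,
        "classifier_msa": classifier,
        "full_msa": full,
        "full_over_local": full / local,
    }

# Hypothetical settings, chosen only to make the comparison concrete.
costs = stt_vs_full(N=784, M=7, d=384)
print(costs)
```

Under these assumptions the full-image attention map costs about 15 times more than the windowed one, and the classifier attention over 17 tokens is negligible by comparison.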

Comparisons are as follows:

| Model | Params | Throughput (img/s) | Top-1 Acc. (%) |
|---|---|---|---|
| STT-S25 | 49M | 735 | 83.5 |
| Swin-B | 88M | 278 | 83.3 |
| DeiT-B | 87M | 291 | 83.4 |

This configuration yields a drastic reduction in quadratic global attention cost: in the comparison above, STT-S25 reaches roughly 2.5 to 2.6 times the throughput of Swin-B and DeiT-B with about 56% of their parameters, at comparable accuracy (Farooq et al., 2021).
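The relative gains can be read directly off the reported figures. The short script below only does arithmetic on the three rows of the comparison, and the dictionary layout is an illustrative choice.

```python
# Figures copied from the comparison table: parameters in millions,
# throughput in images per second, top-1 accuracy in percent.
rows = {
    "STT-S25": {"params_m": 49, "throughput": 735, "top1": 83.5},
    "Swin-B": {"params_m": 88, "throughput": 278, "top1": 83.3},
    "DeiT-B": {"params_m": 87, "throughput": 291, "top1": 83.4},
}

ref = rows["STT-S25"]
ratios = {}
for name, row in rows.items():
    if name == "STT-S25":
        continue
    ratios[name] = {
        "speedup": ref["throughput"] / row["throughput"],
        "param_fraction": ref["params_m"] / row["params_m"],
        "top1_delta": round(ref["top1"] - row["top1"], 1),
    }

for name, r in ratios.items():
    print(f"{name}: {r['speedup']:.2f}x throughput, "
          f"{r['param_fraction']:.0%} of the parameters, "
          f"{r['top1_delta']:+.1f} top-1")
```

Both baselines come out at roughly 2.5 to 2.6 times slower with a little under twice the parameters, which is a substantial gain but well short of a tenfold one.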

6. ICT Properties: Isolation, Efficiency, and Interpretability

Super tokens embody three defining properties of the ICT framework:

  1. Aggregation of local patch interaction: Each Super token compresses all patch-based interactions in its spatial window via WMSA.
  2. Mediation of global cross-window dependencies: Global mixing is limited to the Super tokens in STM, transforming a cost that is quadratic in the number of patch tokens $N$ into a much smaller cost that depends only on the number of Super tokens $W$.
  3. Contextual redistribution: Updated Super tokens inject global context back into their respective windows in each subsequent layer, completing the interaction loop.
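The three properties together form a loop that can be traced with toy stand-ins. In the sketch below, `toy_local` and `toy_global` are deliberately crude averaging functions used in place of WMSA and STM, and all names and sizes are hypothetical. The only purpose is to show how information from one window reaches the patch tokens of another.

```python
import numpy as np

def ict_layer(X, local_fn, global_fn):
    """One pass of the interaction loop. X: (W, M*M + 1, d), row 0 = Super token."""
    X = local_fn(X)            # 1. aggregation inside each window
    S = global_fn(X[:, 0, :])  # 2. mediation: only Super tokens mix across windows
    X = X.copy()
    X[:, 0, :] = S             # 3. redistribution: the next local step reads these
    return X

def toy_local(X):
    # Stand-in for WMSA: every token receives its own window's mean.
    return X + X.mean(axis=1, keepdims=True)

def toy_global(S):
    # Stand-in for STM: every Super token receives the mean over all windows.
    return S + S.mean(axis=0, keepdims=True)

def run(X, num_layers):
    for _ in range(num_layers):
        X = ict_layer(X, toy_local, toy_global)
    return X

rng = np.random.default_rng(0)
X_a = rng.standard_normal((4, 17, 8))
X_b = X_a.copy()
X_b[1, 1:, :] += 1.0  # perturb only the patch tokens of window 1

one_a, one_b = run(X_a, 1), run(X_b, 1)
two_a, two_b = run(X_a, 2), run(X_b, 2)
print(np.allclose(one_a[0, 1:], one_b[0, 1:]))  # True: window 0 patches unaffected after one layer
print(np.allclose(two_a[0, 1:], two_b[0, 1:]))  # False: global context has reached them
```

After one layer only the Super token of window 0 has seen the perturbation in window 1. The patch tokens of window 0 pick it up in the second layer, through their own Super token.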

A key distinction is the retention of all patch tokens until the classifier stage, which avoids the loss of spatial resolution found in hierarchical or pyramidal designs. The isotropic nature (constant $d$) further simplifies analysis and model visualization (Farooq et al., 2021).

7. Significance, Performance, and Theoretical Implications

Super tokens operationalize the ICT paradigm by localizing all inter-patch and global dependencies within a minimal set of tokens, granting explicit architectural control over computational costs and global context exchange. On ImageNet-1K, the reported results show accuracy on par with hierarchical (Swin-B) and isotropic full-attention (DeiT-B) baselines at substantially higher throughput and with fewer parameters.

This suggests that strict token-centric mediation, as implemented in STT, can achieve favorable trade-offs between efficiency, representational fidelity, and architectural simplicity. The resulting framework is amenable to downstream analysis, interpretability, and potentially broader applications in visual recognition and sequence modeling (Farooq et al., 2021).

References (1)