
Token Pooling (TP) in Transformers

Updated 28 January 2026
  • Token Pooling (TP) is a set of techniques that aggregate token representations using methods like clustering, max/mean pooling, and dynamic segmentation to reduce sequence length and computational overhead.
  • Mathematical formulations in TP leverage pooling matrices and clustering algorithms to optimize token reduction, yielding significant FLOP savings and memory reductions in vision, language, and retrieval tasks.
  • TP integrates seamlessly into transformer architectures at varying stages, providing adaptive, parameter-efficient modules that enhance accuracy, robustness, and overall model performance.

Token Pooling (TP) encompasses a class of techniques that reduce, aggregate, or merge representations across sets of tokens in neural sequence models, with the dual aim of enhancing computational efficiency and controlling redundancy while often retaining or improving downstream task performance. TP is implemented through a variety of mechanisms—including hard and soft clustering, max/mean/generalized pooling, similarity-based merging, and learned dynamic segmentation—across domains such as vision transformers, LLMs, retrieval systems, and decentralized finance. These operations are typically inserted as explicit layers or modules within Transformer-style architectures, between blocks or at the output, often parameter-free or with minimal additional parameters.

1. Mathematical Formulations of Token Pooling

Across implementations, TP is formalized as an operator acting on a matrix of token representations $X\in\mathbb{R}^{n\times d}$, outputting a shorter or pooled sequence $Y\in\mathbb{R}^{m\times d}$, typically with $m<n$. The reduction from $n$ to $m$ is accomplished either by aggregation—e.g., pooling, averaging, or taking a maximum over segments—or by selection/merging based on data-aware criteria.

Hierarchical Vision Transformers: In hierarchical vision transformers, such as HVT, TP is implemented via local pooling operators (e.g., 1D max-pooling with kernel $k$ and stride $s$), formalized as

$$Y = PX$$

where $P\in\mathbb{R}^{m\times n}$ is a pooling matrix, and after pooling, positional embeddings are added:

$$\widetilde{Y} = Y + E_m$$

Each subsequent stage processes the shortened sequence in the next block (Pan et al., 2021).
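This pooling stage can be written as a strided 1D max-pool over the token axis, with fresh positional embeddings added for the shortened length. The sketch below is illustrative only: the function name `hvt_token_pool`, the concrete shapes, and the zero-initialized stand-in for $E_m$ are assumptions, not details from the paper.

```python
import numpy as np

def hvt_token_pool(X, k=3, s=2):
    """1D max-pool a token sequence with kernel k and stride s.
    X: (n, d) token matrix -> (m, d) pooled matrix, m = (n - k) // s + 1."""
    n, d = X.shape
    m = (n - k) // s + 1
    # each output token is the elementwise max over a window of k input tokens
    return np.stack([X[i * s : i * s + k].max(axis=0) for i in range(m)])

# Pool 16 tokens of width 8 down to 7, then re-add positional embeddings
# for the new length (E_m is a zero placeholder for the learned embedding).
X = np.random.randn(16, 8)
Y = hvt_token_pool(X)
E_m = np.zeros_like(Y)
Y_tilde = Y + E_m
```

The next stage of the hierarchy would then run its attention blocks on `Y_tilde` instead of the full-length sequence.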

Clustering-Based Methods: In data-aware clustering variants, TP solves for cluster centers or medoids that minimize a reconstruction objective

$$\min_{S_1,\dots,S_K} \sum_{k=1}^K \sum_{x_i \in S_k} \lVert x_i - c_k \rVert_2^2$$

where $c_k = \frac{1}{|S_k|}\sum_{x_i\in S_k} x_i$ (Clavié et al., 2024, Marin et al., 2021).
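A minimal NumPy sketch of this clustering objective, using plain Lloyd-style K-means; the function name `kmeans_token_pool`, the iteration count, and the shapes are assumptions for illustration, not the exact weighted-medoid procedure of the cited papers:

```python
import numpy as np

def kmeans_token_pool(X, K, iters=5, seed=0):
    """Pool n tokens down to K centroids minimizing within-cluster
    squared error. X: (n, d) tokens -> (K, d) pooled representations."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)]      # init centers from tokens
    for _ in range(iters):
        # assign each token to its nearest center
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # recompute centers as cluster means (keep old center if cluster empty)
        for k in range(K):
            members = X[assign == k]
            if len(members):
                C[k] = members.mean(axis=0)
    return C

tokens = np.random.randn(196, 64)         # e.g. 14x14 patch tokens
pooled = kmeans_token_pool(tokens, K=49)  # 4x token reduction
```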

Group and Generalized Pooling: In vision tasks, generalized mean (GeM) and group generalized mean (GGeM) pooling schemes generalize max and average pooling via

$$v_d^{(g)} = \left(\frac{1}{|X_d|}\sum_{x\in X_d} x^{p_g}\right)^{1/p_g}$$

with $p_g$ trainable (Ko et al., 2022).
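The GeM operator itself is only a few lines. The sketch below (the function name `gem_pool` and the positivity clamp are assumptions for illustration) shows how the exponent interpolates between average pooling ($p=1$) and max pooling (large $p$):

```python
import numpy as np

def gem_pool(X, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over a set of token activations.
    p=1 recovers average pooling; as p grows it approaches max pooling.
    Activations are clamped positive before the fractional power."""
    Xc = np.clip(X, eps, None)
    return (Xc ** p).mean(axis=0) ** (1.0 / p)

X = np.abs(np.random.randn(196, 64))  # patch tokens (nonnegative for the mean)
v = gem_pool(X, p=3.0)                # (64,) pooled descriptor
```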

2. Computational Motives and Efficiency Gains

The principal computational justification for TP is the reduction in model complexity by shrinking the sequence length, which typically lowers self-attention cost ($O(m^2 d)$ for $m$ tokens) and the cost of projection and feed-forward layers.

For instance, in HVT, halving the number of tokens results in block-wise computational savings expressed as

$$\phi_{n,d} = 12 n d^2 + 2 n^2 d \;\rightarrow\; \phi_{n/2,d} = 6 n d^2 + \tfrac{1}{2} n^2 d$$

with a per-stage compression ratio $\alpha\in(2,4)$ (Pan et al., 2021). In practice, using TP allows scaling up model width, depth, or input resolution within a constant FLOP budget.
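This compression ratio can be checked with a few lines of arithmetic (the shapes $n=196$, $d=768$ are assumed ViT-Base-like values, not taken from the paper):

```python
def block_flops(n, d):
    # phi_{n,d} = 12 n d^2 + 2 n^2 d: dense projections + attention maps
    return 12 * n * d**2 + 2 * n**2 * d

n, d = 196, 768
alpha = block_flops(n, d) / block_flops(n // 2, d)
# alpha lies in (2, 4): near 2 when d dominates, near 4 when n dominates
```

For these shapes the projection term dominates, so `alpha` comes out close to 2; attention-heavy regimes (large $n$, small $d$) push it toward 4.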

In vision transformers, over 80% of FLOPs are spent in the fully-connected and projection layers, so downsampling tokens between blocks provides multiplicative FLOP savings compared to attention-only approximations (Marin et al., 2021).

In ColBERT-style retrieval systems, clustering-based TP methods can reduce the number of stored vectors per document by up to 75% with minimal performance degradation, yielding significant memory and storage reductions (Clavié et al., 2024).

3. Data-Aware and Adaptive Pooling Strategies

Token pooling advances beyond uniform pooling by incorporating data-driven and dynamic strategies:

  • Clustering (K-means, hierarchical, medoids): Used in both vision (Marin et al., 2021) and retrieval (Clavié et al., 2024), the clustering-based approach selects representative tokens that minimize within-group or assignment-based reconstruction error, producing superior trade-offs compared to grid or uniform downsampling and score-based top-k methods.
  • Similarity-Based Merging: PPT utilizes bipartite soft matching (BSM) for pairwise merging of similar tokens. The most similar token pairs are combined, and token "sizes" are tracked for downstream attention reweighting (Wu et al., 2023).
  • Dynamic Pooling in LLMs: Sequence segmentation boundaries are predicted autoregressively, yielding variable-length segments that align with natural language units. Pooling and subsequent upsampling are performed according to learned or supervised segmentations, substantially improving bits-per-character and speed under a fixed computation budget (Nawrot et al., 2022).
  • Parameter-Free and Adaptive Fusion: Some approaches, such as PoNet, fuse multi-granularity global, segmental, and local pooling outputs, replacing self-attention to achieve $O(Nd^2)$ complexity with linear scaling in sequence length (Tan et al., 2021). In TAP, locally adaptive average pooling neighborhoods per token yield robustness to input corruptions (Guo et al., 2023).
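The similarity-based merging strategy above can be sketched as a simplified bipartite-soft-matching pass; the function name `bipartite_merge`, the alternating A/B split, and the greedy edge selection are illustrative assumptions rather than the exact PPT procedure:

```python
import numpy as np

def bipartite_merge(X, r):
    """Bipartite soft matching sketch: split tokens into alternating sets
    A and B, match each A-token to its most similar B-token, merge the r
    highest-similarity pairs by size-weighted averaging, and track per-token
    'sizes' for later attention reweighting."""
    A, B = X[0::2].copy(), X[1::2].copy()
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim = An @ Bn.T                        # cosine similarity, A x B
    best = sim.argmax(axis=1)              # best B partner per A-token
    order = np.argsort(-sim.max(axis=1))   # A-tokens ranked by edge score
    sizes = np.ones(len(B))                # tokens absorbed into each B slot
    keep_A = np.ones(len(A), dtype=bool)
    for i in order[:r]:
        j = best[i]
        # fold A[i] into B[j] as a running (size-weighted) average
        B[j] = (B[j] * sizes[j] + A[i]) / (sizes[j] + 1)
        sizes[j] += 1
        keep_A[i] = False
    return np.vstack([A[keep_A], B]), sizes

X = np.random.randn(16, 8)
Y, sizes = bipartite_merge(X, r=4)         # 16 -> 12 tokens
```

The returned `sizes` vector is what downstream attention would use to reweight merged tokens so that a token representing several originals counts proportionally more.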

4. Architectures and Integration Points

Token pooling may be injected at different architectural stages, including:

  • Between Transformer Blocks: In hierarchical models (HVT, PSViT), blocks are organized into stages separated by explicit pooling downsampling layers (Pan et al., 2021, Chen et al., 2021).
  • Within Select Blocks: PPT inserts pooling or pruning in specified blocks, guided by input-adaptive policies (Wu et al., 2023).
  • At Output: Pooling may replace the class ([CLS]) token at the network terminus. Average or max-pooled patch tokens yield improved classification performance over the [CLS] embedding (Pan et al., 2021, Behrendt et al., 21 May 2025).
  • During Index Time: In retrieval, TP is applied only at indexing, not query time, ensuring no inference overhead (Clavié et al., 2024).
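The index-time variant can be illustrated with a small sketch, using a greedy most-similar-pair merge as a stand-in for the hierarchical clustering of Clavié et al.; the function name `pool_doc_tokens` and the pooling loop are assumptions for illustration:

```python
import numpy as np

def pool_doc_tokens(doc_emb, pool_factor=2):
    """Index-time token pooling for a late-interaction index (sketch).
    Greedily merges the most cosine-similar pair of vectors until only
    ceil(n / pool_factor) remain; query embeddings stay untouched."""
    vecs = [v for v in doc_emb.astype(float)]
    target = -(-len(vecs) // pool_factor)        # ceil division
    while len(vecs) > target:
        M = np.stack(vecs)
        Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
        sim = Mn @ Mn.T
        np.fill_diagonal(sim, -np.inf)           # ignore self-similarity
        i, j = np.unravel_index(sim.argmax(), sim.shape)
        merged = (vecs[i] + vecs[j]) / 2
        for idx in sorted((i, j), reverse=True):
            vecs.pop(idx)
        vecs.append(merged)
    return np.stack(vecs)

doc = np.random.randn(32, 128)          # 32 token vectors for one document
index_vecs = pool_doc_tokens(doc, 2)    # 16 stored vectors: 50% memory saved
```

Because pooling happens once per document at indexing, query-time scoring runs exactly as before, just over fewer stored vectors.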

5. Empirical Outcomes and Trade-Offs

Across domains and pooling techniques, TP provides consistent improvements:

  • Vision Transformers: Hierarchical TP (HVT) matches baseline FLOPs and improves ImageNet/CIFAR-100 top-1 by over 3 percentage points, with similar improvements in PSViT and PPT under equivalent or reduced FLOPs (Pan et al., 2021, Chen et al., 2021, Wu et al., 2023).
  • Retrieval: Clustering-based TP in ColBERT-style indexes attains reductions of 50–75% in vector count at <5% loss in retrieval metrics, and <1–2% for practical pooling factors (Clavié et al., 2024).
  • Language Modeling/Long Sequences: Dynamic TP in Transformers delivers lower perplexity and ~2.5× speedup for bits-per-character tasks, across diverse morphologies and languages (Nawrot et al., 2022).
  • Token Pooling vs. Token Pruning: TP yields demonstrably lower reconstruction error and superior trade-off than top-k importance-based pruning approaches (Marin et al., 2021).
  • Robustness: Adaptive pooling layers such as TAP significantly increase model robustness to input corruptions in both classification and segmentation (Guo et al., 2023).
  • Output Pooling: Aggregating patch tokens (average or group-wise GeM) rather than taking the [CLS] token alone yields higher discriminative power for classification (Ko et al., 2022, Pan et al., 2021, Behrendt et al., 21 May 2025).
  • Efficient Market Pooling: In decentralized exchanges, dynamic token pooling underpins single-pool AMMs that permit arbitrary asymmetric liquidity provision and self-balancing portfolios, offering lower slippage and improved utility over constant-weight pools—while introducing new economic attack surfaces (Kositwattanarerk, 30 Jul 2025).

6. Practical Considerations and Design Guidelines

Research offers detailed practical recommendations:

  • Clustering Details: Weighted K-medoids or hierarchical (Ward) clustering is preferred for TP in both vision and retrieval settings, with 4–5 iterations typically sufficient at practical token counts ($n\leq 256$) (Marin et al., 2021, Clavié et al., 2024).
  • Preserving [CLS]: For classification, always exclude the [CLS] token from pooling; post-pooling, aggregate patch tokens by average for maximal class discrimination (Pan et al., 2021, Behrendt et al., 21 May 2025).
  • Pooling Factor: A pooling factor of $p=2$ (halving the token count) is a safe starting point, giving substantial memory reduction with negligible loss; higher reduction factors can be applied with only minimal additional loss (Clavié et al., 2024).
  • Scheduling: Insert pooling only between major stages of a hierarchical model or at shallow layers of flat transformers; avoid pooling after every block to prevent loss of spatial resolution (Chen et al., 2021).
  • Parameterization: For group-wise adaptive pooling, align group count with network heads, and initialize generalized pooling exponents to 3–5 (Ko et al., 2022).
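Several of these guidelines can be combined in a short output-pooling sketch; the function name `ggem_pool` and the shared scalar exponent are simplifying assumptions (in practice each group's exponent is a trainable parameter initialized in the 3–5 range):

```python
import numpy as np

def ggem_pool(X, groups=12, p=3.0, eps=1e-6):
    """Group generalized-mean (GGeM) output pooling sketch.
    Channels are split into `groups` groups (aligned with the attention-head
    count, per the guideline above); each group is pooled with a generalized
    mean. X: (n, d) patch tokens, [CLS] already excluded -> (d,) descriptor."""
    n, d = X.shape
    Xc = np.clip(X, eps, None).reshape(n, groups, d // groups)
    return ((Xc ** p).mean(axis=0) ** (1.0 / p)).reshape(d)

patches = np.abs(np.random.randn(196, 768))   # patch tokens, [CLS] dropped
feat = ggem_pool(patches, groups=12, p=3.0)   # (768,) classifier input
```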

7. Limitations, Extensions, and Future Directions

Token pooling introduces certain limitations and open areas:

  • Clustering Overhead: Weighted K-medoids and hierarchical clustering scale quadratically in token count, but this is mitigated by small per-image sequence lengths typical in vision (Marin et al., 2021, Clavié et al., 2024).
  • Selection of Pooled Sequence Length: Static per-layer token count requires tuning or learned selection; dynamic per-input selection is proposed but not trivial (Marin et al., 2021).
  • Language Structure: For dynamic token pooling in LLMs, automatic boundary inference works better in some languages and morphological regimes than others, highlighting a need for further research in adaptive or linguistically-motivated segmenters (Nawrot et al., 2022).
  • Vulnerabilities in AMMs: Permitting one-sided (asymmetric) token pooling introduces flash-loan attack vectors, necessitating countermeasures such as delayed settlement or time-weighted exponents (Kositwattanarerk, 30 Jul 2025).
  • Robustness vs. Efficiency: While TP improves efficiency and sometimes robustness, overaggressive pooling can degrade performance—an optimal schedule and aggregation scheme remains domain- and architecture-dependent.

Token pooling continues to be a central mechanism for both scaling and compressing deep transformer models, offering interpretable, parameter-efficient alternatives to sparsification or architectural pruning while maintaining competitive downstream task accuracy, efficiency, and, in some cases, robustness (Pan et al., 2021, Marin et al., 2021, Ko et al., 2022, Wu et al., 2023, Clavié et al., 2024, Behrendt et al., 21 May 2025, Nawrot et al., 2022, Guo et al., 2023, Kositwattanarerk, 30 Jul 2025).
