
Token Pooling in Neural Networks

Updated 19 February 2026
  • Token pooling is an operation that aggregates token embeddings via mean, max, or learnable weights to form compact global representations or reduce sequence length.
  • It is applied across text, vision, and speech models, employing both parameter-free and adaptive strategies to improve scalability and task-specific performance.
  • Advanced techniques like ATA, PMA, and clustering-based pooling optimize information preservation, address over-squashing, and balance efficiency with accuracy.

Token pooling is a class of operations in neural network architectures—most notably Transformers—that aggregate or reduce sets of token-level embeddings, either to produce compact global representations (sequence embedding) or to efficiently reduce sequence length (downsampling) for scalability, storage, or compute efficiency. Token pooling appears in encoder and decoder models for language, vision, and speech, and spans both parameter-free and learnable strategies. Its design and placement are central to information preservation, inductive bias, computational scaling, and task-specific performance in dense retrieval, classification, multi-vector retrieval, and hierarchical modeling.

1. Fundamental Pooling Strategies and Mathematical Formulations

Token pooling methods are defined by how they aggregate hidden states $H = \{h_1, \ldots, h_K\}$ ($h_i \in \mathbb{R}^d$) into one or more output vectors. Classic pooling strategies include:

  • Mean Pooling: $z = \frac{1}{K} \sum_{i=1}^K h_i$. This is parameter-free, invariant to token order, and serves as a baseline in both text and vision models. It can dilute salient signals by treating stopwords or uninformative background equally with informative tokens (Pan et al., 31 Aug 2025).
  • Last-token/CLS Pooling: $z = h_K$ (last token) or $z = h_{[CLS]}$ (first, special aggregate token), as employed in BERT. This preserves causal structure (in decoder models) but risks "over-squashing": distant tokens have vanishing influence, and salient local evidence can be lost, particularly for long sequences (Ding et al., 18 Nov 2025).
  • Max Pooling: $z_j = \max_{i} h_{i,j}$. Enhances local salience by capturing dominant activations, but discards finer token interactions (Behrendt et al., 21 May 2025).
  • Weighted Pooling: $z = \sum_i w_i h_i$, with $w \in \Delta^K$. The weights $w$ can be static and hand-crafted (e.g., linearly increasing) or dynamically computed (attention-driven or learned). Examples include Anchor Token Aware (ATA) pooling via attention-based weighting, and learnable pooling queries as in PMA (Pan et al., 31 Aug 2025, Qin et al., 24 Dec 2025).
  • Clustering/Pooling by Reduction: Groups similar tokens (e.g., via k-means/hierarchical clustering) and replaces groups with centroid vectors, aggressively reducing the number of tokens for memory/compute efficiency (Wu et al., 2023, Clavié et al., 2024).
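As a concrete reference, the classic strategies above can be sketched in a few lines of NumPy. This is a minimal illustration only; shapes follow the notation above, with $K$ tokens of dimension $d$:

```python
import numpy as np

def mean_pool(H):
    """Parameter-free mean pooling: z = (1/K) * sum_i h_i."""
    return H.mean(axis=0)

def max_pool(H):
    """Element-wise max pooling: z_j = max_i h_{i,j}."""
    return H.max(axis=0)

def weighted_pool(H, w):
    """Weighted pooling: z = sum_i w_i h_i, with w normalized onto the simplex."""
    w = np.asarray(w, dtype=H.dtype)
    w = w / w.sum()
    return w @ H  # (K,) @ (K, d) -> (d,)

# Toy example: K=4 tokens, hidden size d=3
H = np.array([[1., 0., 2.],
              [3., 1., 0.],
              [0., 2., 1.],
              [2., 1., 1.]])

z_mean = mean_pool(H)                        # -> [1.5, 1.0, 1.0]
z_max = max_pool(H)                          # -> [3.0, 2.0, 2.0]
z_last = weighted_pool(H, [0., 0., 0., 1.])  # last-token pooling as a special case
```

Note that last-token pooling is recovered as weighted pooling with a one-hot weight vector, which makes the over-squashing contrast with mean pooling easy to see.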

Specialized pooling mechanisms include group-wise operations (e.g., Group Generalized Mean, GGeM, in ViT (Ko et al., 2022)), frequency-domain pooling (SPAM in SPANet (Yun et al., 2023)), and adaptive/instance-aware methods (e.g., ContextPool (Huang et al., 2022) or dynamic pooling (Nawrot et al., 2022)).

2. Advanced Attention-driven and Learnable Pooling Operators

Recent work emphasizes parameter-free or lightly-parameterized pooling that adapts to content and model structure:

  • Anchor Token Aware (ATA) Pooling: Weights tokens by the total attention they receive in the last layer:

$$w_i = \sum_{h=1}^{H_{\mathrm{head}}} \sum_{j=1}^{K} \log\left(K \cdot a_{ij}^{h} + 1\right), \qquad \tilde{w}_i = \frac{w_i}{\sum_j w_j}, \qquad v = \sum_i \tilde{w}_i h_i.$$

Here, tokens that act as “hubs” for attention (anchor tokens) are emphasized, improving embedding quality without extra parameters (Pan et al., 31 Aug 2025).
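A minimal NumPy sketch of the ATA weighting above. The attention-map indexing convention is an assumption on my part; as a sanity check, uniform attention collapses ATA to mean pooling:

```python
import numpy as np

def ata_pool(H, A):
    """Anchor Token Aware (ATA) pooling sketch (after Pan et al., 31 Aug 2025).

    H: (K, d) last-layer hidden states.
    A: (n_heads, K, K) last-layer attention maps, indexed so that a[h, i, j]
       contributes to token i's anchor weight (an assumed convention).
    """
    K = H.shape[0]
    # w_i = sum_h sum_j log(K * a_{ij}^h + 1)
    w = np.log(K * A + 1.0).sum(axis=(0, 2))  # -> (K,)
    w_tilde = w / w.sum()                     # normalize weights to sum to 1
    return w_tilde @ H                        # v = sum_i w~_i h_i

# Sanity check: with uniform attention every token is an equal "anchor",
# so ATA pooling reduces to plain mean pooling
K, d, n_heads = 4, 3, 2
H = np.arange(K * d, dtype=float).reshape(K, d)
A = np.full((n_heads, K, K), 1.0 / K)  # each attention row sums to 1
v = ata_pool(H, A)                     # equals H.mean(axis=0) here
```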

  • Pooling by Multi-head Attention (PMA): A set of learnable query vectors $q$ attends over all token states $H$ via cross-attention:

$$Q = q W_q, \quad K = H W_k, \quad V = H W_v; \qquad E = \mathrm{softmax}(Q K^{T}) V.$$

This enables flexible embedding sizes and allows the aggregation to focus on salient tokens, outperforming EOS-based and mean-pooling in code embedding (Qin et al., 24 Dec 2025).
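A single-head sketch of this learnable-query pooling. The $\sqrt{d_{head}}$ scaling is added here as standard attention practice and is not part of the formula above; all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pma_pool(H, q, Wq, Wk, Wv):
    """Single-head PMA sketch: learnable queries q attend over token states H.

    H: (K, d) token states; q: (m, d) learnable pooling queries;
    Wq, Wk, Wv: (d, d_head) projections. Returns E: (m, d_head).
    """
    Q, K_, V = q @ Wq, H @ Wk, H @ Wv
    scores = Q @ K_.T / np.sqrt(K_.shape[-1])  # scaling added as common practice
    return softmax(scores, axis=-1) @ V        # E = softmax(QK^T) V

# m pooling queries yield an m-vector embedding, decoupled from sequence length
rng = np.random.default_rng(0)
H, q = rng.normal(size=(16, 8)), rng.normal(size=(2, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
E = pma_pool(H, q, Wq, Wk, Wv)  # shape (2, 4)
```

Because the number of queries `m` is a free hyperparameter, the output size is chosen independently of the input length, which is what enables the flexible embedding sizes noted above.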

  • ContextPool and TAP: Learn pooling weights and contexts (e.g., via local/global MLPs, multi-dilated averaging), often in a content- and token-adaptive fashion, to enrich feature representations and stabilize attention (Huang et al., 2022, Guo et al., 2023).

3. Pooling as Sequence Reduction: Downsampling, Merging, and Clustering

Pooling mechanisms can serve to reduce the sequence length within or between transformer layers, directly impacting computational and memory complexity:

  • Token Pooling in Vision Transformers: Algorithms such as bipartite soft matching (PPT) or clustering-based techniques merge highly similar tokens (e.g., image patches with similar features) into size-weighted representatives, applied adaptively at selected layers. This achieves significant FLOP and memory reductions (e.g., 37% in DeiT-S) without sacrificing accuracy (Wu et al., 2023).
  • Hierarchical Pooling: Arranges transformers into stages with intermediate downsampling, emulating CNN feature pyramids. Each pool shrinks the sequence, reducing quadratic attention cost. For instance, HVT’s average pooling after each stage discards the CLS token in favor of aggregating all remaining tokens (Pan et al., 2021).
  • Clustering-based Pooling in Multi-Vector Retrieval: For IR models such as ColBERT, token pooling via k-means or hierarchical clustering can reduce stored document vectors by 50–75%, with minimal retrieval performance loss (≤3%), by replacing local, redundant token embeddings with centroids (Clavié et al., 2024).
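The clustering-based reduction can be sketched with a minimal k-means; this is a toy implementation for illustration, and a production index would use an optimized clustering library:

```python
import numpy as np

def cluster_pool(doc_tokens, pool_factor=2, n_iter=10, seed=0):
    """Clustering-based token pooling sketch (after Clavié et al., 2024).

    Replaces a document's K token embeddings with K // pool_factor centroids
    via a minimal k-means, shrinking multi-vector storage accordingly.
    doc_tokens: (K, d) array. Returns (K // pool_factor, d) centroids.
    """
    K = doc_tokens.shape[0]
    k = max(1, K // pool_factor)
    rng = np.random.default_rng(seed)
    centroids = doc_tokens[rng.choice(K, size=k, replace=False)]
    for _ in range(n_iter):
        # assign each token embedding to its nearest centroid
        d2 = ((doc_tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # recompute centroids; keep the old centroid if a cluster empties
        for c in range(k):
            members = doc_tokens[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

# A document with 8 token vectors is stored as 4 centroids (pool_factor=2)
tokens = np.random.default_rng(1).normal(size=(8, 16))
pooled = cluster_pool(tokens, pool_factor=2)  # shape (4, 16)
```

Since pooling happens only at index time, query-side scoring is unchanged, which matches the "only at index time" overhead noted in the table below.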

4. Information Aggregation, Inductive Bias, and Information Flow

Pooling design fundamentally affects the inductive bias and semantic information available to downstream tasks:

  • Over-squashing and Information Flow: Causal transformers utilizing last-token or EOS pooling suffer from gradient attenuation with depth and sequence length—early tokens’ influences decay rapidly. Mean pooling or explicit landmark/anchor-based pooling ensures broader aggregation, improving retrieval, classification, and long-context recall (Ding et al., 18 Nov 2025, Doshi et al., 29 Jan 2026).
  • Landmark Pooling (LMK): By partitioning sequences and inserting explicit landmark tokens whose embeddings are mean-pooled, LMK pooling balances local salience and global context. This avoids the CLS over-representation of early positions (due to rotary positional embeddings) and mean-pooling’s signal dilution, achieving superior long-context extrapolation and robust short-context retrieval (Doshi et al., 29 Jan 2026).
  • Pooling Granularity in PoNet: Multi-granularity pooling (global, segment, local) in token mixing allows capturing context at various scales, with pooling fusion combining these signals for each token. Ablations show that all scales contribute to generalization and accuracy, particularly in long-sequence contexts (Tan et al., 2021).
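A simplified sketch of the landmark insertion and pooling steps described above. The sentinel id and chunk size are illustrative assumptions, not the paper's actual tokenizer details:

```python
import numpy as np

LMK_ID = -1  # hypothetical sentinel id for the landmark token

def insert_landmarks(token_ids, chunk_size=4):
    """Partition the sequence and append a landmark token after each chunk."""
    out = []
    for i in range(0, len(token_ids), chunk_size):
        out.extend(token_ids[i:i + chunk_size])
        out.append(LMK_ID)
    return out

def landmark_pool(hidden, token_ids):
    """Mean-pool only the hidden states sitting at landmark positions."""
    mask = np.array([t == LMK_ID for t in token_ids])
    return hidden[mask].mean(axis=0)

ids = insert_landmarks(list(range(10)), chunk_size=4)
# -> [0, 1, 2, 3, LMK, 4, 5, 6, 7, LMK, 8, 9, LMK]
```

Each landmark summarizes its local chunk in-context, and the final mean over landmarks aggregates globally, which is how the scheme balances local salience against mean-pooling's dilution.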

5. Architectural Variants and Hybrid Pooling for Task Requirements

  • Vision Transformers (ViT, DeiT, PoolFormer, SPANet): Token pooling serves both as geometric downsampling (max/average pooling, convolutional stride) and as a channel-wise aggregation operator (GGeM, spectral pooling). Instance- and channel-adaptive pooling further leverages multi-head structures and spectral properties for improved accuracy and robustness (Ko et al., 2022, Yun et al., 2023).
  • Audio and Speech Models (HM-Conformer): Hierarchical pooling layers inserted into Conformer encoders (post-conv subsampling) reduce redundancy and focus model depth on increasingly global cues. Pooling, in conjunction with multi-level CLS token aggregation, strengthens spoofing detection performance, with ablation confirming additive contributions from both components (Shin et al., 2023).
  • Token Pooling in DeFi (R-Pool): In smart contract pooling (e.g., ERC-20R settlement), "token pooling" references the aggregation of settled/unsettled assets, risk-compensated LP share computation, and application of risk-adjusted exchange rates. This context utilizes pooling for automated market-making and settlement efficiency (Wang et al., 2023).
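The channel-grouped GGeM operator mentioned above can be sketched as follows. This is a simplified version under my own assumptions (a clamping floor for numerical safety, learnable exponents passed in explicitly); with exponent 1 it reduces to mean pooling:

```python
import numpy as np

def ggem_pool(H, p, n_groups):
    """Group Generalized Mean (GGeM) pooling sketch (after Ko et al., 2022).

    Channels are split into n_groups groups; group g is pooled over tokens with
    its own exponent p_g:  z_j = (mean_i h_{i,j}^{p_g})^{1/p_g}.
    Assumes non-negative activations, clamped for numerical safety.
    H: (K, d) with d divisible by n_groups; p: (n_groups,) exponents.
    """
    K, d = H.shape
    Hg = H.reshape(K, n_groups, d // n_groups)   # (K, G, d/G)
    pg = np.asarray(p, dtype=float).reshape(1, n_groups, 1)
    z = (np.clip(Hg, 1e-6, None) ** pg).mean(axis=0) ** (1.0 / pg[0])
    return z.reshape(d)

# p = 1 recovers mean pooling; large p approaches channel-wise max pooling
H = np.abs(np.random.default_rng(2).normal(size=(6, 8))) + 0.1
z_mean_like = ggem_pool(H, p=[1.0, 1.0], n_groups=2)  # == H.mean(axis=0)
```

Tying one exponent to each group lets the operator interpolate per channel group between mean-like and max-like behavior, which is the instance- and channel-adaptive property the bullet above refers to.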

6. Efficiency, Empirical Performance, and Trade-offs

Pooling schemes are selected and tuned for a balance between efficiency and representational power:

| Pooling Strategy | Compute Overhead | Accuracy Impact | Key Use-case Example |
|---|---|---|---|
| Mean/CLS/Max pooling | Minimal | Baseline; diluted/squashed signals | Vanilla BERT, ViT |
| ATA / attention pooling | O(H_head·K²) extra | +0.9 pts over last-token | MTEB tasks, LLM embeddings (Pan et al., 31 Aug 2025) |
| PMA / learned pooling | Single attention layer | SOTA code retrieval | C2LLM-7B (Qin et al., 24 Dec 2025) |
| Clustering-based | Only at index time | ≤3% drop for p=3/4 | ColBERT/ColBERT-v2 retrieval (Clavié et al., 2024) |
| Instance-adaptive BSM | Negligible runtime | Maintains accuracy, 45% ↑ TPS | PPT, Vision Transformers (Wu et al., 2023) |
| Landmark pooling | +0.8% tokens | Strong in long contexts | Dense retrieval, MLDR, BEIR (Doshi et al., 29 Jan 2026) |

Empirically, pooling schemes such as ATA or PMA yield modest but consistent improvements (e.g., +0.5–1 pt on MTEB), and pooling by clustering can halve vector storage with virtually no degradation up to moderate pooling ratios (Pan et al., 31 Aug 2025, Qin et al., 24 Dec 2025, Clavié et al., 2024). Adaptive and multiscale pooling operators (e.g., in PoNet, PPT, SPANet) deliver superior cost-accuracy trade-offs and robustness, and enable scaling to longer sequences (Tan et al., 2021, Wu et al., 2023, Yun et al., 2023).

7. Best Practices, Limitations, and Sensitivity Analyses

  • Adaptive weighting schemes (e.g., ATA, focal/landmark tokens, PMA) are preferred over static aggregation for generalization and robustness across sequence lengths and semantic domains (Pan et al., 31 Aug 2025, Doshi et al., 29 Jan 2026).
  • Instance- and content-aware hybrid approaches that select between pooling and pruning, or adapt pooling span and weighting, consistently outperform single-strategy baselines, especially under FLOP or memory constraints (Wu et al., 2023, Marin et al., 2021, Huang et al., 2022).
  • Pooling design directly impacts overfitting, bias towards sequence ends, and ability to encode distributed evidence; excessive pooling (e.g., p≥6 in ColBERT-like retrieval) eventually impairs performance, but p=2 or 3 delivers major efficiency gains at sub-3% loss (Clavié et al., 2024).
  • Token pooling schemes should generally integrate with task-specific downstream needs: compress token sets aggressively only where redundancy is high and information loss is negligible for the application (e.g., shallow layers or redundant patch tokens in vision), while maximizing contextual aggregation in text embeddings, or maximizing discriminative power in sequence classification (Wu et al., 2023, Ko et al., 2022, Pan et al., 31 Aug 2025).

Pooling remains a cornerstone for information aggregation and scalability in neural sequence models. Ongoing developments integrate content-adaptive, spectral, clustering, and cross-modal ideas to maximize both representation quality and efficiency.
