Token Aggregation Module

Updated 23 June 2026

Token aggregation modules are modular components in transformers that condense and fuse token embeddings to reduce redundancy while preserving semantic content.
They employ various strategies such as graph-based message passing, hierarchical clustering, and frequency-domain summarization to balance efficiency and accuracy.
These modules are adaptable to multiple modalities—vision, video, text, point clouds—enabling scalable, inference-optimized models with significant computational speedups.

A token aggregation module is a modular architectural component in transformer-based models whose purpose is to reduce, summarize, or fuse sets of token embeddings, either for memory and compute efficiency or for improved downstream task performance. Such modules are now central in many modalities (vision, video, text, point clouds, and multimodal language/image models) where quadratic attention complexity or redundancy in the raw token space precludes direct scaling. A wide spectrum of aggregation strategies has emerged, including graph-based message passing, sparse/learned slot pooling, hierarchical or clustering-based reductions, frequency-domain summarization, probabilistic or optimal transport fusion, and attention-weighted fusion of rich token-level statistics. These modules may be learnable or parameter-free, static or dynamic, and positioned as architectural blocks or plug-and-play inference accelerators.

1. Core Taxonomy and Principles

Token aggregation modules operate by modifying the flow or number of token representations at key points in a model. Common principles include:

Redundancy Compression: Reducing the number of tokens by merging those that are similar by some metric (e.g., cosine similarity, learned affinity) while attempting to preserve crucial semantic content. Strategies range from parameter-free methods like agglomerative token clustering (Haurum et al., 2024) and temporal-spatial aggregation in video (Ren et al., 2023), to highly structured slot-based summarization (Chen et al., 1 Mar 2026).
Information Preservation: Unlike unstructured pruning, aggregation modules often propagate or accumulate feature information from dropped/merged tokens into those kept, mitigating information loss. This can be explicit, as in graph-based message passing (Jiang et al., 25 Aug 2025), or implicit via soft assignment and weighted pooling (Liu et al., 10 Mar 2025, Zeng et al., 19 May 2026).
Plug-and-Play or Learnable: Aggregation modules may be:
- Plug-and-play/inference-only: Require no retraining, e.g., post-hoc merging based on similarity or importance (Haurum et al., 2024, Jiang et al., 25 Aug 2025).
- Learnable/parameterized: Trained end-to-end to optimize task performance or dynamic token-to-slot routing (Liu et al., 10 Mar 2025, Chen et al., 1 Mar 2026, Zeng et al., 19 May 2026).
Fidelity–Efficiency Trade-off: The goal is to approximate the expressivity of full token calculations with much lower cost, finding an optimal balance between accuracy (e.g., classification top-1, retrieval R@1) and computational resources (FLOPs, latency, memory).
Modality-General and Application-Specific: While the aggregation concept is universal, customizations exist for video (spatiotemporal or bipartite merging), vision (content-aware clusters, multi-scale downsampling), point clouds (token–point cross-attention), and text (multi-token reliability).
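The fidelity–efficiency trade-off can be made concrete with a back-of-envelope cost model. The sketch below is not taken from any of the cited papers: `layer_macs` and `relative_cost` are hypothetical helpers, the multiply-accumulate count ignores softmax, normalization, and biases, and the MLP expansion ratio of 4 is an assumption. It only illustrates that the attention term shrinks quadratically with the number of kept tokens while the projection and MLP terms shrink linearly.

```python
def layer_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates of one transformer layer for one sequence."""
    proj = 4 * num_tokens * dim * dim             # Q, K, V and output projections
    attn = 2 * num_tokens * num_tokens * dim      # QK^T scores plus weighted sum of V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim  # the two linear layers of the MLP
    return proj + attn + mlp


def relative_cost(keep_ratio: float, num_tokens: int, dim: int) -> float:
    """Cost of a layer after token reduction, as a fraction of the full-token cost."""
    kept = max(1, round(keep_ratio * num_tokens))
    return layer_macs(kept, dim) / layer_macs(num_tokens, dim)


if __name__ == "__main__":
    # Hypothetical sizes: 576 visual tokens with a 1024-dimensional embedding.
    for ratio in (1.0, 0.5, 0.25):
        print(ratio, relative_cost(ratio, num_tokens=576, dim=1024))
```

Under this model, keeping half the tokens costs somewhat less than half of the original layer, and the saving grows as the sequence length increases relative to the embedding dimension.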

2. Graph-Based and Attention-Driven Aggregation

Graph-based and attention-weighted aggregation represent a powerful, nonsymmetric token reduction paradigm:

Graph Summarization: VISA (Jiang et al., 25 Aug 2025) aggregates visual tokens in multimodal LLMs by constructing an undirected graph over image tokens, with edges weighted by cosine similarity. Tokens are partitioned into “kept” and “removed” sets. The aggregation operation is

$x^{\text{vis}}_k \leftarrow x^{\text{vis}}_k + \alpha\,(\hat G_A x^{\text{vis}}_r)$

where $\hat G_A$ is the symmetrically normalized adjacency between kept and removed tokens, and $\alpha$ controls aggregation strength. Tokens are selected for retention via text-guided attention averaged over layers and heads.

Hierarchical Clustering: Agglomerative Token Clustering (Haurum et al., 2024) iteratively hard-merges the most similar token pairs (using single, complete, or average linkage) until only a target number of tokens remain. The resulting embeddings are weighted averages of their members, preserving salient features while sharply reducing the quadratic attention cost.
Text-Guided Importance: VISA’s group-wise token selection scores each token’s importance $I_j$ by averaging text-to-visual attention across specific model layers, enabling the system to preserve semantically crucial tokens with respect to the query.
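The update $x_k \leftarrow x_k + \alpha\,\hat G_A x_r$ can be sketched in NumPy as follows. This is a minimal illustration rather than the VISA implementation: the function name is hypothetical, the importance scores are passed in as a plain array (standing in for the text-to-visual attention averages), negative cosine similarities are clipped to zero, and the symmetric normalization is taken to be $D_k^{-1/2} A D_r^{-1/2}$ over the kept-by-removed block. All three choices are assumptions.

```python
import numpy as np


def _unit(x):
    """L2-normalize rows so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)


def aggregate_removed_into_kept(tokens, importance, num_keep, alpha=0.5):
    """Keep the top-`num_keep` tokens by importance and add messages from the rest."""
    order = np.argsort(-importance)                # most important first
    keep_idx, drop_idx = order[:num_keep], order[num_keep:]
    x_k, x_r = tokens[keep_idx], tokens[drop_idx]

    # Cosine-similarity edges between kept and removed tokens, shape (K, R).
    adj = np.clip(_unit(x_k) @ _unit(x_r).T, 0.0, None)  # assumption: drop negative edges

    # Symmetric normalization of the bipartite block (assumed form).
    d_k = adj.sum(axis=1, keepdims=True) + 1e-8
    d_r = adj.sum(axis=0, keepdims=True) + 1e-8
    g_hat = adj / np.sqrt(d_k * d_r)

    return x_k + alpha * (g_hat @ x_r), keep_idx
```

Setting `alpha=0` reduces the operation to plain pruning, which makes the contribution of the aggregation step easy to isolate in an ablation.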

These approaches have demonstrated state-of-the-art efficiency–performance trade-offs in large multimodal models and vision backbones.
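Agglomerative merging with average linkage, as described for Agglomerative Token Clustering, can be illustrated with a deliberately naive loop. The sketch assumes cosine distance and recomputes linkages from scratch at each step, which is far slower than a practical implementation; the function name and return format are hypothetical.

```python
import numpy as np


def agglomerative_merge(tokens, num_keep):
    """Merge the closest clusters (average linkage, cosine distance) until num_keep remain."""
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - unit @ unit.T                     # pairwise cosine distance between tokens
    clusters = [[i] for i in range(len(tokens))]   # every token starts as its own cluster

    while len(clusters) > num_keep:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster token pairs.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Each output token is the mean of its members, i.e. a size-weighted average.
    merged = np.stack([tokens[c].mean(axis=0) for c in clusters])
    return merged, clusters
```

Swapping `.mean()` for `.min()` or `.max()` in the linkage line would give single or complete linkage respectively.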

3. Clustering-/Slot-Based and Content-Aware Aggregation

Slot- and cluster-based modules structure tokens into groups that correspond to similar content or semantic regions:

Content-Aware Token Aggregation (CATA): CATANet (Liu et al., 10 Mar 2025) partitions tokens by cosine similarity to shared learnable centers (content-aware clusters). During training, cluster centers are refined by recursive hard assignment and mean recomputation with exponential moving average updates. At inference, each token is assigned to its most similar center, greatly reducing attention complexity. Subsequent intra-group self-attention and inter-group cross-attention yield global context at a fraction of the original quadratic cost.
Semantic Slot Aggregation: TC-SSA (Chen et al., 1 Mar 2026) addresses extreme-scale token reduction in gigapixel pathology by learning a small set of semantic slots. Patches are routed to slots via a softmax over slot similarities, sparsified by Top-2 assignment (each patch selects its two strongest slots). Weighted pooling and an MLP per slot produce the compact summary. Auxiliary losses regularize slot coverage and avoid collapse. This enables >60× compression with minimal diagnostic accuracy loss.
Optimal Transport and Weighted Aggregation: In large-scale visual place recognition, Weighted Aggregated Descriptor (WeiAD) (Zeng et al., 19 May 2026) obtains patch-to-cluster assignment via entropy-regularized optimal transport (Sinkhorn), then computes tiered, learnable weightings for each cluster, reflecting real-world heterogeneity in spatial/semantic structure.
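The softmax routing with Top-2 sparsification and weighted pooling described for semantic slot aggregation can be sketched as below. This is an illustrative reading, not the TC-SSA code: dot-product similarity, renormalizing the two surviving weights, and the function name are assumptions, and the per-slot MLP and auxiliary losses are omitted.

```python
import numpy as np


def top2_slot_pool(patches, slots):
    """Route each patch to its two strongest slots, then pool patches per slot."""
    logits = patches @ slots.T                     # (N, S) patch-to-slot similarity
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)   # softmax over slots

    # Keep only the two largest routing weights per patch, then renormalize (assumption).
    top2 = np.argsort(-probs, axis=1)[:, :2]
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, top2, 1.0, axis=1)
    weights = probs * mask
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted pooling: each slot is the weighted mean of the patches routed to it.
    mass = weights.sum(axis=0)[:, None]            # (S, 1) total routing weight per slot
    pooled = (weights.T @ patches) / np.maximum(mass, 1e-8)
    return pooled, weights
```

The per-slot `mass` is the quantity a coverage regularizer would typically monitor, since a slot whose mass stays near zero has effectively collapsed.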

These methods facilitate controllable token reduction, provide global context, and are empirically shown to outperform simple windowing or naive sampling.
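For the optimal-transport variant, entropy-regularized assignment is usually computed with Sinkhorn iterations. The following is a generic sketch under stated assumptions, not WeiAD's configuration: both marginals are uniform, the kernel is built from a similarity matrix, and `eps` and `num_iters` are illustrative values.

```python
import numpy as np


def sinkhorn_assign(scores, eps=0.1, num_iters=50):
    """Entropy-regularized transport plan between N patches and C clusters."""
    n, c = scores.shape
    kernel = np.exp((scores - scores.max()) / eps)  # Gibbs kernel from similarity scores
    row_target = np.full(n, 1.0 / n)                # each patch carries equal mass
    col_target = np.full(c, 1.0 / c)                # assumption: clusters receive equal mass
    u, v = np.ones(n), np.ones(c)
    for _ in range(num_iters):
        u = row_target / (kernel @ v)               # rescale rows toward row_target
        v = col_target / (kernel.T @ u)             # rescale columns toward col_target
    return u[:, None] * kernel * v[None, :]         # soft assignment plan, shape (N, C)
```

Compared with a plain row-wise softmax, the column constraint discourages all patches from piling into a single cluster; smaller `eps` gives sharper assignments at the cost of slower convergence.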

4. Domain-Specific Token Aggregation Strategies

Token aggregation modules are tailored to the structural properties and efficiency bottlenecks of particular domains:

Video: Temporal-Spatial Token Aggregation. TESTA (Ren et al., 2023) reduces spatiotemporal redundancy in long-form video by bipartite matching of most-similar frame (temporal) or patch (spatial) pairs, then merging by averaging. Merging only closely related tokens ensures preservation of dynamics and semantics, yielding up to 1.7× speedup and improved retrieval and question-answering accuracy.
Text: Multi-Token Reliability Aggregation. In hallucination detection for vision-LLMs, the Multi-Token Reliability Estimation (MTRE) module (Zollicoffer et al., 16 May 2025) accumulates reliability judgments across the first $T$ tokens via a self-attention-based probe and a cumulative log-likelihood ratio:

$\Lambda^{(T)} = \sum_{\ell=1}^T \log \frac{p_\ell}{1 - p_\ell}$

This captures reliability signals that may only be apparent after observing several tokens, improving AUROC by 9–12 points over single-token methods.
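The cumulative log-likelihood ratio $\Lambda^{(T)}$ is simple to compute once per-token probabilities are available. The sketch below assumes $p_\ell$ is the probe's probability that token $\ell$ is reliable and clamps it away from 0 and 1 to keep the log-odds finite; the clamp value and the function name are illustrative choices, not part of the cited method.

```python
import math


def cumulative_llr(probs, eps=1e-6):
    """Return sum over tokens of log(p / (1 - p)) for the first T token probabilities."""
    total = 0.0
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)   # clamp so the log-odds stay finite
        total += math.log(p / (1.0 - p))
    return total


if __name__ == "__main__":
    # Hypothetical probe outputs for the first six generated tokens.
    print(cumulative_llr([0.6, 0.6, 0.6, 0.6, 0.6, 0.6]))
```

Because log-odds add, several mildly confident tokens can outweigh a single strongly confident one, which is the behavior the multi-token formulation is designed to exploit.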

Point Clouds: Token Representation and Relation Inference. YOGO’s Relation Inference Module (Xu et al., 2021) groups raw points into tokens once via farthest point sampling, runs token self-attention, then projects back to per-point representations via cross-attention, avoiding repeated expensive grouping.

These system-level modules are precisely tuned for efficiency, memory, and semantic preservation under each modality’s constraints.
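The bipartite match-and-average step used for video tokens can be sketched as follows. This is a generic illustration in the spirit of the description of TESTA above, not its implementation: the alternating split into two sets, the cosine metric, and the function name are assumptions, and the output does not preserve the original token order. The same routine would be applied along the frame axis or the patch axis.

```python
import numpy as np


def bipartite_merge(tokens, num_merge):
    """Merge `num_merge` tokens from set A into their most similar partner in set B."""
    tokens = np.asarray(tokens, dtype=float)
    a, b = tokens[0::2], tokens[1::2]              # alternating split (assumption)
    unit_a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    unit_b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = unit_a @ unit_b.T                        # (|A|, |B|) cosine similarity

    best_b = sim.argmax(axis=1)                    # closest B partner for every A token
    order = np.argsort(-sim.max(axis=1))           # A tokens ranked by partner similarity
    merged_a, kept_a = order[:num_merge], order[num_merge:]

    # Average each B token with all A tokens merged into it.
    sums, counts = b.copy(), np.ones(len(b))
    np.add.at(sums, best_b[merged_a], a[merged_a])
    np.add.at(counts, best_b[merged_a], 1.0)
    return np.concatenate([a[kept_a], sums / counts[:, None]])
```

Each call removes exactly `num_merge` tokens, so the reduction schedule across layers can be set directly by choosing that value per layer.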

5. Frequency and Probabilistic/Spectral Domain Aggregation

Aggregation in the frequency or distributional domain adds complementary advantages:

Frequency-Domain Latent Attention Gating: FLaG (Li et al., 6 Jun 2026) applies a real FFT to the sequence of token embeddings, uses learnable queries to cross-attend in frequency space, applies a channel-wise gate, then reconstructs the time-domain tokens for final pooling. The gate tends to preserve low-frequency content (>80% of total sensitivity) but allows sample-specific modulation of higher-frequency bands. Complexity scales as $O(TD\log T + TLD)$.
Spectral Pooling Aggregation Modulation: SPAM (Yun et al., 2023), in SPANet, decomposes features into frequency bands via 2D FFT. Tokens are modulated by learnable spatial-frequency masks balancing low/high pass aggregation. This mechanism achieves a better spectral balance than conventional self-attention and leads to superior classification and segmentation performance.
Probabilistic Token Aggregation: Probabilistic approaches such as ProTA (Fang et al., 2024) represent each token as a Gaussian (mean and variance), and aggregate using 2-Wasserstein distance, producing more robust, diversity-preserving cross-modal alignment for text-video retrieval.
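The frequency-domain pipeline (real FFT over the token axis, channel-wise gating, inverse transform) can be illustrated with a stripped-down sketch. It is not the FLaG module: the learnable latent queries and cross-attention are omitted, and `gate` is a fixed array supplied by the caller as a stand-in for the learned, sample-specific gate. The function name is hypothetical, and the final pooling is left to the caller.

```python
import numpy as np


def frequency_gate(tokens, gate):
    """Gate the token sequence per frequency bin and channel, then return to token space.

    tokens: (T, D) real array; gate: real array broadcastable to (T // 2 + 1, D).
    """
    num_tokens = tokens.shape[0]
    spectrum = np.fft.rfft(tokens, axis=0)          # (T // 2 + 1, D) complex spectrum
    gated = spectrum * gate                         # channel-wise gate on each frequency bin
    return np.fft.irfft(gated, n=num_tokens, axis=0)  # filtered tokens, shape (T, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 4))
    low_pass = np.zeros((5, 4))
    low_pass[:2] = 1.0                              # keep only the two lowest frequency bins
    print(frequency_gate(x, low_pass).shape)
```

A gate of all ones reproduces the input, and a gate that keeps only the zero-frequency bin replaces every token with the sequence mean, which gives two convenient sanity checks for any learned variant.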

These modules leverage the statistical structure of token distributions to boost aggregation expressivity with modest computational overhead.
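For Gaussian token representations with diagonal covariance, the squared 2-Wasserstein distance has the closed form $W_2^2 = \lVert \mu_a - \mu_b \rVert^2 + \lVert \sigma_a - \sigma_b \rVert^2$, where $\sigma$ denotes per-dimension standard deviations. The sketch below computes it pairwise between two token sets. The diagonal-covariance assumption and the function name are illustrative, and how a given method turns this distance matrix into a retrieval score is not covered here.

```python
import numpy as np


def w2_squared(mu_a, sigma_a, mu_b, sigma_b):
    """Pairwise squared 2-Wasserstein distance between diagonal Gaussians.

    mu_a, sigma_a: (N, D); mu_b, sigma_b: (M, D); sigma_* hold standard deviations.
    """
    mean_term = ((mu_a[:, None, :] - mu_b[None, :, :]) ** 2).sum(axis=-1)
    std_term = ((sigma_a[:, None, :] - sigma_b[None, :, :]) ** 2).sum(axis=-1)
    return mean_term + std_term                     # (N, M) distance matrix
```

Unlike a distance between means alone, this quantity also separates two tokens that share a mean but differ in uncertainty.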

6. Training-Free, Learnable, and Flexible Aggregation Designs

The diversity of aggregation designs can be summarized as follows:

Family/Approach	Aggregation Principle	Learnable?	Exemplary Paper(s)
Graph-based (VTA, VISA)	Cosine graph, message passing	No	(Jiang et al., 25 Aug 2025)
Cluster/center (CATA, WeiAD)	Hard/soft cluster assignment, aggregation	Yes	(Liu et al., 10 Mar 2025, Zeng et al., 19 May 2026)
Agglomerative (ATC)	Hierarchical clustering, hard merging	No	(Haurum et al., 2024)
Slot/semantic slot (TC-SSA)	Learnable slots, sparse routing	Yes	(Chen et al., 1 Mar 2026)
Self-attention (MaxPoolBERT, MTRE)	Attention-weighted pooling	Yes	(Behrendt et al., 21 May 2025, Zollicoffer et al., 16 May 2025)
Frequency/spectrum (FLaG, SPAM)	Frequency space summarization	Yes	(Li et al., 6 Jun 2026, Yun et al., 2023)
Token-once grouping (YOGO)	Early global grouping and point-to-token/point mapping	No	(Xu et al., 2021)
Probabilistic/Gaussian (ProTA)	Distribution-based kernel aggregation	Yes	(Fang et al., 2024)

Learnable methods are often trained end-to-end with task loss (and possibly auxiliary regularization for slot coverage, entropy, or cluster balancing), while plug-and-play modules require no retraining and are appealing as post-hoc inference speedups.

Flexible frameworks such as WeiToP (Zeng et al., 19 May 2026) enable dynamic, inference-time token pruning without retraining—allowing the trade-off between accuracy and latency to be modulated on demand, in contrast to classical static compression.

7. Empirical Impact and Task-Specific Evaluations

Token aggregation modules are directly validated on multiple challenging benchmarks and use cases:

Vision-Language and Multimodal: VISA (Jiang et al., 25 Aug 2025) achieves >98% of baseline LLaVA-7B/13B accuracy at 40–60% of the FLOPs, with up to 80% higher throughput on an RTX 3090.
Vision: Super-Resolution: CATANet (Liu et al., 10 Mar 2025) yields up to +0.33 dB PSNR and ≈5× speedup versus cluster-based baselines on Urban100.
Video Understanding: TESTA (Ren et al., 2023) delivers +13.7% R@1 on QuerYD with 75% token reduction.
Text/Language: MaxPoolBERT (Behrendt et al., 21 May 2025) increases GLUE score by +1.25 points (up to +8 on low-resource WNLI) with only a minimal head overhead.
Point Clouds: YOGO RIM (Xu et al., 2021) achieves ≥3× speedup versus PointNet++ while maintaining competitive accuracy.
Pathology Gigapixel Reasoning: TC-SSA (Chen et al., 1 Mar 2026) compresses 10⁵+ patches to 32 slots (1.7% budget), improving accuracy over the best patch sampling baselines by +10.6%.
Visual Place Recognition: WeiToP (Zeng et al., 19 May 2026) halves latency at only ≈5% absolute accuracy loss (recall@1=75% at 50% token retention), outperforming generic token-pruning approaches.

These empirical results demonstrate that token aggregation modules can mitigate the scalability bottlenecks of transformer-based architectures without sacrificing key task performance metrics, and often while improving them.

Token aggregation modules have become a critical enabling technology for efficient, scalable, and semantically robust transformer architectures across modalities. Innovations continue to broaden their flexibility, interpretability, and domain-specific performance, with ongoing research focusing on tighter integration with adaptive attention, information-theoretic objectives, and resource-aware optimization (Jiang et al., 25 Aug 2025, Liu et al., 10 Mar 2025, Ren et al., 2023, Haurum et al., 2024, Zollicoffer et al., 16 May 2025, Li et al., 6 Jun 2026, Behrendt et al., 21 May 2025, Chen et al., 1 Mar 2026, Zeng et al., 19 May 2026, Fang et al., 2024, Yun et al., 2023, Xu et al., 2021).