Adaptive Local-Aware Token Pruning (ALTP)
- Adaptive Local-Aware Token Pruning (ALTP) is a framework that selectively preserves local object-centric details while pruning redundant tokens to reduce computational demands.
- It employs a two-stage process with superpixel-based uniform retention followed by dynamic token allocation driven by local information density and diversity metrics.
- Empirical results show ALTP achieves up to 90% token reduction with improved segmentation metrics and inference speed in complex, multi-image scenarios.
Adaptive Local-Aware Token Pruning (ALTP) is a framework for aggressive yet performance-preserving reduction of visual token sets in large multimodal models. Designed specifically for applications where local visual detail is critical, such as Grounded Conversation Generation (GCG), and scenarios featuring long visual contexts or multiple images, ALTP combines local region-aware token retention with adaptive allocation based on information density or token diversity. The approach improves efficiency and accuracy by ensuring that highly informative, object-centric visual features are retained after pruning, while eliminating tokens from redundant or background regions.
1. Motivation and Scope
Vision transformers and large multimodal models encode images into sequences of tokens; as token counts grow, computational and memory demands rise sharply. In settings like GCG, where models must associate free-form generated text with object-grounded segmentation masks, fine-grained and locally dense token representations are essential for accurate segmentation and semantic grounding. Standard token pruning techniques (e.g., FastV, PyramidDrop) that rely on global cross-attention fail in these domains: recall and AP metrics can deteriorate by up to 50% because local object cues are lost when tokens are dropped indiscriminately (Bai et al., 31 Mar 2025). Similarly, in long-context LMM deployments (multiple, possibly similar images per sequence), naive pruning leaves redundancy both within and across images, so the retained token budget is used inefficiently (Zhang et al., 28 Dec 2025).
ALTP addresses these limitations by adaptively allocating token retention—first preserving local regions rich in detail, then dynamically assigning a global budget to the most informative regions or images. This ensures that vital object-centric information persists even under extreme global token reduction.
2. Key Concepts and Mathematical Formulation
ALTP's design is guided by two central principles: (1) detail-aware local retention and (2) adaptive budget allocation. In GCG-style tasks (Bai et al., 31 Mar 2025), ALTP uses superpixel segmentation (SLIC) to partition the visual field into coherent regions, allowing it to define region-specific token retention. For multi-image LMM inference (Zhang et al., 28 Dec 2025), ALTP differentiates intra-image (local) and inter-image (global) redundancy.
- Superpixel segmentation and local retention: An image is partitioned using SLIC into superpixels $\{S_k\}_{k=1}^{K}$. Each superpixel $S_k$ maps to a token subset $T_k$. A minimum quota of tokens per superpixel is always retained in region $S_k$.
- Information density for adaptive allocation: For each superpixel, density is defined as the area-normalized color variance,
$$d_k = \frac{\sigma_k^2}{|S_k|},$$
where $\sigma_k^2$ is the color variance and $|S_k|$ counts pixels. Weights are obtained via tempered softmax:
$$w_k = \frac{\exp(d_k/\tau)}{\sum_{j=1}^{K} \exp(d_j/\tau)},$$
with $\tau > 0$. Each $S_k$ is then assigned a token budget
$$n_k = w_k \, \rho \, N$$
for $\rho$ the global keep-ratio and $N$ the initial token count.
- Intra- and inter-image diversity: For multi-image input, intra-image token diversity is measured by the average pairwise cosine distance between the tokens of an image, averaged across images. Inter-image variation compares position-aligned tokens across images, with the global mean taken as their average. Token budgets are then allocated dynamically to each image based on these diversity measures.
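The density-weighted allocation described above can be sketched as follows. This is a minimal illustration, assuming density is color variance divided by pixel count and that a region's budget is its softmax weight times the global budget. The function name, the `min_quota` floor, and the rounding rule are hypothetical choices and are not taken from the source.

```python
import math

def density_budgets(variances, areas, keep_ratio, total_tokens,
                    tau=1.0, min_quota=1):
    """Split a global token budget across superpixels by information density.

    variances[k] : color variance of superpixel k
    areas[k]     : pixel count of superpixel k
    keep_ratio   : global fraction of tokens to keep
    total_tokens : number of visual tokens before pruning
    """
    # Assumed density: area-normalized color variance.
    dens = [v / a for v, a in zip(variances, areas)]
    # Tempered softmax; subtracting the max keeps exp() numerically stable.
    top = max(dens)
    exps = [math.exp((d - top) / tau) for d in dens]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    budget = round(keep_ratio * total_tokens)
    # Each region receives its weighted share, rounded down, but never
    # less than the minimum quota (hypothetical floor for illustration).
    return [max(min_quota, math.floor(w * budget)) for w in weights]
```

Because shares are rounded down and the floor can lift small regions, the per-region budgets may not sum exactly to the global budget; a real implementation would need a rule for redistributing the remainder. A smaller `tau` concentrates the budget on the densest regions, while a larger `tau` approaches uniform allocation.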
3. Two-Stage Pruning Algorithms
ALTP uses a multi-stage pipeline to maximize the utility of pruned representations.
3.1 Local Detail Preservation (GCG Focus)
- Stage 1 (Detail Density Capture, DDC): Uniformly retains a baseline fraction of tokens per superpixel to guarantee object-relevant detail is present.
- Stage 2 (Dynamic Density Formation, DDF): Re-allocates the overall token budget proportionally to the useful detail (area-normalized color variance) found in each superpixel, counteracting over-concentration or under-utilization due to varying object sizes or scene layouts.
Local token selection within each region is based on the magnitude of local cross- or self-attention, maximizing retention of salient visual features.
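A minimal sketch of how the two stages might combine inside each region, assuming every token has already been mapped to a superpixel id and carries a scalar attention score. The function `select_tokens`, its arguments, and the `min_quota` default are hypothetical names introduced for illustration.

```python
def select_tokens(region_of, scores, budgets, min_quota=1):
    """Return the sorted indices of tokens kept after region-wise pruning.

    region_of[i] : superpixel id of token i
    scores[i]    : attention-based saliency of token i
    budgets[r]   : tokens allotted to region r by the adaptive stage
    """
    # Group token indices by the region they belong to.
    by_region = {}
    for idx, region in enumerate(region_of):
        by_region.setdefault(region, []).append(idx)

    kept = []
    for region, idxs in by_region.items():
        # The uniform stage guarantees min_quota tokens per region;
        # the adaptive stage can only raise that number.
        k = min(len(idxs), max(min_quota, budgets.get(region, 0)))
        # Inside a region, keep the k tokens with the largest scores.
        ranked = sorted(idxs, key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:k])
    return sorted(kept)
```

Returning the indices in sorted order preserves the original token sequence, which matters when positional information is attached to token positions.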
3.2 Multi-Image Adaptive Pruning (Long Context LMMs)
- Stage 1 (Greedy Intra-Image Selection): For each image and its assigned budget, tokens are greedily selected to maximize pairwise dispersion under cosine distance, producing a maximally diverse subset per image.
- Stage 2 (Global Diversity and Pareto Filtering): All per-image outputs are aggregated. A second round of greedy selection produces a globally diverse candidate pool. Each token's utility is then jointly assessed via:
- Inter-image diversity: mean dispersion with respect to the other tokens in the pool.
- Text alignment: negative squared distance to the set of text-token embeddings.
A Pareto non-domination criterion then selects the final set: a token is kept if no other candidate dominates it on both diversity and alignment.
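The greedy selection and Pareto filtering steps can be sketched as below. This is an illustrative version under stated assumptions: greedy selection is implemented as farthest-point selection under cosine distance with an arbitrary seed token, embeddings are non-zero vectors, and dominance uses the standard definition (at least as good on both objectives and strictly better on one). All function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def greedy_diverse(tokens, k):
    """Pick up to k token indices, each time adding the token that is
    farthest from the already chosen set."""
    if k <= 0 or not tokens:
        return []
    chosen = [0]  # arbitrary seed: the first token
    # min_dist[i] = distance from token i to its nearest chosen token
    min_dist = [cosine_distance(t, tokens[0]) for t in tokens]
    while len(chosen) < min(k, len(tokens)):
        candidates = (i for i in range(len(tokens)) if i not in chosen)
        nxt = max(candidates, key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, t in enumerate(tokens):
            min_dist[i] = min(min_dist[i], cosine_distance(t, tokens[nxt]))
    return chosen

def pareto_front(diversity, alignment):
    """Indices of candidates not dominated on (diversity, alignment).
    Larger is better for both objectives."""
    keep = []
    for i in range(len(diversity)):
        dominated = any(
            diversity[j] >= diversity[i]
            and alignment[j] >= alignment[i]
            and (diversity[j] > diversity[i] or alignment[j] > alignment[i])
            for j in range(len(diversity)) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep
```

Both routines are quadratic in the number of candidates, which is acceptable for a sketch; a batched tensor implementation would be the natural choice at realistic token counts.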
4. Implementation Details and Hyperparameters
Backbone compatibility: ALTP integrates flexibly with existing vision-language backbones. In GLaMM, it sits between the CLIP-ViT encoder and the downstream SAM-like decoder; in OMG-LLaVA, it prunes tokens before the cross-attention and segmentation output heads (Bai et al., 31 Mar 2025). For multi-image LMMs, ALTP prunes within the visual encoder pipeline before feeding pruned representations to the LLM (Zhang et al., 28 Dec 2025).
Pruning hyperparameters:
- Global keep-ratio 5: experiments use 6–7.
- Superpixel counts 8, compactness 9, softmax scale 0.
- For long-context variant: 1, 2, 3, 4.
Computational aspects:
- SLIC segmentation overhead is typically negligible compared to transformer FLOPs, especially with GPU implementations.
- The two-stage greedy selection algorithm runs in 5 per image and 6 for global Pareto sorting.
- FLOPs reduction follows analytical form:
7
(layerwise vision transformer cost 8; 9 is the pruning location).
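As a rough illustration of this kind of accounting, the sketch below estimates the fraction of transformer FLOPs saved when tokens are pruned at a given layer. It assumes the per-layer cost estimate commonly used in the token-pruning literature, 4nd² + 2n²d + 2ndm for n tokens, hidden size d, and feed-forward width m. This cost model and all names are assumptions for illustration and may differ from the expression used in the cited work.

```python
def layer_flops(n, d, m):
    """Assumed per-layer cost for n tokens, hidden size d, FFN width m."""
    return 4 * n * d * d + 2 * n * n * d + 2 * n * d * m

def flops_reduction(n, n_kept, d, m, num_layers, prune_layer):
    """Fraction of FLOPs saved when pruning from n to n_kept tokens.

    Layers before prune_layer see all n tokens; the remaining
    layers see only n_kept tokens.
    """
    full = num_layers * layer_flops(n, d, m)
    pruned = (prune_layer * layer_flops(n, d, m)
              + (num_layers - prune_layer) * layer_flops(n_kept, d, m))
    return 1.0 - pruned / full
```

Under this model, pruning earlier in the stack always saves more compute, which is why the pruning location appears in the expression alongside the keep-ratio.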
5. Empirical Evaluation and Performance
ALTP has demonstrated strong empirical results under extreme pruning conditions:
- GLaMM on GranDf: Reducing 0 to 1 tokens (2 pruning) improves AP50 from 3 (FastV), 4 (PyramidDrop) to 5 (ALTP). mIOU rises to 6 (7 pp over PyramidDrop) and Recall to 8 (+5.0 pp) (Bai et al., 31 Mar 2025).
- OMG-LLaVA on GranDf: Pruning from 9 to 0 tokens achieves AP50 1 (2 pp over PDrop), mIOU 3 (4 pp), Recall 5 (6 pp).
- Ablation: DDC alone is competitive (e.g., GLaMM, AP50 7), but adding DDF confers additional gains (8 pp AP50, 9 pp mIOU).
- Efficiency: For multi-image transformer models, ALTP maintains equivalent or superior prefill latencies (e.g., 809 ms vs. 810 ms for DivPrune at batch size 0–1 images) while reducing memory usage by 2 via 3 token count reduction (Zhang et al., 28 Dec 2025).
- Robustness: Even at a 4 token retention rate, ALTP maintains over 5 of the baseline performance on ALFRED and other benchmarks.
6. Insights, Limitations, and Prospects
ALTP accelerates inference without degrading segmentation fidelity or language grounding, and frequently improves both. By enforcing regional equilibria through DDC and dynamically assigning tokens via DDF, ALTP balances aggressive FLOPs reductions with retention of semantically critical details (Bai et al., 31 Mar 2025). High superpixel granularity (6) is consistently effective; with too few regions, the approach degenerates to non-local pruning.
Limitations include the extra segmentation step, which introduces minor extra computation but is not rate-limiting relative to transformer inference. Scenes densely packed with many small objects may require tuning 7 upward. Current ALTP implementations do not incorporate textual prompts for region weighting; prompt-aware pruning is a direction for future research.
For long-context LMMs, ALTP’s intra-/inter-redundancy decomposition makes it especially suitable for tasks with multiple, potentially repetitive images, dynamically matching the token budget with the content's novelty and saliency (Zhang et al., 28 Dec 2025).
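One way to picture matching the budget to content novelty is to score each image by the mean pairwise cosine distance of its tokens and split the global budget in proportion to those scores. The proportional rule, the even-split fallback, and the function names below are assumptions made for illustration; the source does not specify the allocation rule in this form.

```python
import math

def mean_pairwise_cosine_distance(tokens):
    """Average of 1 - cosine similarity over all token pairs of one image.
    Assumes non-zero token vectors."""
    n = len(tokens)
    if n < 2:
        return 0.0
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    total = 0.0
    for a in range(n):
        for b in range(a + 1, n):
            dot = sum(x * y for x, y in zip(tokens[a], tokens[b]))
            total += 1.0 - dot / (norms[a] * norms[b])
    return total / (n * (n - 1) / 2)

def split_budget(images, total_budget):
    """Give each image a share of the budget proportional to its diversity
    (hypothetical rule). Shares are rounded down."""
    scores = [mean_pairwise_cosine_distance(tokens) for tokens in images]
    norm = sum(scores)
    if norm == 0:
        # No measurable diversity anywhere: fall back to an even split.
        return [total_budget // len(images)] * len(images)
    return [int(total_budget * s / norm) for s in scores]
```

With this rule, an image whose tokens are near-duplicates of each other receives few tokens, while an image with varied content receives most of the budget.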
7. Relationship to Prior Art
ALTP distinguishes itself from earlier token pruning approaches by explicitly prioritizing object-centric and locally dense visual regions, in contrast to global cross-attention-based deletion used by FastV, PyramidDrop, and similar methods. Whereas these global schemes often introduce catastrophic failures in tasks demanding spatial grounding, ALTP systematically recovers or surpasses baseline accuracy, even under 8–9 token pruning, thereby enabling up to 0 inference speedup with improved segmentation and grounding performance. The combination of local region quotas (DDC), adaptive density-driven token allocation (DDF), and multi-stage greedy/Pareto maximization is not present in previous pruning pipelines (Bai et al., 31 Mar 2025, Zhang et al., 28 Dec 2025).
A plausible implication is that similar local-aware, adaptively density-weighted token pruning strategies may be beneficial across a broader spectrum of multimodal and vision-language tasks, particularly in domains where fine-grained spatial reasoning and long-context reasoning co-occur.