Adaptive Local-Aware Token Pruning (ALTP)

Updated 7 April 2026

Adaptive Local-Aware Token Pruning (ALTP) is a framework that selectively preserves local object-centric details while pruning redundant tokens to reduce computational demands.
It employs a two-stage process with superpixel-based uniform retention followed by dynamic token allocation driven by local information density and diversity metrics.
Empirical results show ALTP achieves up to 90% token reduction with improved segmentation metrics and inference speed in complex, multi-image scenarios.

Adaptive Local-Aware Token Pruning (ALTP) is a framework for aggressive yet performance-preserving reduction of visual token sets in large multimodal models. Designed specifically for applications where local visual detail is critical, such as Grounded Conversation Generation (GCG), and scenarios featuring long visual contexts or multiple images, ALTP combines local region-aware token retention with adaptive allocation based on information density or token diversity. The approach improves efficiency and accuracy by ensuring that highly informative, object-centric visual features are retained after pruning, while eliminating tokens from redundant or background regions.

1. Motivation and Scope

Vision transformers and large multimodal models encode images into sequences of tokens, and large token counts sharply increase computational and memory demands. In settings like GCG, where models must associate free-form generated text with object-grounded segmentation masks, fine-grained and locally dense token representations are essential for accurate segmentation and semantic grounding. Standard token pruning techniques (e.g., FastV, PyramidDrop) that rely on global cross-attention fail in these domains: recall and AP metrics can deteriorate by up to 50% because local object cues are lost when tokens are dropped indiscriminately (Bai et al., 31 Mar 2025). Similarly, in long-context LMM deployments (multiple, possibly similar images per sequence), naive pruning handles redundancy within and across images poorly, so the retained tokens deliver less utility than the budget allows (Zhang et al., 28 Dec 2025).

ALTP addresses these limitations by adaptively allocating token retention: it first preserves local regions rich in detail, then dynamically assigns a global budget to the most informative regions or images. This ensures that vital object-centric information persists even under extreme global token reduction.

2. Key Concepts and Mathematical Formulation

ALTP's design is guided by two central principles: (1) detail-aware local retention and (2) adaptive budget allocation. In GCG-style tasks (Bai et al., 31 Mar 2025), ALTP uses superpixel segmentation (SLIC) to partition the visual field into coherent regions, allowing it to define region-specific token retention. For multi-image LMM inference (Zhang et al., 28 Dec 2025), ALTP differentiates intra-image (local) and inter-image (global) redundancy.

Superpixel segmentation and local retention: An image is partitioned using SLIC into $K$ superpixels $\{\mathcal S_1,\dots,\mathcal S_K\}$. Each superpixel $\mathcal S_k$ maps to a token subset $\Omega_k$. A minimum quota of tokens per superpixel,

$T_k^{(\mathrm{DDC})} = \lceil r_k \cdot |\Omega_k| \rceil,$

is always retained in region $k$, where $r_k$ is the baseline retention fraction for that region.
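The mapping from superpixels to token subsets and the resulting quota can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a precomputed integer label map (for example from a SLIC routine), assigns each square patch token to the superpixel that covers most of its pixels (a majority-vote rule chosen here for illustration), and uses a single retention fraction `r` for every region. The function names are hypothetical.

```python
import math
import numpy as np

def tokens_per_superpixel(labels, patch):
    """Assign each patch token to the superpixel covering most of its pixels.

    labels: (H, W) integer superpixel label map with values 0..K-1.
    patch:  side length of a square patch; H and W must be multiples of it.
    Returns {k: flat token indices}, an illustrative stand-in for Omega_k.
    """
    h, w = labels.shape
    gh, gw = h // patch, w // patch
    # Rearrange the label map into one row of labels per patch token.
    blocks = (labels.reshape(gh, patch, gw, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(gh * gw, -1))
    k = int(labels.max()) + 1
    # Majority vote (an assumption): the most frequent label inside each patch.
    owner = np.array([np.bincount(b, minlength=k).argmax() for b in blocks])
    return {r: np.flatnonzero(owner == r) for r in range(k)}

def ddc_quota(omega, r):
    """Minimum tokens kept per region: ceil(r * |Omega_k|)."""
    return {k: math.ceil(r * len(idx)) for k, idx in omega.items()}

# Toy 8x8 image with three regions and 2x2 patches (a 4x4 token grid).
labels = np.zeros((8, 8), dtype=int)
labels[:, 4:] = 1      # right half
labels[6:, :] = 2      # bottom strip overrides both halves
omega = tokens_per_superpixel(labels, patch=2)
quota = ddc_quota(omega, r=0.3)
print({k: v.tolist() for k, v in omega.items()})
print(quota)
```

Because of the ceiling, every non-empty region keeps at least one token for any positive `r`, which is what guarantees that small regions are not erased by the pruning step.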

Information density for adaptive allocation: For each superpixel, density is defined as

$d_k = \mathrm{Var}(\mathcal S_k)\sqrt{\frac{|\mathcal P_k|}{|\mathcal P_{\mathrm{total}}|}}$

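where $\mathrm{Var}(\mathcal S_k)$ is the color variance within superpixel $k$, $|\mathcal P_k|$ is its pixel count, and $|\mathcal P_{\mathrm{total}}|$ is the total pixel count of the image. Weights $w_k$ are obtained via a tempered softmax: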

$w_k = \frac{\exp(d_k/(\alpha \max_j d_j))}{\sum_{j=1}^K \exp(d_j/(\alpha \max_j d_j))}$

where $\alpha$ is the softmax scale. Each superpixel $\mathcal S_k$ is then assigned a token budget proportional to its weight $w_k$, given the global keep-ratio and the initial token count.
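The density, weighting, and budget steps can be sketched as follows. This is a hedged illustration under stated assumptions: "color variance" is taken to be the per-channel variance averaged over channels, and the per-region budget is obtained by splitting `keep_ratio * n_tokens` in proportion to $w_k$ with largest-remainder rounding. Both choices, and the names `keep_ratio` and `n_tokens`, are assumptions made for this example rather than details confirmed by the source.

```python
import numpy as np

def region_density(image, labels):
    """d_k = Var(S_k) * sqrt(|P_k| / |P_total|) for every superpixel k."""
    k = int(labels.max()) + 1
    d = np.zeros(k)
    for r in range(k):
        pix = image[labels == r]                 # (n_pixels, channels)
        if len(pix) == 0:
            continue
        # Assumption: color variance = mean of the per-channel variances.
        d[r] = pix.var(axis=0).mean() * np.sqrt(len(pix) / labels.size)
    return d

def tempered_softmax(d, alpha):
    """w_k = softmax(d_k / (alpha * max_j d_j))."""
    if d.max() <= 0:                             # flat image: uniform weights
        return np.full(len(d), 1.0 / len(d))
    z = d / (alpha * d.max())
    e = np.exp(z - z.max())                      # shift for numerical stability
    return e / e.sum()

def allocate_budget(w, keep_ratio, n_tokens):
    """Split the global budget proportionally to w (hypothetical rounding rule)."""
    total = int(round(keep_ratio * n_tokens))
    share = w * total
    budget = np.floor(share).astype(int)
    leftover = int(total - budget.sum())
    # Hand the remaining tokens to the regions with the largest fractional part.
    order = np.argsort(-(share - budget))
    budget[order[:leftover]] += 1
    return budget

# Toy image: flat left half (region 0), textured right half (region 1).
rng = np.random.default_rng(0)
image = np.zeros((4, 4, 3))
image[:, 2:] = rng.random((4, 2, 3))
labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1

d = region_density(image, labels)
w = tempered_softmax(d, alpha=0.5)
budget = allocate_budget(w, keep_ratio=0.5, n_tokens=16)
print(d, w, budget)
```

Dividing by $\alpha \max_j d_j$ makes the weights invariant to the overall scale of the densities; a smaller $\alpha$ concentrates the budget on the densest regions, while a larger $\alpha$ moves the allocation toward uniform.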

Intra- and inter-image diversity: For multi-image input, intra-image token diversity is measured by the average pairwise cosine distance among the tokens of an image, averaged across images. Inter-image variation compares position-aligned tokens across images, and the global mean is the average of these values. The per-image token budget is then allocated dynamically from these two quantities.
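As a concrete reading of the two diversity measures, the sketch below computes the mean pairwise cosine distance within one image and a position-aligned cosine distance between two images. The exact normalization and the pairing of images used in the cited work are not recoverable from this text, so the definitions here are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def _unit(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mean_pairwise_cosine_distance(tokens):
    """Average of 1 - cos(x_a, x_b) over all unordered pairs a < b."""
    x = _unit(tokens)
    sim = x @ x.T
    upper = np.triu_indices(len(x), k=1)         # each pair counted once
    return float((1.0 - sim[upper]).mean())

def intra_image_diversity(images):
    """Mean of the per-image pairwise cosine distances."""
    return float(np.mean([mean_pairwise_cosine_distance(t) for t in images]))

def aligned_cosine_distance(a, b):
    """Assumed inter-image variation: compare tokens at the same position."""
    return float((1.0 - (_unit(a) * _unit(b)).sum(axis=1)).mean())

img_a = np.array([[1.0, 0.0], [0.0, 1.0]])       # orthogonal tokens
img_b = np.array([[1.0, 0.0], [2.0, 0.0]])       # parallel tokens

print(mean_pairwise_cosine_distance(img_a))      # orthogonal pair
print(mean_pairwise_cosine_distance(img_b))      # redundant pair
print(intra_image_diversity([img_a, img_b]))
print(aligned_cosine_distance(img_a, img_b))
```

An image whose tokens are nearly parallel scores close to zero and is a natural candidate for a smaller share of the budget, while near-duplicate images yield a small aligned distance.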

3. Two-Stage Pruning Algorithms

ALTP uses a multi-stage pipeline to maximize the utility of pruned representations.

3.1 Local Detail Preservation (GCG Focus)

Stage 1 (Detail Density Capture, DDC): Uniformly retains a baseline fraction of tokens per superpixel to guarantee object-relevant detail is present.
Stage 2 (Dynamic Density Formation, DDF): Re-allocates the overall token budget proportionally to the useful detail (area-normalized color variance) found in each superpixel, counteracting over-concentration or under-utilization due to varying object sizes or scene layouts.

Local token selection within each region is based on the magnitude of local cross- or self-attention, maximizing retention of salient visual features.
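Within-region selection can be sketched as a per-region top-k over an attention-derived score. The example assumes that a per-token saliency score has already been computed (for instance the attention mass each visual token receives); how that score is obtained from a particular backbone is outside this sketch, and the function and argument names are hypothetical.

```python
import numpy as np

def select_by_attention(omega, scores, budget):
    """Keep the budget[k] highest-scoring tokens inside each region k.

    omega:  {k: array of token indices belonging to region k}
    scores: (N,) per-token saliency, e.g. received attention (assumed given)
    budget: {k: number of tokens to keep in region k}
    """
    kept = []
    for k, idx in omega.items():
        n_keep = min(budget.get(k, 0), len(idx))
        if n_keep == 0:
            continue
        # Sort this region's tokens by descending score and keep the top ones.
        kept.append(idx[np.argsort(-scores[idx])[:n_keep]])
    if not kept:
        return np.array([], dtype=int)
    return np.sort(np.concatenate(kept))         # restore spatial order

omega = {0: np.array([0, 1, 2]), 1: np.array([3, 4])}
scores = np.array([0.1, 0.9, 0.5, 0.2, 0.8])
kept = select_by_attention(omega, scores, budget={0: 2, 1: 1})
print(kept.tolist())
```

Sorting the kept indices at the end preserves the original token order, which matters when the downstream model relies on positional information.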

3.2 Multi-Image Adaptive Pruning (Long Context LMMs)

Stage 1 (Greedy Intra-Image Selection): For each image and its assigned budget, tokens are greedily selected to maximize pairwise dispersion in embedding space, producing a subset with maximal diversity per image.
Stage 2 (Global Diversity and Pareto Filtering): All per-image outputs are aggregated. A second round of greedy selection produces a globally diverse candidate pool. Each token's utility is then jointly assessed via:
- Inter-image diversity: mean dispersion with respect to the other tokens in the pool.
- Text alignment: negative squared distance to the set of text-token embeddings.

A Pareto non-domination criterion then selects the final set: the tokens that no other candidate dominates in both diversity and alignment.
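The two stages can be illustrated with a farthest-point greedy selection followed by a Pareto filter. The sketch makes several assumptions that the text does not pin down: dispersion is measured with cosine distance, the greedy pass is seeded with the first token, and "distance to the set of text-token embeddings" is read as the distance to the nearest text token. All function names are hypothetical.

```python
import numpy as np

def cosine_distance_matrix(x):
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - x @ x.T

def greedy_diverse_subset(tokens, budget):
    """Farthest-point greedy: repeatedly add the token farthest from the chosen set."""
    dist = cosine_distance_matrix(tokens)
    chosen = [0]                                  # assumption: seed with token 0
    nearest = dist[0].copy()                      # distance to closest chosen token
    while len(chosen) < min(budget, len(tokens)):
        nearest[chosen] = -np.inf                 # never re-pick a chosen token
        nxt = int(np.argmax(nearest))
        chosen.append(nxt)
        nearest = np.minimum(nearest, dist[nxt])
    return np.array(chosen)

def token_utilities(pool, text):
    """Per-token (diversity, alignment) scores for a candidate pool."""
    dist = cosine_distance_matrix(pool)
    diversity = dist.sum(axis=1) / max(len(pool) - 1, 1)
    sq = ((pool[:, None, :] - text[None, :, :]) ** 2).sum(axis=-1)
    alignment = -sq.min(axis=1)                   # assumption: nearest text token
    return diversity, alignment

def pareto_front(diversity, alignment):
    """Indices of points that no other point dominates in both objectives."""
    keep = []
    for i in range(len(diversity)):
        no_worse = (diversity >= diversity[i]) & (alignment >= alignment[i])
        better = (diversity > diversity[i]) | (alignment > alignment[i])
        if not np.any(no_worse & better):
            keep.append(i)
    return np.array(keep, dtype=int)

tokens = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]])
picked = greedy_diverse_subset(tokens, budget=3)  # skips the near-duplicate token 1
text = np.array([[0.0, 1.0]])
diversity, alignment = token_utilities(tokens[picked], text)
final = picked[pareto_front(diversity, alignment)]
print(picked.tolist(), final.tolist())
```

The Pareto step avoids committing to a fixed weighting between diversity and text alignment: a token survives if it is the best available trade-off along at least one direction.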

4. Implementation Details and Hyperparameters

Backbone compatibility: ALTP integrates flexibly with existing vision-language backbones. In GLaMM, it sits between the CLIP-ViT encoder and the downstream SAM-like decoder; in OMG-LLaVA, it prunes tokens before the cross-attention and segmentation output heads (Bai et al., 31 Mar 2025). For multi-image LMMs, ALTP prunes within the visual encoder pipeline before feeding pruned representations to the LLM (Zhang et al., 28 Dec 2025).
Pruning hyperparameters:
- Global keep-ratio: the fraction of visual tokens retained overall.
- Superpixel count $K$, SLIC compactness, and softmax scale $\alpha$.
- The long-context variant has its own set of hyperparameters (Zhang et al., 28 Dec 2025).
Computational aspects:
- SLIC segmentation overhead is typically negligible compared to transformer FLOPs, especially with GPU implementations.
- The cost of the two-stage selection splits into a greedy pass per image and a global Pareto sort.
- FLOPs reduction follows from the layerwise vision transformer cost and the pruning location: layers before the pruning location process the full token set, while later layers process only the retained tokens.
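A back-of-the-envelope estimate of the savings can be sketched as follows. The per-layer cost used here, $4nd^2 + 2n^2d + 2ndm$ for $n$ tokens, hidden size $d$, and feed-forward width $m$, is a commonly used approximation and is an assumption of this example; it is not necessarily the expression used in the cited papers. The model dimensions in the usage lines are illustrative placeholders.

```python
def layer_flops(n, d, m):
    """Approximate FLOPs of one transformer layer on n tokens.

    Assumed estimate: 4*n*d^2 (attention projections) + 2*n^2*d (attention
    scores and weighted sum) + 2*n*d*m (feed-forward block).
    """
    return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * m

def flops_reduction(n, n_kept, d, m, num_layers, prune_layer):
    """Fraction of FLOPs saved when layers >= prune_layer see only n_kept tokens."""
    full = num_layers * layer_flops(n, d, m)
    pruned = (prune_layer * layer_flops(n, d, m)
              + (num_layers - prune_layer) * layer_flops(n_kept, d, m))
    return 1.0 - pruned / full

# Illustrative placeholder dimensions, not values taken from the source.
saving = flops_reduction(n=576, n_kept=58, d=4096, m=11008,
                         num_layers=32, prune_layer=2)
early = flops_reduction(576, 58, 4096, 11008, num_layers=32, prune_layer=0)
late = flops_reduction(576, 58, 4096, 11008, num_layers=32, prune_layer=16)
print(f"{saving:.3f} {early:.3f} {late:.3f}")
```

The estimate makes the trade-off explicit: pruning earlier saves more compute, and the quadratic attention term means the saving grows slightly faster than the fraction of tokens removed.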

5. Empirical Evaluation and Performance

ALTP has demonstrated strong empirical results under extreme pruning conditions:

GLaMM on GranDf: At the same reduced token count, ALTP attains higher AP50 than both FastV and PyramidDrop; its mIOU also exceeds that of PyramidDrop, and its Recall is 5.0 pp higher (Bai et al., 31 Mar 2025).
OMG-LLaVA on GranDf: Under token pruning, ALTP improves AP50, mIOU, and Recall over PyramidDrop.
Ablation: DDC alone is competitive on GLaMM, but adding DDF confers additional gains in both AP50 and mIOU.
Efficiency: For multi-image transformer models, ALTP maintains equivalent or superior prefill latencies (e.g., 809 ms vs. 810 ms for DivPrune) while reducing memory usage through the lower token count (Zhang et al., 28 Dec 2025).
Robustness: Even at low token retention rates, ALTP is reported to retain most of the baseline performance on ALFRED and other benchmarks.

6. Insights, Limitations, and Prospects

ALTP accelerates inference without degrading segmentation fidelity and language grounding, and frequently improves them. By enforcing regional equilibria through DDC and dynamically assigning tokens via DDF, ALTP balances aggressive FLOPs reductions with retention of semantically critical details (Bai et al., 31 Mar 2025). High superpixel granularity (a large $K$) is consistently effective; with too few regions, the approach degenerates to non-local pruning.

Limitations include the extra segmentation step, which adds minor computation but is not rate-limiting relative to transformer inference. Scenes densely packed with many small objects may require tuning the superpixel count upward. Current ALTP implementations do not incorporate textual prompts for region weighting; prompt-aware pruning is a direction for future research.

For long-context LMMs, ALTP’s intra-/inter-redundancy decomposition makes it especially suitable for tasks with multiple, potentially repetitive images, dynamically matching the token budget with the content's novelty and saliency (Zhang et al., 28 Dec 2025).

7. Relationship to Prior Art

ALTP distinguishes itself from earlier token pruning approaches by explicitly prioritizing object-centric and locally dense visual regions, in contrast to the global cross-attention-based deletion used by FastV, PyramidDrop, and similar methods. Whereas these global schemes often introduce catastrophic failures in tasks demanding spatial grounding, ALTP systematically recovers or surpasses baseline accuracy even under aggressive token pruning, thereby enabling faster inference with improved segmentation and grounding performance. The combination of local region quotas (DDC), adaptive density-driven token allocation (DDF), and multi-stage greedy/Pareto maximization is not present in previous pruning pipelines (Bai et al., 31 Mar 2025, Zhang et al., 28 Dec 2025).

A plausible implication is that similar local-aware, adaptively density-weighted token pruning strategies may be beneficial across a broader spectrum of multimodal and vision-language tasks, particularly in domains where fine-grained spatial reasoning and long-context reasoning co-occur.

References

Bai et al. (2025). Local Information Matters: Inference Acceleration For Grounded Conversation Generation Models Through Adaptive Local-Aware Token Pruning.

Zhang et al. (2025). TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts.
