STAR: Stage-wise Attention-Guided Reduction
- The paper introduces STAR, a training-free, two-stage attention-guided framework that efficiently prunes redundant visual tokens in large vision-language models.
- STAR employs early self-attention pruning followed by cross-attention pruning to discard irrelevant tokens while preserving crucial task-specific information.
- Empirical evaluations on VQA benchmarks indicate that STAR achieves up to 29% FLOP reduction with less than 1% accuracy drop, enabling effective deployment in latency-sensitive environments.
Stage-Wise Attention-Guided Reduction (STAR) is a training-free, plug-and-play framework for efficient inference in large vision-language models (LVLMs). LVLMs such as LLaVA combine a Vision Transformer (ViT)-style visual encoder with a Transformer-based LLM decoder, where high-resolution images produce hundreds to thousands of visual tokens. These tokens introduce significant computational overhead during inference, yet a substantial portion of them is either redundant or irrelevant to the multimodal task at hand. STAR addresses this challenge with a global, two-stage attention-guided token pruning procedure that reduces computational requirements while preserving or even enhancing downstream task fidelity (Guo et al., 18 May 2025).
1. Motivation and High-Level Approach
Traditional training-free token pruning in LVLMs has relied on single-stage strategies, operating either just after vision encoding (via self-attention) or at the cross-modal interface (via cross-attention). Such local perspectives frequently lead to suboptimal information flow and substantial performance degradation under high pruning ratios. STAR introduces a two-stage, global approach:
- Stage 1: Early pruning, immediately following the vision encoder, removes redundant, low-level visual tokens via visual self-attention analysis.
- Stage 2: At an intermediate decoder layer, further aggressive pruning is performed, this time guided by cross-attention between the surviving visual tokens and the text context (prompt plus partially generated response), thus discarding task-irrelevant tokens.
This holistic reduction scheme allows STAR to minimize FLOPs, memory consumption, and latency while retaining accuracy, maintaining robust performance even with only 5–10% of original visual tokens.
2. Formal Definitions and Pruning Mechanism
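Given $L_v$ initial visual tokens with embeddings $Z_v$, and text context embeddings $H_q$ (prompt) together with the embeddings of the partial response, token reduction unfolds in two attention-guided stages:
Stage 1 (Self-Attention Pruning):
- Compute the ViT self-attention map: $A = \mathrm{Softmax}\left(H_v H_v^{\top} / \sqrt{d}\right) \in \mathbb{R}^{L_v \times L_v}$, where $H_v = \mathrm{Project}(Z_v)$ has shape $L_v \times d$.
- Self-attention importance for token $i$: $I_{\text{self}}(i) = \frac{1}{L_v} \sum_{j=1}^{L_v} A_{i,j}$
- For a reduction ratio $R$, retain the top $\lceil (1-R)\,L_v \rceil$ tokens according to $I_{\text{self}}$.
- Project the surviving tokens $Z_v'$ to $H_v' = g(Z_v')$ for input to the LLM decoder.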
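The Stage 1 steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `self_attention_prune`, the random features, and the sizes are hypothetical. One assumption matters for correctness: the softmax is normalized over the first index, so that summing row i measures how much attention token i receives. With a row-wise softmax every row would sum to 1 and all importance scores would be identical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention_prune(h_v, reduction_ratio):
    """Keep the top ceil((1 - R) * L_v) tokens ranked by self-attention importance.

    h_v: (L_v, d) array of visual token features.
    Returns (kept indices in ascending order, importance scores).
    """
    num_tokens, dim = h_v.shape
    scores = h_v @ h_v.T / np.sqrt(dim)
    # Assumption: normalize over axis 0, so attn[i, j] is the share of
    # token j's attention that lands on token i.
    attn = softmax(scores, axis=0)
    importance = attn.mean(axis=1)  # I_self[i] = (1 / L_v) * sum_j attn[i, j]
    # Round before ceil so float error (e.g. 3.0000000000000004) cannot add a token.
    num_keep = int(np.ceil(round((1.0 - reduction_ratio) * num_tokens, 9)))
    top = np.argsort(-importance)[:num_keep]
    return np.sort(top), importance


# Hypothetical toy input: 16 visual tokens with 8-dimensional features.
rng = np.random.default_rng(0)
h_v = rng.normal(size=(16, 8))
keep_idx, importance = self_attention_prune(h_v, reduction_ratio=0.5)
z_kept = h_v[keep_idx]
print(keep_idx.shape, z_kept.shape)  # (8,) (8, 8)
```

Sorting the kept indices is a choice made here so that survivors stay in their original spatial order; the source does not say whether the order is preserved.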
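Stage 2 (Visual–Textual Cross-Attention Pruning):
- At intermediate decoder layer $K$, let $L_v'$ denote the number of visual tokens after Stage 1.
- Form text+response context $H_{\text{txt}}$ ($L_t$ tokens), and let $H_{\text{vis}}$ denote the hidden states of the $L_v'$ surviving visual tokens.
- Compute decoder cross-attention: $C = \mathrm{Softmax}\left(H_{\text{vis}} H_{\text{txt}}^{\top} / \sqrt{d}\right) \in \mathbb{R}^{L_v' \times L_t}$
- Cross-modal importance for token $j$: $I_{\text{cross}}(j) = \frac{1}{L_t} \sum_{k=1}^{L_t} C_{j,k}$
- For a pruning ratio $P$, forward only the top $\lceil (1-P)\,L_v' \rceil$ tokens to subsequent layers.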
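The Stage 2 selection can be sketched the same way. Again this is an illustration rather than the paper's code: `cross_attention_prune` and the toy inputs are hypothetical, and the softmax is assumed to run over the visual axis, so that each text token distributes one unit of attention across the visual tokens and the row mean reflects how much the text attends to each visual token.

```python
import numpy as np


def cross_attention_prune(h_vis, h_txt, pruning_ratio):
    """Keep the top ceil((1 - P) * L_v') visual tokens by cross-modal importance.

    h_vis: (L_v', d) visual hidden states; h_txt: (L_t, d) text hidden states.
    Returns (kept indices in ascending order, importance scores).
    """
    num_vis, dim = h_vis.shape
    scores = h_vis @ h_txt.T / np.sqrt(dim)  # shape (L_v', L_t)
    # Assumption: softmax over the visual axis (axis 0), one column per text token.
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)
    importance = weights.mean(axis=1)  # I_cross[j] = (1 / L_t) * sum_k C[j, k]
    num_keep = int(np.ceil(round((1.0 - pruning_ratio) * num_vis, 9)))
    keep = np.sort(np.argsort(-importance)[:num_keep])
    return keep, importance


# Hypothetical toy input: tokens 2 and 5 align with the text direction,
# the other six visual tokens are orthogonal to it.
h_vis = np.zeros((8, 4))
h_vis[:, 1] = 1.0
h_vis[2] = [2.0, 0.0, 0.0, 0.0]
h_vis[5] = [1.5, 0.0, 0.0, 0.0]
h_txt = np.array([[3.0, 0.0, 0.0, 0.0],
                  [2.0, 0.0, 0.0, 0.0],
                  [4.0, 0.0, 0.0, 0.0]])

keep, importance = cross_attention_prune(h_vis, h_txt, pruning_ratio=0.75)
print(keep)  # [2 5]
```

In a real decoder the scores would come from the layer's own query and key projections; the raw dot product above stands in for them only to keep the sketch self-contained.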
This dual-stage mechanism preserves both low-level visual structure and high-level task relevance, outperforming single-stage approaches in robustness to aggressive pruning.
3. Algorithmic Realization within the LVLM Pipeline
The STAR framework integrates into standard LVLM inference loops, with pruning occurring both after the vision encoder and mid-decoder. The procedural flow is as follows:
```
# Stage 1: self-attention pruning after the vision encoder
H_v = Project(Z_v)                          # L_v × d
A = Softmax((H_v H_v^T) / √d)               # L_v × L_v
I_self[i] = (1/L_v) * sum(A[i, :])  for i in range(L_v)
keep_idx1 = argsort_descending(I_self)[:ceil((1 - R) * L_v)]
Z_v_prime = Z_v[keep_idx1]                  # now L_v' tokens
H_v_prime = g(Z_v_prime)                    # project to d-dim
X = concat(H_v_prime, H_q)                  # (L_v' + L_t) × d

for i in range(Ω):
    if i == K:
        # Stage 2: cross-attention pruning
        H_vis, H_txt = split(X)
        C = Softmax((H_vis H_txt^T) / √d)
        I_cross[j] = (1/len(H_txt)) * sum(C[j, :])  for j in range(len(H_vis))
        keep_idx2 = argsort_descending(I_cross)[:ceil((1 - P) * len(H_vis))]
        H_vis = H_vis[keep_idx2]
        X = concat(H_vis, H_txt)
    X = TransformerDecoderLayer_i(X)
```
The sequencing of stages ensures conservative early pruning (preserving spatial diversity) and aggressive, context-aware late pruning (maximizing efficiency).
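To make the sequencing concrete, the sketch below computes how many visual tokens each decoder layer sees under the two-stage schedule. The helper names and the example values (ratios of 0.5, pruning layer 16, 32 decoder layers) are hypothetical choices for illustration, not settings reported in the source; only the 576-token input matches the LLaVA-1.5 figure quoted in this article.

```python
import math


def keep_count(num_tokens, ratio):
    """ceil((1 - ratio) * num_tokens), rounded first to absorb float error."""
    return math.ceil(round((1.0 - ratio) * num_tokens, 9))


def visual_token_schedule(num_visual, stage1_ratio, stage2_ratio,
                          prune_layer, num_layers):
    """Visual tokens entering each decoder layer 0 .. num_layers - 1.

    Stage 1 applies before the decoder; Stage 2 applies just before
    layer `prune_layer` runs, so that layer already sees the reduced set.
    """
    after_stage1 = keep_count(num_visual, stage1_ratio)
    after_stage2 = keep_count(after_stage1, stage2_ratio)
    return [after_stage1 if layer < prune_layer else after_stage2
            for layer in range(num_layers)]


schedule = visual_token_schedule(num_visual=576, stage1_ratio=0.5,
                                 stage2_ratio=0.5, prune_layer=16,
                                 num_layers=32)
print(schedule[0], schedule[15], schedule[16])  # 288 288 144
```

The rounding step guards against cases such as `(1 - 0.7) * 10`, which evaluates to slightly more than 3 in floating point and would otherwise round up to 4.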
4. Theoretical Analysis of Computational Savings
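The multiply-add count of a single Transformer decoder layer depends on the sequence length and the hidden width $d$: the projections and feed-forward block scale linearly with sequence length, while the attention scores scale quadratically.
Removing visual tokens from the sequence therefore yields savings in every layer that processes the shortened sequence.
STAR applies self-attention-based token pruning at ratio $R$ up to layer $K$, and cross-attention-based pruning at ratio $P$ from layer $K$ through final depth $\Omega$. Cumulative FLOP savings are the sum of the per-layer savings over these two layer ranges,
measured against a baseline that carries all $L_v$ initial visual tokens through every layer. Empirical evaluation demonstrates a 29% total inference FLOP reduction on LLaVA-1.5-7B with about a 1% drop in end-to-end accuracy (Guo et al., 18 May 2025).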
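A back-of-the-envelope estimate of the cumulative saving can be sketched as below. The per-layer cost expression is an assumption, a commonly used rough estimate rather than the formula from the source, and the model dimensions and token counts are hypothetical. The sketch covers decoder layers for a single forward pass only, so it is not expected to reproduce the reported 29% figure.

```python
def layer_cost(seq_len, hidden, ffn):
    """Rough per-layer multiply-add estimate (assumed form, not from the source).

    4*n*d^2 : query, key, value and output projections
    2*n^2*d : attention scores and the weighted sum over values
    2*n*d*m : a two-matrix feed-forward block of inner width m
    """
    return (4 * seq_len * hidden ** 2
            + 2 * seq_len ** 2 * hidden
            + 2 * seq_len * hidden * ffn)


def decoder_cost(visual_per_layer, num_text, hidden, ffn):
    """Total cost when layer i processes visual_per_layer[i] + num_text tokens."""
    return sum(layer_cost(v + num_text, hidden, ffn) for v in visual_per_layer)


def flop_saving(num_visual, after_stage1, after_stage2, prune_layer,
                num_layers, num_text, hidden, ffn):
    """Fraction of decoder cost saved relative to keeping every visual token."""
    baseline = decoder_cost([num_visual] * num_layers, num_text, hidden, ffn)
    pruned_schedule = ([after_stage1] * prune_layer
                       + [after_stage2] * (num_layers - prune_layer))
    pruned = decoder_cost(pruned_schedule, num_text, hidden, ffn)
    return 1.0 - pruned / baseline


# Hypothetical 7B-scale configuration and token counts, for illustration only.
config = dict(prune_layer=16, num_layers=32, num_text=64, hidden=4096, ffn=11008)
saving = flop_saving(576, 288, 144, **config)
print(f"estimated decoder FLOP saving: {saving:.1%}")
```

Because the attention term is quadratic in sequence length, the saving grows faster than the fraction of tokens removed once visual tokens dominate the sequence.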
5. Empirical Evaluation and Benchmarks
STAR is validated across eight visual question answering (VQA) benchmarks: VQAv2, GQA, VizWiz, ScienceQA-IMG, TextVQA, POPE, MME, MM-VET. Tested LVLMs include LLaVA-1.5 (7B and 13B parameters) and LLaVA-NeXT-7B. STAR is compared against FastV (mid-decoder cross-attention pruning), FasterVLM ([CLS]-to-patch self-attention pruning), and SparseVLM (progressive cross-modal pruning with recycling).
Key empirical results for LLaVA-1.5-7B:
- At 50% token retention (288/576), STAR achieves a 28.7% FLOP reduction, 0.77 GiB memory savings, and 1% additional latency, all while preserving 99% of original accuracy on all benchmarks.
- Under 5% retention (∼29 tokens), STAR maintains 97–99% of baseline performance, whereas FastV exhibits a 22-point drop on VQAv2.
- On larger models (LLaVA-1.5-13B, LLaVA-NeXT-7B), STAR consistently surpasses FastV and FasterVLM in accuracy–efficiency trade-off across TextVQA, ScienceQA-IMG, MME, and POPE.
Ablations indicate that using either stage in isolation leads to inferior outcomes: self-attention pruning alone misses task relevance and induces mid-decoder inaccuracies, while cross-attention pruning alone allows too many non-salient tokens to persist. Stage-wise coupling combines the advantages of both, ensuring both noise removal and signal preservation.
6. Design Properties and Deployment Implications
STAR is training-free and plug-and-play, requiring no model weight updates and integrating as a module into established LVLM inference pipelines. By strictly separating early visual feature culling (self-attentive, text-agnostic) from late-stage, query-aware distillation (cross-attentive, text-guided), STAR supports substantial real-world acceleration (20–40% FLOP savings) with negligible accuracy sacrifice, even when the retained tokens fall to single-digit percentages of the original count.
These properties render STAR particularly suited for deployment in latency-sensitive environments (e.g., high-resolution, interactive multimodal applications), where hardware resource constraints and prompt response are critical. The framework’s combination of substantial efficiency gains and minimal accuracy loss, validated across diverse domains and model scales, evidences its practical utility in contemporary vision-language systems (Guo et al., 18 May 2025).