STAR: Stage-wise Attention-Guided Reduction

Updated 4 March 2026
  • The paper introduces STAR, a training-free, two-stage attention-guided framework that efficiently prunes redundant visual tokens in large vision-language models.
  • STAR employs early self-attention pruning followed by cross-attention pruning to discard irrelevant tokens while preserving crucial task-specific information.
  • Empirical evaluations on VQA benchmarks indicate that STAR achieves up to 29% FLOP reduction with less than 1% accuracy drop, enabling effective deployment in latency-sensitive environments.

Stage-Wise Attention-Guided Reduction (STAR) is a training-free, plug-and-play framework for efficient inference in large vision-language models (LVLMs). LVLMs such as LLaVA combine a Vision Transformer (ViT)-style visual encoder with a Transformer-based LLM decoder, and high-resolution images produce hundreds to thousands of visual tokens. These tokens add significant computational overhead during inference, yet a substantial portion of them are either redundant or irrelevant to the multimodal task at hand. STAR addresses this with a global, two-stage attention-guided token pruning procedure that reduces computational requirements while preserving or even enhancing downstream task fidelity (Guo et al., 18 May 2025).

1. Motivation and High-Level Approach

Traditional training-free token pruning in LVLMs has relied on single-stage strategies, operating either just after vision encoding (via self-attention) or at the cross-modal interface (via cross-attention). Such local perspectives frequently lead to suboptimal information flow and substantial performance degradation under high pruning ratios. STAR introduces a two-stage, global approach:

  • Stage 1: Early pruning, immediately following the vision encoder, removes redundant, low-level visual tokens via visual self-attention analysis.
  • Stage 2: At an intermediate decoder layer, further aggressive pruning is performed, this time guided by cross-attention between the surviving visual tokens and the text context (prompt plus partially generated response), thus discarding task-irrelevant tokens.

This holistic reduction scheme allows STAR to minimize FLOPs, memory consumption, and latency while retaining accuracy, maintaining robust performance even with only 5–10% of original visual tokens.

2. Formal Definitions and Pruning Mechanism

Given $L_v$ initial visual tokens with embeddings $H_v \in \mathbb{R}^{L_v \times d}$ and text context embeddings $H_q \in \mathbb{R}^{L_t \times d}$ (prompt) and $H_{\mathrm{resp}} \in \mathbb{R}^{L_o \times d}$ (partial response), token reduction unfolds in two attention-guided stages:

Stage 1 (Self-Attention Pruning):

  • Compute the ViT self-attention map:

$$A = \mathrm{Softmax}\left(\frac{H_v H_v^\top}{\sqrt{d}}\right), \quad A \in \mathbb{R}^{L_v \times L_v}.$$

  • Self-attention importance for token $i$:

$$I_i^{(\mathrm{self})} = \frac{1}{L_v} \sum_{j=1}^{L_v} A_{ij}.$$

  • For a reduction ratio $R \in (0,1)$, retain the top $(1-R) L_v$ tokens according to $I_i^{(\mathrm{self})}$.
  • Project the surviving tokens for input to the LLM decoder.
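The Stage 1 steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the random embeddings, and the width `d = 64` are hypothetical. One assumption deserves emphasis: if the softmax is normalised along each row, every row of $A$ sums to 1 and the score $I_i^{(\mathrm{self})}$ would be identical for all tokens, so the sketch normalises along the other axis, making `A[i, j]` the share of token `j`'s attention that lands on token `i`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_prune(H_v, R):
    """Return indices of the top ceil((1 - R) * L_v) tokens and all scores."""
    L_v, d = H_v.shape
    # ASSUMPTION: normalise over axis 0 so that A[i, j] is the attention
    # token j pays to token i; row sums are then informative, not constant.
    A = softmax(H_v @ H_v.T / np.sqrt(d), axis=0)
    importance = A.mean(axis=1)            # (1 / L_v) * sum_j A[i, j]
    n_keep = int(np.ceil((1 - R) * L_v))
    top = np.argsort(-importance)[:n_keep]
    return np.sort(top), importance        # sorted to keep positional order

# Toy usage: random embeddings stand in for real vision-encoder outputs.
rng = np.random.default_rng(0)
H_v = rng.standard_normal((576, 64))
keep, imp = self_attention_prune(H_v, R=0.5)
H_v_kept = H_v[keep]                       # 288 surviving tokens
```

Sorting the kept indices is a design choice of this sketch: it preserves the original spatial order of the surviving tokens, which the text does not specify.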

Stage 2 (Visual–Textual Cross-Attention Pruning):

  • At intermediate decoder layer $K$, let $L_v'$ denote the number of tokens after Stage 1.
  • Form text+response context $\widetilde H_q \in \mathbb{R}^{(L_t+L_o)\times d}$.
  • Compute decoder cross-attention:

$$C_K = \mathrm{Softmax}\left(\frac{H_v \widetilde H_q^{\top}}{\sqrt{d}}\right), \quad C_K \in \mathbb{R}^{L_v' \times (L_t+L_o)}.$$

  • Cross-modal importance for token $i$:

$$I_i^{(\mathrm{cross})} = \frac{1}{L_t+L_o} \sum_{j=1}^{L_t+L_o} C_K[i, j].$$

  • For a pruning ratio $P \in (0,1)$, forward only the top $(1-P) L_v'$ tokens to subsequent layers.
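A comparable sketch for Stage 2 is given below, again under stated assumptions rather than as the reference implementation. The helper names, tensor sizes, and random hidden states are hypothetical. As in any softmax-based score, the normalisation axis matters: the sketch treats text tokens as queries and normalises over the visual axis, so `C[i, j]` is the share of context token `j`'s attention that falls on visual token `i`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_prune(H_vis, H_ctx, P):
    """Keep the top ceil((1 - P) * L_v') visual tokens by text relevance."""
    L_vp, d = H_vis.shape
    # ASSUMPTION: normalise over the visual axis (axis 0) so each context
    # token distributes one unit of attention across the visual tokens.
    C = softmax(H_vis @ H_ctx.T / np.sqrt(d), axis=0)
    importance = C.mean(axis=1)            # average over L_t + L_o columns
    n_keep = int(np.ceil((1 - P) * L_vp))
    keep = np.sort(np.argsort(-importance)[:n_keep])
    return H_vis[keep], keep

# Toy usage with hypothetical sizes: 288 surviving visual tokens,
# a 20-token prompt and a 5-token partial response.
rng = np.random.default_rng(1)
H_vis = rng.standard_normal((288, 64))
H_q = rng.standard_normal((20, 64))
H_resp = rng.standard_normal((5, 64))
H_ctx = np.concatenate([H_q, H_resp], axis=0)   # text+response context
H_pruned, keep2 = cross_attention_prune(H_vis, H_ctx, P=0.5)
```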

This dual-stage mechanism preserves both low-level visual structure and high-level task relevance, outperforming single-stage approaches in robustness to aggressive pruning.

3. Algorithmic Realization within the LVLM Pipeline

The STAR framework integrates into standard LVLM inference loops, with pruning occurring both after the vision encoder and mid-decoder. The procedural flow is as follows:

# Stage 1: self-attention pruning right after the vision encoder
H_v = Project(Z_v)                                # L_v × d
A = Softmax((H_v @ H_v.T) / sqrt(d))              # L_v × L_v
I_self = [sum(A[i, :]) / L_v for i in range(L_v)]
keep_idx1 = argsort_descending(I_self)[:ceil((1 - R) * L_v)]
Z_v_prime = Z_v[keep_idx1]                        # Now L_v' tokens

H_v_prime = g(Z_v_prime)                          # Project to d-dim
X = concat(H_v_prime, H_q)                        # (L_v'+L_t) × d

for i in range(Ω):                                # Ω decoder layers
    if i == K:
        # Stage 2: cross-attention pruning at intermediate layer K
        H_vis, H_txt = split(X)
        C = Softmax((H_vis @ H_txt.T) / sqrt(d))
        I_cross = [sum(C[j, :]) / len(H_txt) for j in range(len(H_vis))]
        keep_idx2 = argsort_descending(I_cross)[:ceil((1 - P) * len(H_vis))]
        H_vis = H_vis[keep_idx2]
        X = concat(H_vis, H_txt)
    X = TransformerDecoderLayer_i(X)

The sequencing of stages ensures conservative early pruning (preserving spatial diversity) and aggressive, context-aware late pruning (maximizing efficiency).

4. Theoretical Analysis of Computational Savings

A single Transformer decoder layer with sequence length $L$ and hidden width $D$ incurs a baseline multiply-add count of

$$F_{\mathrm{base}} = 6 L D^2 + 2 L^2 D.$$

Reducing the sequence by $N$ tokens yields per-layer savings

$$\Delta = 6 N D^2 + 2 N^2 D.$$
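As a cross-check that is not taken from the source, the savings can also be obtained by directly subtracting the baseline cost at length $L - N$ from the cost at length $L$:

$$F_{\mathrm{base}}(L) - F_{\mathrm{base}}(L-N) = 6 N D^2 + 2 D \left[L^2 - (L-N)^2\right] = 6 N D^2 + 4 L N D - 2 N^2 D.$$

The quadratic part of this difference is $2 N D (2L - N)$, which is at least $2 N^2 D$ whenever $N \le L$ and equals it only when $N = L$. Under this reading, the expression for $\Delta$ stated above acts as a conservative lower bound on the per-layer savings.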

STAR applies self-attention-based token pruning at ratio $R$ up to layer $K$, and cross-attention-based pruning at ratio $P$ from layer $K+1$ through final depth $\Omega$. Cumulative FLOP savings are

$$\begin{aligned}
\Delta_{\mathrm{stage1}} &= K \left[6 R L_v^0 D^2 + 2 (R L_v^0)^2 D\right], \\
\Delta_{\mathrm{stage2}} &= (\Omega-K) \left[6 P L_v^0 D^2 + 2 (P L_v^0)^2 D\right], \\
\Delta_{\mathrm{total}} &= \Delta_{\mathrm{stage1}} + \Delta_{\mathrm{stage2}},
\end{aligned}$$

where $L_v^0$ is the initial visual token count. Empirical evaluation demonstrates an approximately 29% total inference FLOP reduction on LLaVA-1.5-7B with only a 1–2% drop in end-to-end accuracy (Guo et al., 18 May 2025).
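The cumulative-savings formulas can be evaluated with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, as are the values chosen for `R`, `P`, and `K`. The width `D = 4096` and depth of 32 layers are assumed here for a 7B-class decoder, and 576 is the initial token count quoted for LLaVA-1.5. The output counts only the terms in the formulas above, so it is not expected to reproduce the measured 29% figure.

```python
def layer_flops(L, D):
    # Baseline multiply-add count of one decoder layer: 6 L D^2 + 2 L^2 D.
    return 6 * L * D**2 + 2 * L**2 * D

def layer_savings(N, D):
    # Per-layer savings for removing N tokens, as stated in the text.
    return 6 * N * D**2 + 2 * N**2 * D

def star_total_savings(L_v0, D, R, P, K, n_layers):
    """Stage-wise and total savings following the cumulative formulas."""
    stage1 = K * layer_savings(R * L_v0, D)
    stage2 = (n_layers - K) * layer_savings(P * L_v0, D)
    return stage1, stage2, stage1 + stage2

# ASSUMED values for illustration; R, P and K are hypothetical settings.
L_v0, D, n_layers = 576, 4096, 32
s1, s2, total = star_total_savings(L_v0, D, R=0.25, P=0.5, K=16,
                                   n_layers=n_layers)
# Savings relative to running all layers on the visual tokens alone.
fraction = total / (n_layers * layer_flops(L_v0, D))
```

Expressing the result as a fraction of `n_layers * layer_flops(L_v0, D)` is a choice made for this sketch; a full accounting would also include the text tokens and the vision encoder.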

5. Empirical Evaluation and Benchmarks

STAR is validated across eight visual question answering (VQA) benchmarks: VQAv2, GQA, VizWiz, ScienceQA-IMG, TextVQA, POPE, MME, MM-VET. Tested LVLMs include LLaVA-1.5 (7B and 13B parameters) and LLaVA-NeXT-7B. STAR is compared against FastV (mid-decoder cross-attention pruning), FasterVLM ([CLS]-to-patch self-attention pruning), and SparseVLM (progressive cross-modal pruning with recycling).

Key empirical results for LLaVA-1.5-7B:

  • At 50% token retention (288/576), STAR achieves a 28.7% FLOP reduction, 0.77 GiB memory savings, and <1% additional latency, all while preserving >99% of original accuracy on all benchmarks.
  • Under 5% retention (∼29 tokens), STAR maintains 97–99% of baseline performance, whereas FastV exhibits a 22-point drop on VQAv2.
  • On larger models (LLaVA-1.5-13B, LLaVA-NeXT-7B), STAR consistently surpasses FastV and FasterVLM in accuracy–efficiency trade-off across TextVQA, ScienceQA-IMG, MME, and POPE.
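The retention figures above follow from simple arithmetic on the 576 initial tokens: 50% retention leaves 288 tokens, and 5% leaves about 29 (0.05 × 576 = 28.8). The short sketch below shows how the two stage ratios compose into an overall budget, using the ceiling rule from the pseudocode. The helper name and the particular split between `R` and `P` are hypothetical, since the text does not state how a given overall budget is divided between the stages.

```python
import math

def retained_tokens(L_v, R, P):
    """Token counts after Stage 1 and after Stage 2."""
    after_stage1 = math.ceil((1 - R) * L_v)
    after_stage2 = math.ceil((1 - P) * after_stage1)
    return after_stage1, after_stage2

# Hypothetical split: prune 25% early, then half of the survivors at layer K.
stage1, stage2 = retained_tokens(576, R=0.25, P=0.5)
overall_retention = stage2 / 576   # fraction of the original tokens kept
```

Because the stages multiply, a conservative Stage 1 ratio combined with an aggressive Stage 2 ratio can reach the same final budget as a single harsh cut.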

Ablations indicate that using either self-attention pruning or cross-attention pruning in isolation leads to inferior outcomes: the former induces mid-decoder inaccuracies because task relevance is missed, while the latter allows too many non-salient tokens to persist. Stage-wise coupling combines these advantages, ensuring both noise removal and signal preservation.

6. Design Properties and Deployment Implications

STAR is training-free and plug-and-play, requiring no model weight updates and integrating as a module into established LVLM inference pipelines. By strictly separating early visual feature culling (self-attentive, text-agnostic) from late-stage, query-aware distillation (cross-attentive, text-guided), STAR supports substantial real-world acceleration (20–40% FLOP savings) with negligible accuracy sacrifice, even when only 5–10% of the original visual tokens are retained.

These properties render STAR particularly suited for deployment in latency-sensitive environments (e.g., high-resolution, interactive multimodal applications), where hardware resource constraints and prompt response are critical. The framework’s combination of substantial efficiency gains and minimal accuracy loss, validated across diverse domains and model scales, evidences its practical utility in contemporary vision-language systems (Guo et al., 18 May 2025).
