STAR: Stage-wise Attention-Guided Reduction

Updated 4 March 2026

The paper introduces STAR, a training-free, two-stage attention-guided framework that efficiently prunes redundant visual tokens in large vision-language models.
STAR employs early self-attention pruning followed by cross-attention pruning to discard irrelevant tokens while preserving crucial task-specific information.
Empirical evaluations on VQA benchmarks indicate that STAR achieves up to 29% FLOP reduction with less than 1% accuracy drop, enabling effective deployment in latency-sensitive environments.

Stage-Wise Attention-Guided Reduction (STAR) is a training-free, plug-and-play framework for efficient inference in large vision-language models (LVLMs). LVLMs such as LLaVA combine a Vision Transformer (ViT)-style visual encoder with a Transformer-based LLM decoder, where high-resolution images produce hundreds to thousands of visual tokens. These tokens add significant computational overhead during inference, yet a substantial portion of them are redundant or irrelevant to the multimodal task at hand. STAR addresses this challenge with a global, two-stage attention-guided token pruning procedure that reduces computational requirements while preserving or even enhancing downstream task fidelity (Guo et al., 18 May 2025).

1. Motivation and High-Level Approach

Traditional training-free token pruning in LVLMs has relied on single-stage strategies, operating either just after vision encoding (via self-attention) or at the cross-modal interface (via cross-attention). Such local perspectives frequently lead to suboptimal information flow and substantial performance degradation under high pruning ratios. STAR introduces a two-stage, global approach:

Stage 1: Early pruning, immediately following the vision encoder, removes redundant, low-level visual tokens via visual self-attention analysis.
Stage 2: At an intermediate decoder layer, further aggressive pruning is performed, this time guided by cross-attention between the surviving visual tokens and the text context (prompt plus partially generated response), thus discarding task-irrelevant tokens.

This holistic reduction scheme allows STAR to minimize FLOPs, memory consumption, and latency while retaining accuracy, maintaining robust performance even with only 5–10% of original visual tokens.

2. Formal Definitions and Pruning Mechanism

Given $L_v$ initial visual tokens with embeddings $H_v \in \mathbb{R}^{L_v \times d}$ and text context embeddings $H_q \in \mathbb{R}^{L_t \times d}$ (prompt) and $H_{\mathrm{resp}} \in \mathbb{R}^{L_o \times d}$ (partial response), token reduction unfolds in two attention-guided stages:

Stage 1 (Self-Attention Pruning):

Compute the ViT self-attention map:

$A = \mathrm{Softmax}\left(\frac{H_v H_v^\top}{\sqrt{d}}\right), \quad A \in \mathbb{R}^{L_v \times L_v}.$

Self-attention importance for token $i$ (the average attention it receives from all tokens):

$I_i^{(\mathrm{self})} = \frac{1}{L_v} \sum_{j=1}^{L_v} A_{ji}.$

For a reduction ratio $R \in (0,1)$ , retain the top $(1-R) L_v$ tokens according to $I_i^{(\mathrm{self})}$ .
Project the surviving tokens for input to the LLM decoder.
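
The Stage 1 steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `self_attention_importance` and `prune_stage1` and the random toy input are hypothetical. The sketch assumes importance is the attention a token receives (the column mean of $A$), because each row of a row-wise softmax sums to 1 and row means would therefore be identical for every token.

```python
import math
import numpy as np

def self_attention_importance(H_v):
    """Average attention each visual token receives (column mean of A)."""
    d = H_v.shape[1]
    logits = H_v @ H_v.T / math.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)             # row-wise softmax
    return A.mean(axis=0)

def prune_stage1(H_v, R):
    """Keep the ceil((1 - R) * L_v) highest-scoring tokens, in original order."""
    scores = self_attention_importance(H_v)
    n_keep = math.ceil((1 - R) * H_v.shape[0])
    keep = np.sort(np.argsort(-scores)[:n_keep])
    return H_v[keep], keep

# Toy input (hypothetical): 8 tokens with d = 16
rng = np.random.default_rng(0)
H_v = rng.standard_normal((8, 16))
kept, idx = prune_stage1(H_v, R=0.5)
print(kept.shape, idx)
```

Sorting the kept indices preserves the original spatial order of the surviving tokens, which is a design choice of this sketch rather than something the source specifies.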

Stage 2 (Visual–Textual Cross-Attention Pruning):

At intermediate decoder layer $K$ , let $L_v'$ denote the number of tokens after Stage 1.
Form text+response context $\widetilde H_q \in \mathbb{R}^{(L_t+L_o)\times d}$ .
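Compute decoder cross-attention, with the text context as queries and the surviving visual tokens $H_v' \in \mathbb{R}^{L_v' \times d}$ as keys:

$C_K = \mathrm{Softmax}\left(\frac{\widetilde H_q {H_v'}^{\top}}{\sqrt{d}}\right), \quad C_K \in \mathbb{R}^{(L_t+L_o) \times L_v'}.$

Cross-modal importance for token $i$ (the average attention it receives from the text context):

$I_i^{(\mathrm{cross})} = \frac{1}{L_t+L_o} \sum_{j=1}^{L_t+L_o} C_K[j, i].$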

For a pruning ratio $P \in (0,1)$ , forward only the top $(1-P) L_v'$ tokens to subsequent layers.
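
Because Stage 2 acts on the survivors of Stage 1, the two ratios compound: roughly $(1-R)(1-P)$ of the original visual tokens reach the later decoder layers. The short sketch below works through that arithmetic. The helper name `tokens_after_star` and the example ratios are illustrative assumptions, not settings reported in the paper; fractional counts are rounded up, as in the procedural flow of Section 3.

```python
import math

def tokens_after_star(L_v, R, P):
    """Return (L_v', L_v''): visual tokens left after Stage 1 and after Stage 2."""
    after_stage1 = math.ceil((1 - R) * L_v)
    after_stage2 = math.ceil((1 - P) * after_stage1)
    return after_stage1, after_stage2

# Illustrative ratios only: 576 initial tokens, R = 0.5, P = 0.5
stage1, stage2 = tokens_after_star(576, R=0.5, P=0.5)
print(stage1, stage2, stage2 / 576)  # 288 144 0.25
```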

This dual-stage mechanism preserves both low-level visual structure and high-level task relevance, outperforming single-stage approaches in robustness to aggressive pruning.

3. Algorithmic Realization within the LVLM Pipeline

The STAR framework integrates into standard LVLM inference loops, with pruning occurring both after the vision encoder and mid-decoder. The procedural flow is as follows:

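import math
import numpy as np

def softmax(x):
    # Row-wise softmax with max-subtraction for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def top_indices(scores, ratio):
    # Indices of the ceil((1 - ratio) * n) highest scores, restored to original order
    n_keep = math.ceil((1 - ratio) * len(scores))
    return np.sort(np.argsort(-scores)[:n_keep])

def star_forward(Z_v, H_q, project, decoder_layers, R, P, K):
    # project and decoder_layers are caller-supplied callables standing in
    # for the projector g and the Ω Transformer decoder layers
    # Stage 1: self-attention pruning, right after the vision encoder
    H_v = project(Z_v)                                    # L_v × d
    A = softmax(H_v @ H_v.T / math.sqrt(H_v.shape[1]))    # L_v × L_v
    I_self = A.mean(axis=0)                               # attention received by each token
    Z_v_prime = Z_v[top_indices(I_self, R)]               # now L_v' tokens
    H_vis = project(Z_v_prime)                            # L_v' × d
    X = np.concatenate([H_vis, H_q])                      # (L_v' + L_t) × d
    n_vis = len(H_vis)
    for i, layer in enumerate(decoder_layers):            # Ω decoder layers
        if i == K:
            # Stage 2: cross-attention pruning (text tokens query visual tokens)
            H_vis, H_txt = X[:n_vis], X[n_vis:]
            C = softmax(H_txt @ H_vis.T / math.sqrt(X.shape[1]))
            I_cross = C.mean(axis=0)                      # attention received from the text
            H_vis = H_vis[top_indices(I_cross, P)]
            n_vis = len(H_vis)
            X = np.concatenate([H_vis, H_txt])
        X = layer(X)
    return X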

The sequencing of stages ensures conservative early pruning (preserving spatial diversity) and aggressive, context-aware late pruning (maximizing efficiency).

4. Theoretical Analysis of Computational Savings

A single Transformer decoder layer with sequence length $L$ and hidden width $D$ incurs a baseline multiply-add count of

$F_{\mathrm{base}} = 6 L D^2 + 2 L^2 D.$

Reducing the sequence by $N$ tokens saves, per layer, at least the cost of processing those $N$ tokens on their own:

$\Delta = 6 N D^2 + 2 N^2 D.$
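
For reference, the exact per-layer difference can be worked out directly from $F_{\mathrm{base}}$. The derivation below uses only the formula above, evaluated at sequence lengths $L$ and $L-N$:

$F_{\mathrm{base}}(L) - F_{\mathrm{base}}(L-N) = 6 N D^2 + 2 D \left[L^2 - (L-N)^2\right] = 6 N D^2 + 2 N^2 D + 4 N (L-N) D.$

The extra term $4 N (L-N) D$ accounts for attention between the removed and the retained tokens, so $\Delta$ as written understates the savings whenever $N < L$.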

STAR applies self-attention-based token pruning at ratio $R$ up to layer $K$ , and cross-attention-based pruning at ratio $P$ from layer $K+1$ through final depth $\Omega$ . Cumulative FLOP savings are

$\Delta_{\mathrm{stage1}} = K \left[6 R L_v^0 D^2 + 2 (R L_v^0)^2 D\right],$

$\Delta_{\mathrm{stage2}} = (\Omega-K) \left[6 P L_v^0 D^2 + 2 (P L_v^0)^2 D\right],$

$\Delta_{\mathrm{total}} = \Delta_{\mathrm{stage1}} + \Delta_{\mathrm{stage2}},$

where $L_v^0$ is the initial visual token count. Empirical evaluation shows a total inference FLOP reduction of roughly 29% on LLaVA-1.5-7B with an accuracy drop of less than 1% (Guo et al., 18 May 2025).
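
To get a feel for how these terms add up, the sketch below evaluates the decoder FLOP count layer by layer with $F_{\mathrm{base}}$ rather than with the closed-form savings. It is a rough calculator under stated assumptions: the function names are hypothetical; the width, depth, text length, and ratios in the example are illustrative values, not settings taken from the paper; the first $K$ layers are assumed to see the Stage 1 survivors and the remaining $\Omega - K$ layers the Stage 2 survivors; and the cost of the vision encoder and of computing the importance scores is ignored.

```python
import math

def layer_flops(L, D):
    """Multiply-adds of one decoder layer: 6 L D^2 + 2 L^2 D."""
    return 6 * L * D**2 + 2 * L**2 * D

def star_flop_reduction(L_v, L_t, D, n_layers, K, R, P):
    """Fraction of decoder FLOPs saved relative to keeping all L_v visual tokens."""
    after_stage1 = math.ceil((1 - R) * L_v)
    after_stage2 = math.ceil((1 - P) * after_stage1)
    base = n_layers * layer_flops(L_v + L_t, D)
    pruned = (K * layer_flops(after_stage1 + L_t, D)
              + (n_layers - K) * layer_flops(after_stage2 + L_t, D))
    return 1 - pruned / base

# Illustrative values only (not taken from the paper)
saving = star_flop_reduction(L_v=576, L_t=64, D=4096, n_layers=32, K=16, R=0.25, P=0.5)
print(f"{saving:.1%} of decoder FLOPs saved")
```

Counting each layer at its actual sequence length includes the text tokens and the compounding of the two ratios, which the closed-form expressions leave out.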

5. Empirical Evaluation and Benchmarks

STAR is validated across eight visual question answering (VQA) benchmarks: VQAv2, GQA, VizWiz, ScienceQA-IMG, TextVQA, POPE, MME, MM-VET. Tested LVLMs include LLaVA-1.5 (7B and 13B parameters) and LLaVA-NeXT-7B. STAR is compared against FastV (mid-decoder cross-attention pruning), FasterVLM ([CLS]-to-patch self-attention pruning), and SparseVLM (progressive cross-modal pruning with recycling).

Key empirical results for LLaVA-1.5-7B:

At 50% token retention (288/576), STAR achieves a 28.7% FLOP reduction, 0.77 GiB memory savings, and less than 1% additional latency, all while preserving more than 99% of original accuracy on all benchmarks.
At 5% retention (∼29 tokens), STAR maintains 97–99% of baseline performance, whereas FastV exhibits a 22-point drop on VQAv2.
On larger models (LLaVA-1.5-13B, LLaVA-NeXT-7B), STAR consistently surpasses FastV and FasterVLM in accuracy–efficiency trade-off across TextVQA, ScienceQA-IMG, MME, and POPE.

Ablations indicate that using either self-attention pruning or cross-attention pruning in isolation leads to inferior outcomes: the former induces mid-decoder inaccuracies because it misses task relevance, while the latter allows too many non-salient tokens to persist. Stage-wise coupling combines the two advantages, ensuring both noise removal and signal preservation.

6. Design Properties and Deployment Implications

STAR is training-free and plug-and-play, requiring no model weight updates and integrating as a module into established LVLM inference pipelines. By strictly separating early visual feature culling (self-attentive, text-agnostic) from late-stage, query-aware distillation (cross-attentive, text-guided), STAR supports substantial real-world acceleration (20–40% FLOP savings) with negligible accuracy sacrifice, even when only a single-digit percentage of the visual tokens is retained.

These properties render STAR particularly suited for deployment in latency-sensitive environments (e.g., high-resolution, interactive multimodal applications), where hardware resource constraints and prompt response are critical. The framework’s combination of substantial efficiency gains and minimal accuracy loss, validated across diverse domains and model scales, evidences its practical utility in contemporary vision-language systems (Guo et al., 18 May 2025).

References (1)

Guo et al. (2025). STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference.
