
Small VLM Early Exiting (SEE)

Updated 6 December 2025
  • Small VLM Early Exiting (SEE) is a framework that leverages lightweight multimodal models to dynamically terminate inference based on confidence and scenario guidance.
  • It utilizes global attention maps and token-pruning strategies to reduce computational cost while maintaining robust accuracy across diverse VLM tasks.
  • SEE’s architectural variants, including cascaded, multi-exit, and scenario-guided models, demonstrate significant latency reductions with minimal impact on performance.

Small VLM Early Exiting (SEE) is a framework for improving the efficiency and responsiveness of vision-language models (VLMs) by leveraging both architectural innovations and early-exit mechanisms. SEE encompasses methods that enable small multimodal transformers to terminate inference dynamically, based on model confidence, prediction consistency, input complexity, or explicit scenario guidance, thereby sharply reducing computational cost while preserving task performance.

1. Conceptual Overview and Motivation

SEE addresses the prominent issue that large VLMs, while state-of-the-art in multi-modal tasks, incur substantial computational overhead, primarily due to expensive processing of voluminous vision tokens and deep transformer stacks. Traditional pruning techniques based on partial attention or fixed confidence-based gating offer only limited savings and may compromise accuracy. Recent findings demonstrate that global attention maps derived from small VLMs closely resemble those of large VLMs, presenting the opportunity to use small VLMs both as efficient pruners and as providers of reliable candidate predictions (Zhao et al., 4 Dec 2024).

Moreover, SEE generalizes beyond pruning to include structured early-exit decision policies at multiple points in a model, adaptability via scenario-aware control (e.g., navigation-guided exits in autonomous driving (Hu et al., 2 Oct 2025)), and improved calibration, making it applicable to diverse vision-language domains.

2. Pipeline Design and Early-Exit Criteria

SEE systems integrate small VLMs into the multi-stage inference pipeline with two core roles:

  1. Guidance for Pruning: The small VLM computes an aggregated global attention map across all vision tokens, identifying salient regions to preserve during inference by a larger VLM. Pruning is performed according to this map, retaining the top $R\%$ most relevant tokens for further processing (Zhao et al., 4 Dec 2024).
  2. Dynamic Early Exit: The small VLM generates a candidate output $x_G$. Its reliability is quantitatively assessed using two complementary metrics:

    • Confidence Score ($\mathcal{S}_{\mathrm{conf}}$): the length-normalized probability of the generated answer sequence.

      $$\mathcal{S}_{\mathrm{conf}} = \exp\left\{ \frac{1}{N_G} \log P\big(x_G^1, \ldots, x_G^{N_G}\big) \right\}$$

    • Consistency Score ($\mathcal{S}_{\mathrm{cons}}$): the stability of the prediction under severe token pruning.

      $$\mathcal{S}_{\mathrm{cons}} = \prod_{i=1}^{N_G} P\left( x_G^i \mid \mathrm{VLM}^S, \text{prompt}, \text{pruned tokens} \right)$$

    • The final early-exit score is the average of the two: $\mathcal{S} = \tfrac{1}{2}\left(\mathcal{S}_{\mathrm{conf}} + \mathcal{S}_{\mathrm{cons}}\right)$.

If $\mathcal{S} \geq \tau$ for a pre-set threshold $\tau$, inference terminates and the small VLM's output is returned; otherwise, the pruned token set is forwarded to the large VLM.
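A minimal Python sketch of this pipeline is given below, assuming hypothetical `small_vlm`/`large_vlm` wrappers whose `generate` and `score` methods expose per-token log-probabilities and an aggregated vision-token attention map; the actual SGL interfaces may differ.

```python
import numpy as np

def prune_tokens(attention_map: np.ndarray, retain_ratio: float) -> np.ndarray:
    """Return indices of the top-R% vision tokens by aggregated attention.

    `attention_map` holds one saliency score per vision token, assumed to be
    aggregated over heads/layers of the small VLM upstream.
    """
    k = max(1, int(len(attention_map) * retain_ratio))
    return np.argsort(attention_map)[-k:]  # indices of the most salient tokens

def confidence_score(token_logprobs: list[float]) -> float:
    """Length-normalized probability of the generated answer sequence."""
    return float(np.exp(np.mean(token_logprobs)))

def consistency_score(token_logprobs_pruned: list[float]) -> float:
    """Probability of reproducing the same answer from the pruned token set."""
    return float(np.exp(np.sum(token_logprobs_pruned)))

def see_inference(small_vlm, large_vlm, image_tokens, prompt,
                  retain_ratio=0.09, tau=0.75):
    # 1. Small VLM produces a candidate answer, per-token log-probs,
    #    and a global attention map over vision tokens (assumed API).
    answer, logprobs, attn = small_vlm.generate(image_tokens, prompt)

    # 2. Attention-guided pruning: keep only the top-R% most salient tokens.
    keep = prune_tokens(attn, retain_ratio)
    pruned_tokens = image_tokens[keep]

    # 3. Score the candidate: confidence plus consistency under pruning.
    s_conf = confidence_score(logprobs)
    logprobs_pruned = small_vlm.score(answer, pruned_tokens, prompt)  # assumed API
    s_cons = consistency_score(logprobs_pruned)
    score = 0.5 * (s_conf + s_cons)

    # 4. Early exit if the averaged score clears the threshold;
    #    otherwise forward the pruned tokens to the large VLM.
    if score >= tau:
        return answer
    return large_vlm.generate(pruned_tokens, prompt)
```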

3. Architectural Variants and Algorithmic Recipes

SEE has been instantiated across several architectures:

  • Small-Large VLM Cascades: As in (Zhao et al., 4 Dec 2024), a small InternVL2-2B model generates both attention maps and candidate answers. Token-pruned inputs are passed to large InternVL2-26B models when necessary.
  • Multimodal Early Exiting: LayoutLMv3-based models introduce intermediate exit heads (ramps/gates) after key fusion or transformer layers. Exiting is controlled by maximum softmax probability (MSP) or gated confidence, with calibration improving exit reliability (Hamed et al., 21 May 2024).
  • Sequence-to-Sequence (Seq2Seq) Early Exits: Multiple exits in both encoder and decoder branches, controlled by layer-wise similarity (cosine similarity between consecutive layer representations) and decaying thresholds for autoregressive decoding (Tang et al., 2022). Exits can occur independently in the vision and text modalities; a minimal sketch of this similarity criterion follows the list below.
  • Knowledge-Distilled SEE: Models such as DEEVISum pair multi-stage knowledge distillation (from large teacher to small student via a mentor) with auxiliary exit heads at partitioned decoder depths. Exits leverage maximum-softmax or entropy-based confidence, tuned per head for latency-accuracy trade-off (Khan et al., 30 Apr 2025).
  • Scenario-Guided (Nav-EE) Early Exiting: In autonomous driving, dedicated exit layers are offline-profiled for each scenario/task (e.g., traffic-light, pedestrian detection), then dynamically selected online per navigation module cues (Hu et al., 2 Oct 2025).
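To make the similarity-based criterion concrete, the sketch below shows a MuE-style layer-wise check in PyTorch: decoding exits once consecutive layer representations stop changing, judged against a threshold that decays with the autoregressive step. The layer modules, `lm_head`, and the linear decay schedule are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def decaying_threshold(base_tau: float, step: int, decay: float = 0.05) -> float:
    """Threshold relaxed at later autoregressive steps (assumed schedule)."""
    return max(0.0, base_tau - decay * step)

def forward_with_early_exit(layers, hidden, lm_head, step, base_tau=0.9):
    """Run decoder layers, exiting once consecutive representations converge.

    `layers` is an iterable of transformer blocks and `lm_head` maps hidden
    states to vocabulary logits; both are placeholders for a real model.
    """
    tau = decaying_threshold(base_tau, step)
    prev = hidden
    for i, layer in enumerate(layers):
        hidden = layer(prev)
        # Cosine similarity between consecutive layer outputs, averaged over tokens.
        sim = F.cosine_similarity(hidden, prev, dim=-1).mean().item()
        if sim >= tau:                      # representation has saturated: exit here
            return lm_head(hidden), i + 1   # logits and number of layers actually used
        prev = hidden
    return lm_head(hidden), len(layers)
```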

The table below summarizes several SEE design points:

| System | Early-Exit Criterion | Exit Placement | Typical Speedup / Loss |
|---|---|---|---|
| SGL (Zhao et al., 4 Dec 2024) | Confidence + consistency (small VLM) | Global (skip large model) | 80% FLOPs ↓, ~5% acc. ↓ |
| MuE (Tang et al., 2022) | Layer-wise cosine similarity, decaying thresholds | Both encoder & decoder (multimodal) | 30–50% FLOPs ↓, <1% ↓ |
| LayoutLMv3 (Hamed et al., 21 May 2024) | MSP + calibrated thresholds | Fusion layer / key transformer indices | 20% latency ↓, ~0% ↓ |
| DEEVISum (Khan et al., 30 Apr 2025) | Max-softmax / entropy | Partitioned decoder exits | 21% time ↓, ~1.3 F1 ↓ |
| Nav-EE (Hu et al., 2 Oct 2025) | Scenario-based offline profiling | Per-context, task-specific layers | 51–64% latency ↓, ↑ acc. |
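Of the designs above, the scenario-guided variant (Nav-EE) admits a particularly simple runtime mechanism: a lookup from the current navigation context to an offline-profiled exit layer. The sketch below illustrates this idea; the scenario names and layer indices are illustrative placeholders, not values from the paper.

```python
# Offline-profiled mapping from driving scenario to the exit layer that was
# found best for that task (illustrative values only).
NAV_EXIT_PROFILE = {
    "intersection_traffic_light": 14,
    "crosswalk_pedestrian": 18,
    "highway_cruise": 10,
}

def select_exit_layer(navigation_cue: str, default_layer: int = 24) -> int:
    """Pick the pre-profiled exit layer for the current scenario, else run deep."""
    return NAV_EXIT_PROFILE.get(navigation_cue, default_layer)
```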

4. Training, Calibration, and Knowledge Transfer

SEE frameworks employ targeted training strategies for robust early-exit behavior:

  • Multi-Exit Loss Aggregation: Joint loss over all exits (cross-entropy) and top classifier, with weighting schemes (subgraph, entropy regularization) to ensure all heads are predictive (Hamed et al., 21 May 2024).
  • Layerwise Loss in Seq2Seq: Cross-entropy added at every decoder layer, enabling reliable generation for premature exits (Tang et al., 2022).
  • Multi-Stage Knowledge Distillation: Student models are jointly supervised by mentor and teacher at both final and intermediate exit heads, using weighted KL-divergence terms for distribution matching (Khan et al., 30 Apr 2025).
  • Calibration: Post-hoc temperature scaling and, when needed, Dirichlet calibration at each exit head make MSP confidence scores reliable across all exits, reducing expected calibration error (ECE). Threshold selection combines accuracy and calibration error via $T_b = 1 - (\mathrm{ACC}_b / \mathrm{ECE}_b)$ with MinMax normalization (Hamed et al., 21 May 2024).
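As a concrete illustration of the calibration step, the sketch below fits a per-exit temperature on held-out logits and computes a standard ECE estimate; the Dirichlet variant and the exact ACC/ECE-based threshold normalization are not reproduced, and the function signatures are assumptions.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor,
                    steps: int = 200, lr: float = 0.01) -> float:
    """Post-hoc temperature scaling for one exit head on a validation split."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log-temperature (> 0)
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return float(log_t.exp())

def expected_calibration_error(probs: torch.Tensor, labels: torch.Tensor,
                               n_bins: int = 15) -> float:
    """Standard ECE over max-softmax confidence bins."""
    conf, pred = probs.max(dim=-1)
    correct = (pred == labels).float()
    ece, edges = 0.0, torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.float().mean() * (correct[mask].mean() - conf[mask].mean()).abs()
    return float(ece)
```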

5. Efficiency–Performance Trade-Offs and Empirical Outcomes

SEE consistently delivers large reductions in inference time and FLOPs with moderate or negligible drops in accuracy, under rigorous evaluation protocols:

  • SGL (Zhao et al., 4 Dec 2024): At $R = 9\%$ retained tokens, the pruned large VLM achieves approximately 78.0% TextVQA accuracy (a 4.5% drop), with roughly a 65% latency reduction. Even the small VLM alone retains 89.6% of the full-model score after aggressive pruning.
  • MuE (Tang et al., 2022): OFA Base (6E+6D) with MuE reduces expected time by about 50% (SNLI-VE) and 40% (MS COCO), with drops of 0.7% and 9.7 CIDEr, respectively.
  • Document Image SEE (Hamed et al., 21 May 2024): Weighted Concat-Quarter ramps with calibrated exits yield a 20% latency reduction at parity accuracy (80.75%).
  • DEEVISum (Khan et al., 30 Apr 2025): Early-Exit + MSKD achieves a 21% average time reduction at only a 1.3-point F1 drop, outperforming prior basic distillation approaches.
  • Nav-EE (Hu et al., 2 Oct 2025): Navigation-contextual exits yield up to a 63.9% latency decrease and a 25.6% accuracy increase for LLaVA-7B; real-vehicle deployments demonstrate halved inference time.

6. Practical Guidelines for Deployment

Key recommendations for constructing and tuning SEE systems:

  • Model Selection: Start with small- to medium-sized VLMs (1B–4B), and consider both vision and language pathway depth for exit head insertion.
  • Exit Placement: 2–5 exits balance accuracy/latency. Choices include after each quarter of model depth or post-fusion layers.
  • Threshold/Parameter Tuning: Set thresholds on confidence/similarity via grid search or the recommended accuracy–ECE trade-off for Pareto-efficient frontier points; adjust layer-dependent thresholds for granular control (a minimal grid-search sketch follows this list).
  • Pruning: Aggressive token pruning (e.g., R=9%R=9\%) works well when guided by small VLM attention across both prompt and generated tokens (Zhao et al., 4 Dec 2024).
  • Calibration: Use temperature scaling and, optionally, Dirichlet-calibration per-exit for reliable anytime exit decision-making.
  • Task-Specific Guidance: For scenario-dependent inference (autonomous driving), precompute task-optimal exit layers offline, then select dynamically per scenario (Hu et al., 2 Oct 2025).
  • Knowledge Transfer: Adopt multi-stage distillation when possible to maximize small VLM predictive capability (Khan et al., 30 Apr 2025).
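To make the threshold-tuning recommendation concrete, the following sketch grid-searches exit thresholds and keeps the Pareto-efficient (accuracy, latency) operating points; `evaluate` is a placeholder for running the SEE pipeline on a validation split.

```python
import numpy as np

def pareto_front(points):
    """Keep (accuracy, latency, tau) points not dominated by any other point."""
    front = []
    for acc, lat, tau in points:
        dominated = any(a >= acc and l <= lat and (a > acc or l < lat)
                        for a, l, _ in points)
        if not dominated:
            front.append((acc, lat, tau))
    return sorted(front, key=lambda p: p[1])   # order by latency

def tune_threshold(evaluate, taus=np.linspace(0.5, 0.95, 10)):
    """Grid-search exit thresholds; `evaluate(tau)` returns (accuracy, latency)."""
    points = []
    for tau in taus:
        acc, lat = evaluate(float(tau))
        points.append((acc, lat, float(tau)))
    return pareto_front(points)
```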

This suggests SEE offers flexibility: the practitioner can scale from strict-accuracy to ultra-low-latency regimes using the same architecture, dominating the accuracy–latency Pareto frontier relative to single-model or single-exit approaches.

7. Variants, Open Challenges, and Future Directions

Several research lines and limitations pertain to SEE adoption:

  • Scenario-Guided vs. Confidence-Gated Exiting: Nav-EE demonstrates the advantage of offline profiled, scenario-aware exits in environments like autonomous driving, achieving gains unattainable via uniform or confidence-gated exits (Hu et al., 2 Oct 2025).
  • Task-Specific Exit Calibration: The utility of modality decomposition, per-modality exit depth, and dynamic thresholds depends on the task's vision/text complexity (Tang et al., 2022).
  • Integration with Other Efficiency Techniques: Further speedups are possible by combining SEE with quantization and pruning at a hardware/software stack level.
  • Limitations: Accurate navigation/situation labels are critical in scenario-guided setups. Misclassification can induce sub-optimal exits and impact safety.
  • Extensibility: SEE naturally extends to multi-modal document classification, video summarization, unified perception, trajectory prediction, and language-based planning across specialized domains.

A plausible implication is that SEE will accelerate next-generation resource-aware VLM deployments across edge devices and high-throughput production systems. Its blend of lightweight pruning, robust calibration, fine-grained control, and multi-modal generalizability marks it as a foundational pattern for efficient multi-modal inference (Zhao et al., 4 Dec 2024, Tang et al., 2022, Hamed et al., 21 May 2024, Khan et al., 30 Apr 2025, Hu et al., 2 Oct 2025).
