
Small Vision–Language Model Early Exiting

Updated 24 December 2025
  • The paper introduces SEE methods that enable dynamic early exits in transformer architectures, reducing decoder computation by up to 60% with minimal accuracy loss.
  • It employs deep supervision, adversarial feature alignment, and risk-controlled threshold calibration to ensure intermediate layers produce reliable outputs.
  • Empirical evaluations on tasks like image captioning and VQA demonstrate substantial latency reductions and improved efficiency under resource constraints.

Small Vision–Language Model Early Exiting (SEE) refers to a family of methods that accelerate inference in multi-layer vision–language models by allowing outputs to be generated from intermediate network layers based on real-time confidence estimation. Rather than running the entire model for every prediction, SEE mechanisms dynamically decide, per input and per decoding step, whether computation can be “exited early” without severely compromising output quality. SEE draws on technical insights and validation from a breadth of literature, notably dynamic early-exit training with deep supervision (Tang et al., 2023), adversarial alignment of intermediate representations (Bajpai et al., 7 Jun 2025), and distribution-free, post-hoc calibration for risk-controlled deployment (Jazbec et al., 31 May 2024). Typical applications include image captioning, visual question answering (VQA), and other multimodal vision–language tasks executed under latency or resource constraints.

1. Conceptual Foundations and Early-Exit Architectures

The central architecture underpinning SEE consists of a stack of $L$ layers (often transformer blocks) in which lightweight exit heads (classification, retrieval, or generation modules) are attached to intermediate layers. Each exit head computes a predictive distribution $\hat{p}_\ell(y \mid x, t)$ and a confidence score $c_\ell(x, t) \in [0, 1]$, with $x$ denoting the image input and $t$ a possible text prompt. Different methodological families specify how these exits are constructed and trained (a minimal sketch follows the list below):

  • Multi-Exit Decoders with Deep Supervision: Each decoder layer receives a copy of the output head (possibly with adaptation layers for shallow outputs). Deep supervision is enforced by training all exits simultaneously, e.g., using layerwise cross-entropy combined into a final objective such as $L_\text{total} = L_\text{avg} + L_N$ (Tang et al., 2023).
  • Adversarially-Aligned Exits: Each exit contains a trainable “exit transformer” that maps intermediate features toward the distribution of final-layer features. A GAN framework is employed: the generator is the exit transformer, the discriminator is a feature classifier distinguishing between shallow and true final-layer features, and the classifier at each exit is frequently reused from the final output layer (Bajpai et al., 7 Jun 2025).
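To make the exit-head mechanics concrete, here is a minimal sketch of a multi-exit decoder with per-layer adaptation modules and a shared vocabulary head, in the spirit of DEED; class and method names are hypothetical and hyperparameters illustrative, not taken from the papers.

```python
# Minimal multi-exit stack sketch (hypothetical names): every layer's
# hidden state can be projected through one shared vocabulary head, and
# confidence is the max probability of the resulting distribution.
import torch
import torch.nn as nn


class MultiExitDecoder(nn.Module):
    def __init__(self, num_layers: int, d_model: int, vocab_size: int):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        # Per-layer adaptation (linear + LayerNorm) pulls shallow features
        # toward the final layer's semantic space.
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.LayerNorm(d_model))
            for _ in range(num_layers)
        )
        # Single vocabulary projection (W_g, b_g) shared across all exits.
        self.head = nn.Linear(d_model, vocab_size)

    def exit_distribution(self, h: torch.Tensor, layer_idx: int):
        """Return (p_hat, confidence) for an intermediate hidden state h
        of shape [batch, seq, d_model]."""
        logits = self.head(self.adapters[layer_idx](h))
        p_hat = logits.softmax(dim=-1)
        confidence = p_hat.max(dim=-1).values  # c_l(x, t) in [0, 1]
        return p_hat, confidence
```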

The following table summarizes representative SEE exit architectures:

| Research Line | Exit Mechanism | Training Approach |
|---|---|---|
| DEED (Tang et al., 2023) | All decoder layers, shared head + adaptation modules | Deep supervision with adaptation modules |
| FREE (Bajpai et al., 7 Jun 2025) | Sparse exits, single transformer layer each | GAN-based adversarial alignment |
| Risk Control (Jazbec et al., 31 May 2024) | All layers, custom heads | Joint backbone/head training |

2. Training Strategies and Feature Alignment

SEE systems demand that intermediate exits can make accurate predictions with minimal further computation. Deep supervision achieves this by attaching the predictive head to each exit and enforcing explicit prediction loss at every depth. To mitigate semantic drift and align the feature distributions of shallow and deep layers:

  • Shared Output Heads: A single final-layer vocabulary projection (e.g., $W_g$, $b_g$) is shared across exits, ensuring output-distribution compatibility (Tang et al., 2023, Bajpai et al., 7 Jun 2025).
  • Adaptation Modules or Exit Transformers: For shallow exits, adaptation modules (linear projection + LayerNorm per early layer) or single-layer transformers (in the GAN setting) are used to drive intermediate features towards the “semantic space” of the final output.
  • Adversarial Feature Alignment: The GAN-based criterion combines a discriminator loss and a generator loss so that transformed intermediate features “fool” the discriminator, becoming indistinguishable from final-layer outputs (Bajpai et al., 7 Jun 2025); a sketch follows this list.
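The following is a hedged sketch of the adversarial alignment idea: a single-layer “exit transformer” acts as the generator mapping shallow features toward final-layer features, and a small discriminator tries to tell them apart. All names, shapes, and the specific BCE formulation are illustrative assumptions, not FREE's exact losses.

```python
# Adversarial feature alignment sketch: G = exit transformer (generator),
# D = discriminator over per-token features. Shapes/names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 768
generator = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
discriminator = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
)


def alignment_losses(h_shallow: torch.Tensor, h_final: torch.Tensor):
    """Return (generator loss, discriminator loss) for one batch of
    [batch, seq, d_model] shallow and final-layer features."""
    fake = generator(h_shallow)                 # aligned shallow features
    d_real = discriminator(h_final).squeeze(-1)
    d_fake = discriminator(fake.detach()).squeeze(-1)
    # Discriminator: label final-layer features 1, shallow features 0.
    loss_d = F.binary_cross_entropy_with_logits(
        d_real, torch.ones_like(d_real)
    ) + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    # Generator: make aligned shallow features look final-layer-like.
    loss_g = F.binary_cross_entropy_with_logits(
        discriminator(fake).squeeze(-1), torch.ones_like(d_real)
    )
    return loss_g, loss_d
```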

Shallow exits benefit directly from deep supervision during training, yielding up to +3% accuracy at the shallowest layers on VQA tasks (Tang et al., 2023).
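A minimal sketch of the combined deep-supervision objective $L_\text{total} = L_\text{avg} + L_N$, assuming one logits tensor per exit (variable names are illustrative):

```python
import torch
import torch.nn.functional as F


def deep_supervision_loss(exit_logits, targets):
    """L_total = L_avg + L_N over a list of per-exit logits.

    exit_logits: list of [batch, seq, vocab] tensors, shallowest first.
    targets: [batch, seq] gold token ids.
    """
    losses = [
        F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        for logits in exit_logits
    ]
    l_avg = torch.stack(losses).mean()  # layerwise average, L_avg
    l_final = losses[-1]                # final-exit term, L_N
    return l_avg + l_final
```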

3. Early-Exit Criteria and Post-Hoc Calibration

Inference in SEE frameworks hinges on dynamic, per-example early-exit decisions. The canonical criterion is a confidence threshold: exit at layer $\ell$ if $c_\ell(x, t) \geq \lambda$, else continue to deeper layers. For token-generation models, this is applied step-wise:

  • Confidence Score ($c_\ell$ or $S_i$): Typically, $c_\ell = \max_k \hat{p}_\ell(y = k \mid x, t)$. For retrieval tasks, the margin between the top-1 and top-2 predictions is sometimes used (Jazbec et al., 31 May 2024).
  • Threshold Calibration via Risk Control: Rather than heuristically picking $\lambda$, distribution-free, finite-sample guarantees are sought. Calibration on a held-out dataset estimates the “performance-gap risk” $R(\lambda)$:

    $$R(\lambda) = \mathbb{E}_{(x,t,y)\sim P}\left[\ell(\hat{y}_\lambda, y) - \ell(\hat{y}_L, y)\right]$$

    CRC (risk control in expectation) or UCB (Upper Confidence Bound, for high-probability control) protocols yield data-driven thresholds $\hat{\lambda}$ that meet user-specified risk bounds, e.g., $\Delta_\text{error} \leq 5\%$ (Jazbec et al., 31 May 2024); a calibration sketch follows.
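Below is a hedged sketch of CRC-style threshold calibration following the general conformal risk control recipe (empirical risk plus a finite-sample inflation term); the exact protocol in Jazbec et al. may differ, and all names are illustrative.

```python
import numpy as np


def calibrate_threshold(gap_risk: np.ndarray, lambdas: np.ndarray,
                        epsilon: float = 0.05, B: float = 1.0) -> float:
    """Pick the smallest lambda whose CRC-adjusted risk is <= epsilon.

    gap_risk[i, j]: loss gap l(y_hat_lambda, y) - l(y_hat_L, y) for
    calibration example i under candidate threshold lambdas[j];
    B upper-bounds the per-example loss gap.
    """
    n = gap_risk.shape[0]
    r_hat = gap_risk.mean(axis=0)          # empirical risk R_hat(lambda)
    adjusted = (n * r_hat + B) / (n + 1)   # finite-sample inflation
    valid = lambdas[adjusted <= epsilon]
    # Smaller thresholds exit earlier; return the most aggressive valid
    # one, falling back to the deepest setting if none qualifies.
    return float(valid.min()) if valid.size else float(lambdas.max())
```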

4. Inference-Time Algorithms and Just-in-Time Computation

The runtime algorithm, particularly for decoder-based architectures, recomputes only those key–value (KV) caches and hidden states needed to align the exit's semantic features across decoding steps:

  • Step-Level Dynamic Early Exit: At each output step, hidden states and KV caches are updated just-in-time when a deeper layer is visited, ensuring semantically aligned inputs; naively copying shallow cache entries upward is infeasible because different steps exit at different depths (Tang et al., 2023).
  • Autoregressive Decoding Logic: For each token, iterate through exit layers to compute output distributions, stop at the first exit with $c_n^i > \tau$ (the confidence threshold), and emit the predicted token. Retained caches facilitate efficient re-entry into deeper layers when necessary (see the sketch below).
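A simplified sketch of this decoding loop; the model interfaces (`embed`, `layers`, `exit_distribution`) are hypothetical, and for clarity the sketch recomputes the full prefix at each step, whereas DEED instead maintains per-layer KV caches and refreshes a deeper layer's cache just-in-time the first time a step actually reaches it.

```python
import torch


@torch.no_grad()
def decode_with_early_exit(model, prompt_ids, tau=0.9, max_new_tokens=32):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        h = model.embed(torch.tensor([tokens]))
        for layer_idx, layer in enumerate(model.layers):
            h = layer(h)
            p_hat, conf = model.exit_distribution(h, layer_idx)
            # Exit at the first layer whose confidence for the last
            # position clears tau, or at the final layer as a fallback.
            if (conf[0, -1].item() > tau
                    or layer_idx == len(model.layers) - 1):
                tokens.append(int(p_hat[0, -1].argmax()))
                break
    return tokens
```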

5. Empirical Performance and Quantitative Analysis

SEE strategies deliver consistent compute–accuracy trade-offs across diverse benchmarks:

  • Decoder Latency Reductions: On LaTr++ Base (12-layer decoder) for DocVQA, DEED reduced decoder time by 55.8% (from 104.3 ms to 46.1 ms), with an increase in ANLS from 81.5 to 81.9. LaTr++ Large (24 layers) on DocVQA exhibited 72.9% decoder speedup and a slight ANLS improvement (Tang et al., 2023).
  • BLIP-2 VLMs with FREE: 1.5–1.7× speedups in image captioning with negligible changes in CIDEr score; robust to Gaussian noise perturbations (Bajpai et al., 7 Jun 2025).
  • Halving Average Inference Depth under Tight Risk Control: SEE with CRC control halved the average number of layers used (4.0 vs. 8.0) on VQA v2.0 with a ≤ 5% increase in error (Jazbec et al., 31 May 2024).

The following table highlights speed–accuracy trade-offs from key SEE studies:

| Model/Task | Speedup | Accuracy Change | Reference |
|---|---|---|---|
| LaTr++ Base / DocVQA | –55.8% decoder time | +0.4 ANLS | (Tang et al., 2023) |
| FREE BLIP-2 / COCO captioning | 1.63× | +0.7 CIDEr | (Bajpai et al., 7 Jun 2025) |
| VQA v2.0, CRC risk ≤ 5% | 2× (avg. layers) | ≤ 5% Δerror | (Jazbec et al., 31 May 2024) |

6. Phenomena: Overthinking and Mid-Crisis

SEE exposes and mitigates specific internal dynamics of deep decoder models:

  • Overthinking: Running to the final layer on easy inputs incurs redundant computations without improving accuracy; early exits reduce this effect (Bajpai et al., 7 Jun 2025).
  • Mid-Crisis: In frozen decoders, accuracy can dip at intermediate layers before recovering at the final exit, i.e., there exists $k_\text{mid}$ such that $\text{Acc}(k_\text{mid}) < \text{Acc}(k_\text{mid}-1)$ and $\text{Acc}(k_\text{mid}) < \text{Acc}(N)$. FREE addresses this by adversarially aligning mid-level features, flattening the “mid-crisis” dip (Bajpai et al., 7 Jun 2025).

7. Best Practices and Practical Considerations

Deployment of SEE methods is regulated by several implementation and calibration choices:

  • Calibration Set Size: 200–500 samples suffice for CRC; 1,000–2,000 are recommended for UCB's high-probability guarantees (Jazbec et al., 31 May 2024).
  • Threshold Selection: Uniform discretization of $\lambda$ (e.g., steps of 0.01–0.02) enables precise trade-off control (see the usage example after this list).
  • Expected Gains: 30–60% reduction in average computation at $\Delta_\text{error} \leq 5\%$. Gains scale with the application's risk tolerance and the monotonicity of exit calibration.
  • Failure Modes: Poorly calibrated exits or non-monotonic accuracy across layers may necessitate retraining exit heads or reverting to deeper inference (Jazbec et al., 31 May 2024).
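An illustrative use of the `calibrate_threshold` sketch from Section 3 with a uniform 0.01-step grid and a 500-example calibration set; the risk values here are synthetic placeholders, not results from any paper.

```python
import numpy as np

# Uniform lambda grid at 0.01 resolution (101 candidates).
lambdas = np.round(np.arange(0.0, 1.0001, 0.01), 2)

# Synthetic calibration risks that shrink as lambda grows (higher
# thresholds defer more examples to deeper, more accurate layers).
rng = np.random.default_rng(0)
gap_risk = rng.uniform(0.0, 0.1, size=(500, lambdas.size)) * (1.0 - lambdas)

lam_hat = calibrate_threshold(gap_risk, lambdas, epsilon=0.05)
print(f"calibrated exit threshold: {lam_hat:.2f}")
```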

SEE methodologies are model-agnostic across vision–language networks that support auxiliary heads or exits and are immediately applicable to a range of pretrained architectures. The resulting speed–accuracy trade-offs are quantifiable, user-controllable, and adaptable to application- or domain-specific risk requirements.
