RecursiveVLM: Efficient Recursive Multimodal Model
- RecursiveVLM is a parameter-efficient multimodal model that employs recursive transformer passes to iteratively refine feature representations.
- It uses a specialized Recursive Connector to align and fuse vision and language tokens across selected layers, ensuring robust multimodal integration.
- A novel monotonic recursion loss guarantees non-degrading prediction quality, enabling adjustable inference depth for a balance between speed and accuracy.
RecursiveVLM refers to a parameter-efficient large multimodal model (LMM) architecture that utilizes recursive refinement over Transformer layers to improve multimodal reasoning, vision–language understanding, and representational robustness, while maintaining a fixed parameter budget. The approach was introduced in "Looping Back to Move Forward: Recursive Transformers for Efficient and Flexible Large Multimodal Models" (Xu et al., 9 Feb 2026). RecursiveVLM is distinguished from related recursive or iterative VLM frameworks by its specific architectural innovations: a Recursive Connector, which manages the feedback and fusion of intermediate hidden states, and a Monotonic Recursion Loss, which directly regularizes each recursion step to guarantee non-degrading prediction quality as recursion depth increases.
1. Architectural Foundations
RecursiveVLM is built upon a standard VLM skeleton comprising a vision encoder (producing image embeddings $V$), a text encoder (producing text embeddings $T$), and a shared Transformer decoder with $L$ layers, denoted $f_\ell$ for each layer $\ell \in \{1, \dots, L\}$. Unlike conventional VLMs that process the input sequence in a single forward pass through the layers, RecursiveVLM repeatedly applies the same set of Transformer layers $R$ times (the recursion depth), at each step leveraging information from previous recursions. At step $r$, the model computes activations $H_r$ and then re-initializes its input embedding $E_{r+1}$ via the Recursive Connector before the next loop.
The recursion structure forms a looped computational graph:
- Input at recursion $r$: $E_r$, with $E_1 = [V; T]$ the concatenated image and text embeddings
- Pass $E_r$ through the $L$ shared Transformer layers (parameters $\theta$, identical at every recursion)
- Extract intermediate activations $H_r^{(\ell)}$ from selected layers $\ell$
- Update the input for the next recursion via the Recursive Connector: $E_{r+1} = \mathrm{Connector}(H_r, V, T)$
- Repeat for $R$ recursions; apply the output language modeling head at each step
This design realizes parameter reuse across recursion steps, enabling progressively refined representations and predictions without network inflation. The ability to control $R$ at inference allows trade-offs between computational cost and output quality.
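The looped computational graph above can be sketched in a few lines of NumPy. This is a toy illustration of parameter reuse only: the layer block, the LM head, and the connector stub (plus all shapes and the `0.1` feedback scale) are hypothetical stand-ins, not the paper's actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: hidden size, sequence length, vocab, layer count.
D, SEQ, VOCAB, L = 8, 4, 16, 2
layer_weights = [rng.standard_normal((D, D)) * 0.1 for _ in range(L)]  # ONE shared stack
lm_head = rng.standard_normal((D, VOCAB)) * 0.1

def transformer_layers(E):
    """One pass through the L shared layers (residual linear+tanh as a toy block)."""
    H = [E]
    for W in layer_weights:
        H.append(np.tanh(H[-1] @ W) + H[-1])
    return H  # intermediate activations H^(0..L)

def connector_stub(H, E1):
    """Placeholder for the Recursive Connector: residual feedback of the last state."""
    return E1 + 0.1 * H[-1]

def recursive_forward(E1, R):
    """Apply the SAME layers R times; parameters are reused, never duplicated."""
    E = E1
    logits_per_step = []
    for r in range(1, R + 1):
        H = transformer_layers(E)
        logits_per_step.append(H[-1] @ lm_head)  # LM head applied at every step
        if r < R:
            E = connector_stub(H, E1)            # re-initialize input for next loop
    return logits_per_step

E1 = rng.standard_normal((SEQ, D))
outs = recursive_forward(E1, R=3)
```

Note that `recursive_forward(E1, R=1)` and `recursive_forward(E1, R=3)` touch exactly the same `layer_weights`: depth is a runtime argument, not an architectural change.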
2. Recursive Connector: Feature Alignment and Fusion
The Recursive Connector resolves two critical challenges unique to stacking recursive passes: distributional misalignment across recursion depths and divergent statistical properties between vision and language tokens. Its primary operations are:
- Multi-layer fusion: From the $L$ Transformer layers, a uniform subset $\mathcal{S} \subseteq \{1, \dots, L\}$ is selected. For each $\ell \in \mathcal{S}$, the activation $H_r^{(\ell)}$ is partitioned into vision and language components $H_r^{(\ell, v)}$ and $H_r^{(\ell, t)}$.
- Modality-specific projection: For each modality $m \in \{v, t\}$ (vision, text) and layer $\ell \in \mathcal{S}$, the connector applies
$$P^{(\ell, m)} = \sigma\!\left(H_r^{(\ell, m)} W_1^{(\ell, m)}\right) W_2^{(\ell, m)},$$
where $W_1^{(\ell, m)}, W_2^{(\ell, m)}$ are learned weights and $\sigma$ is an activation.
- Recursive feedback: The projections are summed, scaled, and added to the initial embeddings:
$$\tilde{V} = V + \gamma_v \sum_{\ell \in \mathcal{S}} P^{(\ell, v)}, \qquad \tilde{T} = T + \gamma_t \sum_{\ell \in \mathcal{S}} P^{(\ell, t)},$$
and the connector then concatenates $\tilde{V}$ and $\tilde{T}$ to form the input for the next recursion.
This connector design allows for effective feedback without the collapse or drift often observed in naïve recursive architectures. Initializing the scaling terms $\gamma_v, \gamma_t$ to zero ensures that the first pass ($r = 1$) recovers the original pretrained model before any refinement.
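A minimal NumPy sketch of the connector's three operations follows. The selected layer indices, the two-layer tanh projection, and all shapes are illustrative assumptions; only the structure (modality split, per-layer projection, zero-initialized residual scaling) mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_V, N_T = 8, 3, 2           # hidden size, #vision tokens, #text tokens (toy values)
layers_selected = [0, 1]        # assumed uniform subset S of layer indices

def make_mlp(d):
    return rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1

# Modality-specific two-layer projections per selected layer (assumed form).
proj = {(l, m): make_mlp(D) for l in layers_selected for m in ("v", "t")}
# Residual scaling terms initialized to ZERO, so recursion r=1
# reproduces the base model exactly, as the text describes.
gamma = {"v": 0.0, "t": 0.0}

def connector(H, V1, T1):
    """Fuse selected-layer activations back into the initial embeddings."""
    dV = np.zeros_like(V1)
    dT = np.zeros_like(T1)
    for l in layers_selected:
        Hv, Ht = H[l][:N_V], H[l][N_V:]        # partition tokens by modality
        W1, W2 = proj[(l, "v")]
        dV += np.tanh(Hv @ W1) @ W2            # vision projection, summed over layers
        W1, W2 = proj[(l, "t")]
        dT += np.tanh(Ht @ W1) @ W2            # text projection, summed over layers
    V_next = V1 + gamma["v"] * dV              # scaled residual update
    T_next = T1 + gamma["t"] * dT
    return np.concatenate([V_next, T_next])    # E_{r+1}

V1, T1 = rng.standard_normal((N_V, D)), rng.standard_normal((N_T, D))
H = [rng.standard_normal((N_V + N_T, D)) for _ in layers_selected]
E2 = connector(H, V1, T1)
# With gamma = 0, the connector is an exact pass-through of [V1; T1].
assert np.allclose(E2, np.concatenate([V1, T1]))
```

The final assertion demonstrates the zero-initialization property: until the scaling terms are trained away from zero, recursion cannot perturb the pretrained model's behavior.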
3. Monotonic Recursion Loss: Progressive Refinement Guarantee
RecursiveVLM employs a novel token-wise monotonic loss to enforce that output quality, as measured by cross-entropy (CE) token loss, does not degrade with increasing recursion depth. For token $i$ at recursion $r$:
- Compute the raw CE loss $\ell_r^{(i)}$ at every recursion $r$.
- For $r > 1$, if $\ell_r^{(i)} > \ell_{r-1}^{(i)}$, rescale with a penalty factor $\lambda > 1$ (set empirically): $\tilde{\ell}_r^{(i)} = \lambda \, \ell_r^{(i)}$; otherwise $\tilde{\ell}_r^{(i)} = \ell_r^{(i)}$.
- Compute the mean loss for recursion $r$: $\mathcal{L}_r = \frac{1}{N} \sum_{i=1}^{N} \tilde{\ell}_r^{(i)}$, where $N$ is the number of tokens.
- Optimize the total loss $\mathcal{L} = \sum_{r=1}^{R} \mathcal{L}_r$.
This multi-step supervision enables the model to yield usable outputs at any recursion depth and ensures that additional refinement steps yield non-inferior (often better) results compared to previous steps.
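The steps above can be sketched directly. The penalty factor value `LAMBDA = 2.0` is an assumption for illustration (the paper's empirical value is not reproduced here); the token-wise comparison and rescaling follow the rule described.

```python
import numpy as np

LAMBDA = 2.0  # assumed penalty factor applied when a token's loss regresses

def monotonic_losses(ce_per_step):
    """ce_per_step: list of per-token CE loss arrays, one per recursion step.
    Tokens whose loss at step r exceeds their step r-1 loss are up-weighted,
    penalizing any degradation as recursion depth grows."""
    step_means = []
    for r, ce in enumerate(ce_per_step):
        scaled = ce.copy()
        if r > 0:
            worse = ce > ce_per_step[r - 1]         # token-wise comparison
            scaled[worse] = LAMBDA * scaled[worse]  # rescale regressed tokens only
        step_means.append(scaled.mean())            # L_r
    return sum(step_means)                          # total multi-step loss L

# Two recursion steps over four tokens: token 2 gets WORSE at step 2.
ce_r1 = np.array([1.0, 0.8, 0.5, 0.9])
ce_r2 = np.array([0.7, 0.6, 0.9, 0.4])
total = monotonic_losses([ce_r1, ce_r2])  # 0.8 + (0.7 + 0.6 + 1.8 + 0.4)/4 = 1.675
```

Only the regressed token pays the penalty, so gradients concentrate on exactly the positions where an extra recursion hurt.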
4. Training Protocol and Inference Adaptivity
RecursiveVLM is trained for a fixed recursion count, typically $R = 2$, with the Recursive Connector initialized in a pass-through (zero-scaling) state to replicate baseline VLM performance at $r = 1$. The AdamW optimizer is used for parameter updates. Training involves standard multimodal pretraining and supervised fine-tuning, with all recursion depths jointly optimized via the total loss $\mathcal{L}$.
At inference, recursion depth becomes a tunable, deployment-time parameter—single-pass ($R = 1$) is available for fast, resource-constrained scenarios, while additional recursions ($R = 2$ or more) can be run in high-availability or accuracy-critical settings. Each recursion increment adds the computational cost of a standard forward pass through the shared Transformer stack, with negligible connector overhead.
The following pseudocode, adapted from the source, summarizes the loop:
```
for r in range(1, R+1):
    H_r = TransformerLayers(E_r)
    logits = LMHead(H_r[-1])
    loss_r = cross_entropy(logits, targets)
    if r > 1:
        # Apply monotonic penalty
        loss_r = monotonic_scale_if_needed(loss_r, loss_{r-1})
    if r < R:
        # Recursive Connector fusion
        E_{r+1} = Connector(H_r, V_1, T_1)
total_loss = sum([loss_1, ..., loss_R])
```
5. Experimental Results and Empirical Analysis
RecursiveVLM was evaluated across eight benchmarks, including diagram understanding (AI2D), complex reasoning (MM-Star), integrated capability evaluation (MM-Vet), multi-discipline understanding (MMMU), general multimodal benchmarks (MMB), math reasoning (MathVista), OCR robustness (OCRBench), and hallucination (HallusionBench) (Xu et al., 9 Feb 2026). Key empirical findings include:
- At $R = 2$, RecursiveVLM improves mean test accuracy by approximately 3.6 points over a non-recursive baseline (58.9 vs. 55.3) and by 8.8 points over a vanilla recursion loop lacking the connector and monotonic loss (58.9 vs. 50.1).
- On reasoning-focused data, increases of up to 4.0 points are observed (62.6 vs. 58.6).
- Hallucination decreases with increasing recursion: HallusionBench scores improve as $R$ increases (37.6 → 38.7 → 45.1 for $R = 1, 2, 3$).
- Ablation studies show that eliminating or simplifying any connector component (multi-layer fusion, MLP projection, residual scaling) degrades performance substantially.
- Modality-specific connector parameters outperform shared parameters on average.
A table summarizing accuracy gains:
| Benchmark | Baseline (R=1) | RecursiveVLM (R=2) | Vanilla Recursion |
|---|---|---|---|
| Mean Score | 55.3 | 58.9 (+3.6) | 50.1 |
| HallusionBench | 37.6 | 38.7 (R=2), 45.1 (R=3) | – |
| Reasoning-Intensive | 58.6 | 62.6 (+4.0) | – |
6. Design Analysis and Future Prospects
RecursiveVLM achieves high deployment efficiency: the only architectural addition is the small connector MLPs, which increase model size only marginally. Recursion depth is a flexible knob for balancing throughput and accuracy, making the approach well-suited to both on-device (low-latency) and cloud-based (maximum quality) settings.
Possible extensions of the RecursiveVLM framework include:
- Application to unimodal LLMs and to multilingual models via modality- or language-specific connectors.
- Dynamic recursion control, potentially at token granularity, using adaptive gating or model confidence signals.
- Transfer of the recursive connector paradigm to vision-only models (e.g., through iterative multi-scale feature fusion in DETR).
- Integration with mixture-of-experts gating or parameter-efficient adapters for additional speed–capacity trade-offs.
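As one illustration of the dynamic recursion control mentioned above, a confidence signal such as predictive entropy could gate whether another recursion is worth running. This is a speculative sketch, not part of the published method; `step_fn`, the entropy threshold, and the stopping rule are all assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adaptive_depth(step_fn, E1, r_max, entropy_threshold=0.5):
    """Run extra recursions only while mean predictive entropy stays high.
    step_fn(E, r) is a hypothetical callable returning (logits, E_next);
    r_max must be >= 1 so logits is always bound."""
    E = E1
    for r in range(1, r_max + 1):
        logits, E = step_fn(E, r)
        p = softmax(logits)
        entropy = -(p * np.log(p + 1e-12)).sum(axis=-1).mean()
        if entropy < entropy_threshold:  # confident enough: stop recursing
            return logits, r
    return logits, r_max

# Toy step function whose logits sharpen (entropy falls) each recursion.
rng = np.random.default_rng(2)
base = rng.standard_normal((4, 16))
def toy_step(E, r):
    return base * r, E

logits, depth_used = adaptive_depth(toy_step, E1=None, r_max=5)
```

Because the monotonic loss makes every depth's output usable, such an early-exit rule never has to trade correctness for the saved forward passes.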
A plausible implication is that recursive parameter reuse—when combined with distributionally aware hidden-state connectors and strict monotonic supervision—effectively closes the performance gap between pretrained LMMs and larger or higher-capacity monolithic models, but with substantially improved deployment flexibility.
7. Relationship to Other Recursive Multimodal Methods
RecursiveVLM's recursion-over-transformer-layers is orthogonal to recursion-over-inputs as seen in ROVER (Schroeder et al., 3 Aug 2025), which recursively segments video sequences for hierarchical subtask reasoning, and to output correction recursion as in RIV (Li et al., 28 Sep 2025), where denoising and introspection are alternated to self-correct generated outputs. RecursiveVLM performs internal recursive refinement via shared-layer processing and multi-modal connectors, focusing on iterated representational enhancement rather than explicit input segmentation or token-level remasking.
This distinction situates RecursiveVLM as a general-purpose, architecture-level recursive solution for flexible and efficient multimodal modeling, with documented state-of-the-art benchmark results and extensibility across deployment contexts (Xu et al., 9 Feb 2026).