Interleaved Vision-Text Latent Reasoning (IVT-LR)
- Interleaved Vision-Text Latent Reasoning (IVT-LR) is a multimodal approach that alternates between latent textual and visual representations during reasoning.
- It employs recurrent multimodal state exchange by fusing continuous visual tokens with latent text, enabling efficient state updates during decoding.
- Empirical evaluations on benchmarks like ScienceQA and FutureBench demonstrate improved accuracy and inference speed using this interleaved reasoning strategy.
Interleaved Vision-Text Latent Reasoning (IVT-LR) is a multimodal reasoning paradigm in which intermediate computation is carried out in latent space while alternating, coupling, or jointly updating textual and visual representations rather than relying exclusively on explicit textual chain-of-thought. In the narrowest sense introduced by "Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space" (Chen et al., 14 Oct 2025), each reasoning step combines latent text (the hidden state from the previous step) and latent vision (a set of selected image embeddings). In the broader literature, the term also covers closely related designs that interleave text tokens with continuous visual latent spans during autoregressive decoding, as in LVR, Mirage, Future-L1, and SwimBird, while excluding architectures that merely use a single latent visual prefix, external tool-mediated visual feedback, or purely textual surrogates for vision (Li et al., 29 Sep 2025).
1. Definition, scope, and taxonomy
The strict formulation of IVT-LR treats reasoning as a sequence of latent multimodal updates. In (Chen et al., 14 Oct 2025), the input at each latent step is the concatenation of latent text (the hidden state from the previous step) and latent vision (a selected subset of image embeddings), with the next-token distribution computed from the fused multimodal state. This makes IVT-LR distinct from text-only CoT, because visual evidence is reintroduced at each reasoning step, and distinct from explicit multimodal CoT, because the intermediate states remain continuous rather than fully verbalized (Chen et al., 14 Oct 2025).
The literature around the term is heterogeneous. Some systems are direct matches because they interleave text and visual latents during decoding. Others are partial matches because they optimize only visual latent spans before answer generation, or because they externalize multimodal reasoning through tools and rendered feedback rather than latent states. This suggests that IVT-LR is best understood as a family of designs centered on recurrent multimodal state exchange, with variation in whether the exchanged state is continuous, symbolic, explicit, or tool-mediated.
| Method | Reasoning substrate | Relation to IVT-LR |
|---|---|---|
| IVT-LR (Chen et al., 14 Oct 2025) | Latent text + selected image embeddings | Canonical formulation |
| LVR (Li et al., 29 Sep 2025) | Text tokens + latent visual hidden states | Direct partial instantiation |
| Mirage (Yang et al., 20 Jun 2025) | Text + latent visual tokens | Direct interleaved variant |
| SwimBird (Tong et al., 5 Feb 2026) | Text tokens + continuous visual thoughts | Adaptive IVT-LR mode |
| Future-L1 (Jiang et al., 4 Jun 2026) | Text tokens + latent visual spans | Strong interleaved variant |
| StruVis (Lyu et al., 6 Mar 2026) | Structured textual vision | Textual approximation |
| UniVLR (Jiang et al., 12 May 2026) | Unified visual latent channel | Non-interleaved alternative |
A recurrent misconception is that any model with latent visual tokens is automatically an IVT-LR system. The literature does not support that equivalence. LaViT, for example, inserts latent visual thoughts before answer generation but does not alternate visual and textual latent states over time, making it staged rather than fully interleaved (Wu et al., 15 Jan 2026). UniVLR goes further in the opposite direction: it explicitly rejects interleaving and instead absorbs textual reasoning into a single unified visual workspace at inference time (Jiang et al., 12 May 2026).
2. Architectural patterns
The most direct IVT-LR architectures use a single autoregressive decoder that can switch between discrete text generation and continuous latent-state propagation. LVR marks latent mode with <|lvr_start|> and <|lvr_end|>. Once latent mode is entered, the model feeds the last hidden state forward as the next input embedding, and those hidden states are trained to reconstruct question-relevant visual tokens in a shared multimodal semantic space (Li et al., 29 Sep 2025). Mirage implements the same basic idea with latent visual tokens inserted into an otherwise textual reasoning trajectory, using a fixed-length latent segment placed after a textual prefix and before later text (Yang et al., 20 Jun 2025). Future-L1 generalizes this to multiple latent spans per response, controlled by <|latent_start|>, <|latent|>, and <|latent_end|>, so that text and latent visual spans can alternate several times within one decoding trace (Jiang et al., 4 Jun 2026).
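The mode switch described for LVR can be illustrated with a toy decoding loop. This is a minimal sketch, not the LVR implementation: `step`, `embed`, the fixed latent budget `latent_steps`, the `<latent>` trace placeholder, and the toy model are all assumptions made for illustration. The one behavior taken from the text is that, inside latent mode, the last hidden state is fed forward as the next input embedding instead of a token embedding.

```python
from typing import Callable, List, Tuple

Vector = List[float]
LVR_START, LVR_END, EOS = "<|lvr_start|>", "<|lvr_end|>", "<eos>"
LATENT_SLOT = "<latent>"  # placeholder written to the trace; not a real token


def decode(
    step: Callable[[Vector], Tuple[Vector, str]],
    embed: Callable[[str], Vector],
    first_input: Vector,
    latent_steps: int,
    max_len: int = 32,
) -> List[str]:
    """Greedy decoding that alternates between text mode and latent mode."""
    trace: List[str] = []
    x = first_input
    while len(trace) < max_len:
        _hidden, token = step(x)
        trace.append(token)
        if token == EOS:
            break
        if token != LVR_START:
            x = embed(token)  # text mode: next input is a token embedding
            continue
        # Latent mode: the hidden state itself becomes the next input embedding.
        hidden, _unused = step(embed(LVR_START))
        for _i in range(latent_steps):  # assumed fixed budget of latent steps
            trace.append(LATENT_SLOT)
            hidden, _unused = step(hidden)
        trace.append(LVR_END)
        x = embed(LVR_END)  # return to text mode
    return trace


# Hypothetical toy model: the next token is simply the next vocabulary entry.
VOCAB = ["look", LVR_START, LVR_END, "answer", EOS]


def toy_embed(token: str) -> Vector:
    return [float(VOCAB.index(token)), 0.0]


def toy_step(x: Vector) -> Tuple[Vector, str]:
    hidden = [x[0], x[1] + 1.0]
    token = VOCAB[min(int(x[0]) + 1, len(VOCAB) - 1)]
    return hidden, token


trace = decode(toy_step, toy_embed, first_input=[-1.0, 0.0], latent_steps=2)
```

With the toy model, the trace contains one text token, a two-slot latent span bracketed by the LVR markers, and then further text. A real system would need its own rule for leaving latent mode, which this sketch replaces with a fixed step count.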
SwimBird is architecturally broader. It unifies next-token prediction for textual thoughts and next-embedding prediction for visual thoughts inside a hybrid autoregressive decoder, with delimiters such as <|latent_start|> and <|latent_end|> making mode switching part of decoding itself (Tong et al., 5 Feb 2026). The important distinction is that SwimBird does not force interleaving on every query. It supports three modes—text-only, vision-only, and interleaved vision-text reasoning—and learns when to use each. This suggests that IVT-LR may be better treated as an optional mode within a larger reasoning policy rather than a universally optimal decoding template.
A second architectural pattern externalizes the interleaving loop instead of internalizing it. VisuoThink alternates Thought, Action, and Observation: text reasoning proposes a visual action, an external tool renders or updates a visual state, and the resulting image or executor feedback is fed back into the next reasoning step (Wang et al., 12 Apr 2025). This is genuine interleaving between language and visual state, but not latent interleaving in the strict representation-learning sense. A related but task-specific example is IVLR for long-horizon manipulation, where a multimodal transformer generates an explicit trace of textual subgoals and visual keyframes, caches it, and conditions closed-loop action decoding on that trace (Liu et al., 1 May 2026). This is explicit interleaving rather than latent reasoning, but it is part of the same design space.
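The externalized Thought, Action, Observation pattern can be sketched as a plain control loop. This is an illustrative sketch under stated assumptions, not VisuoThink's code: `think`, `act`, `max_turns`, and the toy functions are hypothetical names, and the string observation stands in for a rendered image or executor feedback.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]


def visuo_loop(
    think: Callable[[History], Tuple[str, Optional[str]]],
    act: Callable[[str], str],
    max_turns: int = 5,
) -> History:
    """Alternate text reasoning with tool-produced visual feedback."""
    history: History = []
    for _turn in range(max_turns):
        thought, action = think(history)
        history.append(("thought", thought))
        if action is None:  # the reasoner decided it can answer
            break
        observation = act(action)  # external tool renders or updates the visual state
        history.append(("action", action))
        history.append(("observation", observation))
    return history


# Hypothetical toy reasoner: request one visual aid, then answer.
def toy_think(history: History) -> Tuple[str, Optional[str]]:
    seen = [item for kind, item in history if kind == "observation"]
    if not seen:
        return "I need an auxiliary line.", "draw_line(A, C)"
    return f"Using {seen[-1]}, I can answer.", None


def toy_act(action: str) -> str:
    return f"rendered[{action}]"


history = visuo_loop(toy_think, toy_act)
```

The visual state lives entirely outside the model here, which is why this pattern counts as interleaving but not as latent interleaving.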
3. Intermediate-state design
The central design question in IVT-LR is what counts as the “visual” part of the intermediate state. In the strict latent formulations, the visual state is continuous. LVR uses last-layer hidden states trained to match projected visual token embeddings selected from regions of interest (Li et al., 29 Sep 2025). Mirage compresses helper-image patch features into a small number of continuous vectors and then lets the model generate analogous latent visual tokens autoregressively (Yang et al., 20 Jun 2025). Future-L1 aligns latent states to future-frame embeddings extracted by the Qwen3-VL vision encoder, thereby treating imagined future scene evolution as a continuous latent span rather than a textual hypothesis (Jiang et al., 4 Jun 2026). SwimBird likewise defines “visual thoughts” as continuous hidden-state embeddings supervised against embeddings of intermediate thinking images (Tong et al., 5 Feb 2026).
A more structured variant appears in the original IVT-LR paper. There, latent text is the hidden state from the previous step, while latent vision is a top-k subset of image embeddings selected by cumulative attention across layers. The multimodal latent step is the concatenation of the two, which is appended back into the sequence before the next reasoning step (Chen et al., 14 Oct 2025). This design is notable because it makes the visual state sparse and query-conditioned rather than a full copy of all visual tokens.
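The selection-and-concatenation step can be made concrete with a small sketch. It assumes that "cumulative attention" means summing each image token's attention score over layers, that the selected embeddings keep their original order, and that concatenation happens along the sequence axis; the function names and toy numbers are illustrative and not taken from the paper.

```python
from typing import List

Vector = List[float]


def select_latent_vision(attn_per_layer: List[List[float]], k: int) -> List[int]:
    """Indices of the k image tokens with the largest attention summed over layers."""
    num_tokens = len(attn_per_layer[0])
    cumulative = [0.0] * num_tokens
    for layer_scores in attn_per_layer:
        for i, score in enumerate(layer_scores):
            cumulative[i] += score
    ranked = sorted(range(num_tokens), key=lambda i: cumulative[i], reverse=True)
    return sorted(ranked[:k])  # assumption: keep the original token order


def build_latent_step(
    prev_hidden: Vector,
    image_embeds: List[Vector],
    attn_per_layer: List[List[float]],
    k: int,
) -> List[Vector]:
    """Latent text (one vector) followed by latent vision (k selected embeddings)."""
    chosen = select_latent_vision(attn_per_layer, k)
    return [prev_hidden] + [image_embeds[i] for i in chosen]


# Toy inputs: four image tokens, two layers of attention scores.
image_embeds = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
attn_per_layer = [
    [0.10, 0.40, 0.30, 0.20],
    [0.05, 0.15, 0.60, 0.20],
]
prev_hidden = [1.0, -1.0]

latent_step = build_latent_step(prev_hidden, image_embeds, attn_per_layer, k=2)
```

In this toy case the summed scores favor image tokens 1 and 2, so the latent step holds three vectors: the previous hidden state and those two embeddings. Only a sparse, query-dependent slice of the image is carried into the next step.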
Several neighboring methods clarify what IVT-LR is not. StruVis introduces a “text-based structured visual representation” serialized as JSON, containing object entities, relationships, and spatial layouts. In its reasoning trajectory, the “visual” intermediate state is therefore symbolic text rather than a learned visual latent (Lyu et al., 6 Mar 2026). This suggests a lightweight textual approximation to interleaved visual reasoning, but not IVT-LR in the strict latent sense. UniVLR departs further by rendering textual reasoning and auxiliary images into a single visual canvas and compressing that into a fixed latent sequence, so that inference proceeds through a unified visual latent channel without explicit text CoT (Jiang et al., 12 May 2026). Laser is also only a partial match: its latent trajectory is a fused multimodal semantic trajectory supervised by future textual visual-concept chains, but it does not alternate distinct visual and textual latent streams (Wang et al., 11 Jan 2026).
4. Optimization and supervision
Training IVT-LR systems has generally required more than ordinary next-token learning, because the model must both generate useful latent states and ensure that later reasoning actually depends on them. The original IVT-LR framework addresses this through a progressive multi-stage curriculum: stage 0 uses explicit CoT; later stages replace one additional explicit reasoning step at a time with a latent step marked by \<latent\>; supervision is applied only to the remaining explicit steps and the final answer (Chen et al., 14 Oct 2025). This design attempts to internalize reasoning gradually rather than forcing full latent reasoning from the start.
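The staged curriculum can be sketched as a function that builds the training target for a given stage. This sketch assumes that the earliest explicit steps are the ones replaced first and that supervision is expressed as a per-position boolean mask; both choices, and the function name, are illustrative rather than taken from the paper.

```python
from typing import List, Tuple

LATENT = "<latent>"


def build_stage_example(
    steps: List[str], answer: str, stage: int
) -> Tuple[List[str], List[bool]]:
    """Return (targets, supervised) for one curriculum stage.

    Stage 0 keeps every explicit step; stage s replaces s steps with latent
    markers. Only the remaining explicit steps and the answer are supervised.
    """
    n_latent = min(stage, len(steps))
    targets = [LATENT] * n_latent + steps[n_latent:] + [answer]
    supervised = [False] * n_latent + [True] * (len(steps) - n_latent + 1)
    return targets, supervised


steps = ["locate the beaker", "read the scale", "compare with the label"]
stage0 = build_stage_example(steps, "B", stage=0)
stage2 = build_stage_example(steps, "B", stage=2)
```

At stage 2, the first two steps become `<latent>` positions with no loss, while the third step and the answer still receive supervision. At the final stage, only the answer remains supervised.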
Mirage uses a different two-stage scheme. Stage 1 grounds latent tokens against compressed helper-image embeddings with a cosine visual loss plus text cross-entropy; stage 2 removes the direct visual loss and lets the self-generated latent tokens adapt to downstream reasoning purely through text supervision (Yang et al., 20 Jun 2025). The reported ablation on VSP planning (0.58 for full Mirage, 0.52 without Stage 1, and 0.21 without Stage 2) shows that both grounding and later relaxation matter. LVR similarly combines supervised latent reconstruction with reinforcement learning, where policy optimization is defined over text tokens while the latent trajectory is replayed as fixed hidden context (Li et al., 29 Sep 2025). Future-L1 follows the same pattern of SFT plus latent-aware RL, but with explicit latent rewards: an outcome-contrastive reward aligns successful latent trajectories with other successful ones, and a temporal-diversity reward discourages adjacent latent spans from collapsing into repeated states (Jiang et al., 4 Jun 2026).
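The difference between the two Mirage stages can be written as two loss functions. This is a sketch under assumptions: the cosine loss is taken to be one minus cosine similarity, the two terms are combined with a hypothetical `visual_weight`, and per-token averaging is chosen for simplicity. None of these details are confirmed by the source.

```python
import math
from typing import List

Vector = List[float]


def cosine_loss(pred: Vector, target: Vector) -> float:
    """1 - cosine similarity between a latent token and its helper-image target."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm


def text_cross_entropy(token_probs: List[List[float]], gold_ids: List[int]) -> float:
    """Mean negative log-likelihood of the gold text tokens."""
    return -sum(math.log(p[g]) for p, g in zip(token_probs, gold_ids)) / len(gold_ids)


def stage1_loss(latents, helper_embeds, token_probs, gold_ids, visual_weight=1.0):
    """Stage 1: text cross-entropy plus grounding of latents on helper embeddings."""
    visual = sum(cosine_loss(z, v) for z, v in zip(latents, helper_embeds)) / len(latents)
    return text_cross_entropy(token_probs, gold_ids) + visual_weight * visual


def stage2_loss(token_probs, gold_ids):
    """Stage 2: the visual term is dropped; latents adapt through text loss only."""
    return text_cross_entropy(token_probs, gold_ids)


token_probs = [[0.5, 0.5], [0.25, 0.75]]
gold_ids = [0, 1]
helper = [[3.0, 4.0]]
aligned = stage1_loss([[3.0, 4.0]], helper, token_probs, gold_ids)
misaligned = stage1_loss([[4.0, -3.0]], helper, token_probs, gold_ids)
```

When the latent already matches its helper embedding, the stage 1 loss reduces to the stage 2 loss; an orthogonal latent adds a visual penalty of exactly `visual_weight`.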
The optimization problem becomes sharper in methods that diagnose latent underutilization directly. "Visual Latents Know More Than They Say" identifies Silenced Visual Latents, where latent tokens become semantically enriched yet are bypassed during answer prediction because the autoregressive objective can route prediction through direct visual input instead (Zhang et al., 4 May 2026). Its solution is inference-time latent optimization with frozen backbone parameters: Stage I performs query-guided contrastive latent–visual alignment, and Stage II applies a confidence-progression reward that encourages token distributions along the latent span to become progressively more concentrated. DMLR reaches a related conclusion from a training-free direction. It inserts dedicated latent think tokens, perturbs them with Gaussian noise, and optimizes them at test time using a confidence reward based on truncated entropy, while dynamically injecting the most relevant visual patches after each latent token (Han, 14 Dec 2025). Both papers imply that latent reasoning channels do not become useful automatically; they require either explicit anti-bypass optimization or test-time latent search.
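Test-time latent search of the kind attributed to DMLR can be illustrated with a deliberately simplified sketch. It assumes that "truncated entropy" means entropy over the renormalized top-k probabilities, and it swaps in a plain accept-if-better random search with Gaussian perturbations in place of whatever optimizer the paper uses. `toy_predict` is a hypothetical stand-in for a frozen model that maps a latent to an answer distribution.

```python
import math
import random
from typing import Callable, List, Tuple

Vector = List[float]


def truncated_entropy(probs: List[float], top_k: int) -> float:
    """Entropy of the top-k probabilities after renormalization (assumed definition)."""
    top = sorted(probs, reverse=True)[:top_k]
    total = sum(top)
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)


def confidence_reward(probs: List[float], top_k: int = 3) -> float:
    """Higher when the prediction is more concentrated."""
    return -truncated_entropy(probs, top_k)


def optimize_latent(
    latent: Vector,
    predict: Callable[[Vector], List[float]],
    steps: int = 50,
    sigma: float = 0.1,
    seed: int = 0,
) -> Tuple[Vector, float]:
    """Perturb the latent with Gaussian noise; keep a candidate only if reward improves."""
    rng = random.Random(seed)
    best, best_reward = list(latent), confidence_reward(predict(latent))
    for _step in range(steps):
        candidate = [v + rng.gauss(0.0, sigma) for v in best]
        reward = confidence_reward(predict(candidate))
        if reward > best_reward:
            best, best_reward = candidate, reward
    return best, best_reward


def toy_predict(latent: Vector) -> List[float]:
    """Hypothetical frozen model: softmax over the latent coordinates."""
    exps = [math.exp(v) for v in latent]
    z = sum(exps)
    return [e / z for e in exps]


start = [0.0, 0.0, 0.0]
start_reward = confidence_reward(toy_predict(start))
tuned, tuned_reward = optimize_latent(start, toy_predict)
```

Model parameters never change in this loop; only the latent vector is edited. By construction the reward can only stay equal or improve, which is the sense in which the latent is "searched" rather than trained.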
A further complication is that latent generation may disappear under RL even when latent reasoning is beneficial during learning. "Leveraging Latent Visual Reasoning in Silence" shows that replacing latent tokens with random noise or removing them altogether often causes little degradation, and that post-training on Monet and LVR tends to suppress latent-mode frequency unless latent-text interaction is rewarded directly (Zhu et al., 18 May 2026). Its proposed attention-based reward encourages later text to attend to earlier latent tokens. This suggests that IVT-LR requires not only latent-state supervision, but also explicit optimization of cross-mode influence.
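One simple way to score whether later text attends to earlier latent tokens is to average the attention mass that text positions place on latent positions. The sketch below is an assumed form of such a reward, not the paper's formula: the function name, the single averaged attention matrix, and the toy values are all illustrative.

```python
from typing import List


def latent_attention_reward(
    attn: List[List[float]],
    latent_positions: List[int],
    text_positions: List[int],
) -> float:
    """Mean attention mass from each text position to earlier latent positions.

    attn[q][k] is the attention weight from query position q to key position k,
    assumed here to be already averaged over heads and layers.
    """
    masses = []
    for q in text_positions:
        masses.append(sum(attn[q][k] for k in latent_positions if k < q))
    return sum(masses) / len(masses) if masses else 0.0


# Toy causal attention over 5 positions: 0 = prompt, 1-2 = latent, 3-4 = text.
attn = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.5, 0.0, 0.0],
    [0.125, 0.25, 0.5, 0.125, 0.0],
    [0.5, 0.125, 0.125, 0.125, 0.125],
]
reward = latent_attention_reward(attn, latent_positions=[1, 2], text_positions=[3, 4])
```

Here position 3 puts 0.75 of its attention on the latent span and position 4 only 0.25, giving a reward of 0.5. A policy that routes around the latent span would drive this value toward zero.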
5. Empirical performance and application domains
The strongest empirical evidence for IVT-LR has come from perception-heavy benchmarks where text-only CoT is an obvious bottleneck. On M3CoT and ScienceQA, the original IVT-LR paper reports 71.8 and 94.6 accuracy respectively with Qwen2-VL-7B, alongside about 10.0 and 11.0 autoregressive steps, compared with much longer explicit-reasoning baselines; the paper summarizes the overall effect as an average 5.45% accuracy increase and a more than fivefold speedup (Chen et al., 14 Oct 2025). LVR reports 71.67% on MMVP versus 66.67% for Qwen2.5-VL, and 81.7 on a second perception benchmark with the 7B backbone, supporting the claim that latent visual reconstruction improves fine-grained perception (Li et al., 29 Sep 2025). Mirage shows similar gains in spatial reasoning: on VSP planning, its CoT setting reaches 0.58, versus 0.47 for the CoT SFT baseline, and 0.60 after GRPO (Yang et al., 20 Jun 2025).
Video event prediction offers a particularly natural setting for IVT-LR because future visual state is hard to verbalize. Future-L1 improves Qwen3-VL-8B from 61.0 to 85.4 on FutureBench and raises the average score on TwiFF-Bench from 2.44 to 3.04 by alternating text with continuous latent visual spans aligned to future-frame embeddings (Jiang et al., 4 Jun 2026). The gains are especially large on deeper and interpolation-heavy splits, which suggests that latent future imagination is more effective than purely textual future rationales on long-horizon event prediction.
Spatial reasoning and geometry also favor interleaving. VisuoThink, though not latent in the strict sense, improves geometry and spatial reasoning by alternating text deliberation with rendered visual aids and executor feedback. On Geomverse-109, GPT-4o rises from 11.1 with CoT to 28.9 with full VisuoThink; on Visual Navigation level-3, GPT-4o rises from 18.8 with CoT to 93.8 with VisuoThink (Wang et al., 12 Apr 2025). This indicates that repeatedly revisiting visual state during reasoning can matter as much as the specific latent representation used.
Embodied control provides a different but closely related application. IVLR for long-horizon manipulation uses an explicit interleaved text-image trace rather than latent reasoning, yet its ablations are highly informative: on LIBERO-Long, success drops from 92.4% with the full interleaved trace to 62.0% with text-only trace, 68.4% with vision-only trace, and 37.7% without trace (Liu et al., 1 May 2026). LaRA-VLA internalizes textual and visual CoT into a shared continuous latent space for action generation and reports 97.9 average success on LIBERO with up to 90% lower inference time than explicit CoT approaches (Bai et al., 1 Feb 2026). These results do not establish a single best IVT-LR architecture, but they do support the broader thesis that multimodal intermediate reasoning is most effective when both causal structure and geometric grounding remain available during inference.
6. Alternatives, critiques, and open problems
The IVT-LR literature is marked by an active dispute over how much explicit interleaving is actually necessary. UniVLR argues that most prior visual latent reasoning methods remain fragmented across separate text and vision channels, and proposes a unified visual workspace that removes explicit text CoT at inference time while using only 12 latent tokens by default and reporting a 15.2× reduction in generated reasoning tokens (Jiang et al., 12 May 2026). This is a deliberate challenge to IVT-LR as a design assumption: the paper’s claim is not that interleaving can be improved, but that it can be replaced.
A second critique concerns what counts as “visual” reasoning. StruVis is relevant precisely because it offers a counterexample: it improves reasoning-based text-to-image generation through text-based structured visual representations rather than images or learned visual latents, achieving a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE (Lyu et al., 6 Mar 2026). This suggests that some benefits attributed to IVT-LR may actually come from introducing an explicit scene-level intermediate state, whether or not that state is continuous or visually encoded. A plausible implication is that the decisive variable is sometimes structuredness, not modality.
The strongest technical critique is that latent channels may be present but unused. "Leveraging Latent Visual Reasoning in Silence" shows that removing or randomizing latent tokens in Monet and LVR often causes little performance degradation, and that RL can suppress latent-mode generation rather than strengthen it (Zhu et al., 18 May 2026). "Visual Latents Know More Than They Say" reaches a related conclusion from within latent-space optimization itself, identifying shortcut reliance on direct visual input as the mechanism that silences latent spans (Zhang et al., 4 May 2026). These findings imply that benchmark improvements alone are insufficient evidence for causal latent reasoning. IVT-LR systems increasingly require counterfactual tests, attention or routing analyses, and latent-ablation studies to establish that the interleaved states are functionally necessary.
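A latent-ablation study of the kind these critiques call for can be sketched as a small harness that compares accuracy under three conditions: original latents, Gaussian-noise latents, and removed latents. Everything here is hypothetical scaffolding: `answer_fn` stands in for a model that answers a question given a latent span, and the two toy models exist only to show what a "silent" and a "used" latent channel look like.

```python
import random
from typing import Callable, Dict, List, Tuple

Latents = List[List[float]]
Example = Tuple[str, Latents, str]  # (question, latent span, gold answer)


def accuracy(answer_fn, examples: List[Example], transform) -> float:
    """Accuracy after applying `transform` to each example's latent span."""
    correct = sum(answer_fn(q, transform(z)) == gold for q, z, gold in examples)
    return correct / len(examples)


def ablation_report(
    answer_fn: Callable[[str, Latents], str],
    examples: List[Example],
    sigma: float = 1.0,
    seed: int = 0,
) -> Dict[str, float]:
    rng = random.Random(seed)
    return {
        "original": accuracy(answer_fn, examples, lambda z: z),
        "random_noise": accuracy(
            answer_fn, examples,
            lambda z: [[rng.gauss(0.0, sigma) for _ in vec] for vec in z],
        ),
        "removed": accuracy(answer_fn, examples, lambda z: []),
    }


# Toy models: one ignores the latents entirely, one depends on them.
def silent_model(question: str, latents: Latents) -> str:
    return "yes" if question == "q1" else "no"


def latent_model(question: str, latents: Latents) -> str:
    return "yes" if latents and latents[0][0] > 0 else "no"


examples = [("q1", [[2.0]], "yes"), ("q2", [[-2.0]], "no")]
silent = ablation_report(silent_model, examples)
used = ablation_report(latent_model, examples)
```

For the silent model all three numbers coincide, which is the signature of a bypassed latent channel. For the latent-dependent model, removing the span lowers accuracy, which is the kind of evidence benchmark scores alone cannot provide.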
Open problems follow directly from these tensions. Adaptive routing remains unresolved: SwimBird shows that different tasks favor text-only, vision-only, or interleaved modes, but the paper does not formalize a unified mixed-sequence probabilistic model (Tong et al., 5 Feb 2026). Interpretability is limited because continuous latent spans are difficult to inspect, even when methods such as Laser claim decodable semantic trajectories (Wang et al., 11 Jan 2026). Compute allocation is also unsolved. Future-L1 finds that a limited budget works best for latent spans, while larger budgets hurt performance (Jiang et al., 4 Jun 2026); VisuoThink shows that more search depth helps only up to a point, and more tree width can reduce accuracy because branch scoring is imperfect (Wang et al., 12 Apr 2025). This suggests that IVT-LR is not simply a matter of adding more latent steps. The unresolved design question is how to allocate multimodal latent computation dynamically, causally, and verifiably.
In that sense, IVT-LR remains less a single architecture than a research program: one line seeks tighter interleaving between text and visual latents; another seeks unified latent workspaces that eliminate explicit text; a third questions whether latent traces must survive at inference at all. The common premise across these lines is stable: explicit textual CoT is often an inadequate interface for visual reasoning, and multimodal systems benefit when intermediate state can preserve visual semantics without collapsing into either pure text or raw pixels (Chen et al., 14 Oct 2025).