ViRC Framework: Visual Interleaved Reasoning

Updated 23 December 2025
  • ViRC Framework is a multimodal approach that segments the reasoning process into Critical Reasoning Units (CRUs) for interleaved visual and textual analysis.
  • It utilizes a staged curriculum with instructional SFT, practice SFT, and strategic RL to align model training with human cognitive strategies.
  • Empirical evaluations on benchmarks like GeoQA and MathVista-Math reveal significant accuracy improvements and robust performance over baseline models.

ViRC (Visual interleaved Reasoning with Chunking) is a multimodal framework designed to address the shortcomings of existing Multimodal LLMs (MLLMs) in mathematical reasoning tasks that require interleaved visual and textual cognition. Unlike standard approaches, ViRC introduces a hierarchical Reason Chunking mechanism—structuring the Chain-of-Thought (CoT) process into contiguous Critical Reasoning Units (CRUs)—that enables stepwise, proposition-centric inference matched to the cognitive strategies of human experts. This structured approach is supported by the CRUX dataset, meticulously annotated at the CRU level, and trained in a staged curriculum reflecting progressive human learning. Empirical evaluation demonstrates substantial improvements over leading baselines, both in accuracy and qualitative robustness, and suggests new directions for cognitively inspired multimodal reasoning architectures (Wang et al., 16 Dec 2025).

1. Motivation and Cognitive Foundations

The motivation for ViRC arises from the inadequacy of current MLLMs in mathematical visual tasks. Typical systems rely on two suboptimal strategies:

  • Static visual encoding: encoding the entire diagram once, followed by purely textual CoT steps; this approach neglects fine-grained visual cues essential for intermediate proposition verification.
  • Overly dense visual interleaving: alternating every reasoning step with a visual token (full image or cropped patch), which often results in irrelevant or redundant visual grounding.

ViRC draws on cognitive theories, most notably Miller's Law, which posits a working-memory chunking limit of approximately 7 ± 2 coherent units, and on observations that expert human solvers naturally segment problems into intermediate logical propositions, attending only to pertinent subregions of a visual scene at each step. The framework operationalizes these insights by enforcing a hierarchy of reasoning units and selective visual attention, encoded within the CRU mechanism.

2. Architectural Overview and the Reason Chunking Mechanism

At its core, ViRC organizes the reasoning process as a sequence of CRUs:

$$\text{CRU}^{(i)} = \left(v^{(i)},\ \{s^{(i,1)}, \dots, s^{(i,m_i)}\}\right)$$

where $v^{(i)}$ represents the visual context acquired by a tool call (e.g., crop, scale, display), and $\{s^{(i,\ell)}\}$ are coherent textual steps addressing an intermediate proposition. The overall inference trajectory is

$$\left[\text{CRU}^{(1)}, \text{CRU}^{(2)}, \dots, \text{CRU}^{(N)}, \text{answer}\right]$$

with $N \ll K$, the total step count, enforcing that a small number of semantically meaningful visual–text chunks govern the reasoning process.
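
As a concrete data-structure reading of this formalism, the following sketch captures the CRU and trajectory types; the class names and fields are illustrative assumptions, not the paper's actual code:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ToolCall:
    """Hypothetical visual tool call that yields the context v^(i)."""
    name: str            # "crop", "scale", or "display"
    args: Dict           # e.g., {"box": (x0, y0, x1, y1)} for a crop

@dataclass
class CRU:
    """One Critical Reasoning Unit: a visual context plus coherent text steps."""
    visual_context: ToolCall                          # v^(i)
    steps: List[str] = field(default_factory=list)    # s^(i,1), ..., s^(i,m_i)

@dataclass
class Trajectory:
    """A short sequence of CRUs followed by the final answer (N << K)."""
    crus: List[CRU]
    answer: str
```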

Within each CRU, intra-unit textual coherence is enforced; multiple sentences reason about the same intermediate goal over a single visual context. Inter-unit transitions are mediated via explicit tool calls, allowing the model to selectively fetch new visual data—targeting different subregions, zoom levels, or previously visited areas as needed. Implementation uses a frozen vision encoder, a pretrained LLM backbone, and an agentic inference loop that structures memory, tool prediction, and step generation at the CRU level.
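
A minimal sketch of such an agentic loop follows; every method on `model` is a hypothetical interface assumed for illustration, not the paper's published API, and `CRU`/`Trajectory` are the dataclasses sketched above:

```python
def run_virc_inference(model, image, question, max_crus=8):
    """Sketch of a CRU-level agentic loop: alternate visual tool calls and
    coherent textual steps until the model signals an answer is ready."""
    crus = []
    for _ in range(max_crus):
        # 1. Predict the next visual tool call (crop / scale / display).
        tool_call = model.propose_tool(image, question, crus)
        # 2. Execute it to fetch a fresh visual context (subregion, zoom level).
        visual = model.acquire_visual(tool_call, image)
        # 3. Generate the coherent text steps for this intermediate proposition.
        steps = model.generate_steps(visual, question, crus)
        crus.append(CRU(visual_context=tool_call, steps=steps))
        if model.answer_ready(crus):   # hypothetical stop criterion
            break
    return Trajectory(crus=crus, answer=model.final_answer(crus))
```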

3. Data and Supervision: The CRUX Dataset

Training the ViRC model requires fine-grained CRU-level annotations, provided by the CRUX dataset. Construction of CRUX involves:

  • Sampling diverse reasoning paths per problem and image at multiple resolutions.
  • Decomposing each path into primitive reasoning steps, each annotated by focal object.
  • Grouping sequential steps sharing the same intermediate proposition into a single CRU, and integrating explicit visual grounding for each CRU via visual tool calls (crop, scale, display).
  • Annotating reasoning patterns: Planning, Reflecting, Verifying, and Backtracking, to structure both step content and tool usage.

CRUX statistics indicate an average of 4.27 CRUs per problem, with documented diversity in planning and verification strategies. This granularity enables supervision of both structural and visual grounding aspects in ViRC.
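
The grouping step of the pipeline above can be sketched as follows; the `(proposition_id, text)` step encoding and all names are assumptions for illustration, not the published CRUX tooling:

```python
def group_into_crus(steps):
    """Group consecutive primitive reasoning steps that share the same
    intermediate proposition into one CRU. Each step is assumed to be a
    (proposition_id, text) pair produced by upstream annotation."""
    crus, current_prop, current_steps = [], None, []
    for prop_id, text in steps:
        if prop_id != current_prop and current_steps:
            crus.append(current_steps)   # close the previous CRU
            current_steps = []
        current_prop = prop_id
        current_steps.append(text)
    if current_steps:
        crus.append(current_steps)       # close the final CRU
    return crus

# Example: two steps on proposition "p1", then one on "p2" -> 2 CRUs.
steps = [("p1", "AB is a diameter"), ("p1", "so angle ACB = 90 degrees"),
         ("p2", "triangle similarity gives the side ratio")]
assert len(group_into_crus(steps)) == 2
```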

4. Training Curriculum and Optimization

ViRC leverages a staged progressive training strategy reflecting cognitive learning:

  • Instructional SFT: Teaches CRU structure in a text-only regime (visual data masked out; see the masking sketch after this list), using cross-entropy loss over tool calls and reasoning steps.
  • Practice SFT: Inserts instantiated visual crops per CRU, training with full multimodal data and the same cross-entropy objective.
  • Strategic RL: Refines CRU and tool selection on a curated hard subset using Group Relative Policy Optimization (GRPO). The reward incorporates final answer correctness, multimodal matching (text and visual), pattern fidelity, and output validity.
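
As a toy illustration of the text-only first stage, visual tokens can simply be blanked out of the supervised sequence; the placeholder token ID below is an assumption for illustration, not a real model constant:

```python
import torch

VISUAL_TOKEN_ID = 32000   # hypothetical placeholder id for visual tokens

def mask_visual_tokens(input_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Instructional-SFT sketch: blank out visual tokens so only the CRU
    structure, tool calls, and reasoning steps receive supervision."""
    masked = input_ids.clone()
    masked[masked == VISUAL_TOKEN_ID] = pad_id
    return masked

# Example: a sequence with two visual tokens interleaved among text tokens.
seq = torch.tensor([5, 32000, 17, 8, 32000, 42])
print(mask_visual_tokens(seq))   # tensor([ 5,  0, 17,  8,  0, 42])
```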

Mathematically, the supervised loss is

$$\mathcal{L}_{\text{SFT}} = -\sum \log p_\theta(\text{token}^* \mid \text{context}),$$

and reinforcement learning uses a normalized advantage estimator, with the reward composed of answer, multimodal-structure, pattern, and formatting terms.
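
A minimal sketch of a GRPO-style group-normalized advantage together with a composite reward of the kind described; the weights and term names are illustrative assumptions, not the paper's values:

```python
import torch

def composite_reward(answer_ok, mm_match, pattern_fid, valid,
                     w=(1.0, 0.3, 0.2, 0.1)):
    """Weighted sum over the four reward terms named in the text: answer
    correctness, multimodal matching, pattern fidelity, output validity.
    The weights are assumptions for illustration."""
    return w[0]*answer_ok + w[1]*mm_match + w[2]*pattern_fid + w[3]*valid

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: normalize each rollout's reward by the
    mean and standard deviation of its group for the same problem."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts for one problem, scored on the four reward terms.
rewards = torch.tensor([composite_reward(1, 0.8, 1.0, 1.0),
                        composite_reward(0, 0.5, 0.5, 1.0),
                        composite_reward(1, 0.9, 0.8, 1.0),
                        composite_reward(0, 0.2, 0.0, 0.0)])
print(grpo_advantages(rewards))
```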

5. Benchmark Results and Empirical Insights

ViRC-7B outperforms strong baselines on key multimodal mathematical benchmarks, including GeoQA, MMStar-Math, and MathVista-Math:

  • GeoQA: 75.07% vs 62.86% for MM-Eureka-7B.
  • MathVista-Math: 81.11% vs 72.59% for MM-Eureka-7B.
  • Average improvement over Qwen2.5-VL-7B-Instruct: +18.8%.

Ablation studies underscore the necessity of CRU chunking—removal reduces accuracy by 7.7 percentage points. Combined reasoning patterns (planning, reflecting, verifying, backtracking) produce synergistic effects. Training without the hard RL subset or pattern rewards results in measurable regression of generalization and accuracy. The CRU mechanism is directly responsible for performance gains and increased robustness across datasets.

6. Limitations and Open Challenges

Observed limitations include:

  • Heavy dependence on the large-scale, richly annotated CRUX dataset for effective training.
  • Inference-time latency, due to alternating tool calls and multi-step CRUs.
  • Sensitivity to reward weighting and hard subset selection in the RL fine-tuning stage.

Future directions outlined include:

  • Discovery of chunk boundaries via unsupervised learning.
  • Extension of the CRU mechanism to video and 3D scene reasoning ("temporal" or "depth" chunks).
  • Design of richer visual analytic tools (e.g., measurement, angle-drawing) to augment reasoning interfaces.
  • Incorporation of additional cognitive patterns such as explanation, self-critique, and reflection.
  • Broader application to other domains with formal visual reasoning (physics, chemistry diagrams).

7. Relevance and Impact

ViRC’s Reason Chunking provides a principled, interpretable, and empirically validated approach to aligning MLLM reasoning with known human expert strategies in visual mathematics. The framework establishes new state-of-the-art results, enables structured interpretability of model inference, and opens paths to more generalizable, cognitively grounded multimodal reasoning systems (Wang et al., 16 Dec 2025). The codebase and CRUX dataset are publicly available, facilitating further research and independent benchmarking.
