ReVisual-R1: Multimodal Reasoning Framework

Updated 30 June 2025
  • ReVisual-R1 is a state-of-the-art multimodal language model framework that integrates visual and textual inputs for advanced chain-of-thought reasoning.
  • It employs a three-stage reinforcement learning pipeline with Prioritized Advantage Distillation to overcome gradient stagnation and optimize multimodal performance.
  • The model outperforms larger counterparts on mathematical and logic benchmarks, demonstrating the efficacy of curriculum-driven, cross-modal training.

ReVisual-R1 is a state-of-the-art open-source Multimodal LLM (MLLM) framework specifically optimized for self-reflective, chain-of-thought reasoning in visuo-mathematical and logic tasks. It employs a strategically staged reinforcement learning (RL) curriculum, with explicit architectural and algorithmic choices to achieve robust, generalizable multimodal reasoning—outperforming or matching much larger models on a suite of mathematical and logic benchmarks. ReVisual-R1 is built atop the Qwen2.5-VL-7B-Instruct architecture and is notable for its curriculum design, handling of RL gradient stagnation, and the introduction of Prioritized Advantage Distillation (PAD) for optimization stability.

1. Architectural and Algorithmic Foundations

ReVisual-R1’s base is Qwen2.5-VL-7B-Instruct, a 7-billion-parameter model that combines visual and linguistic representations. Inputs consist of a tuple $x = (v, q)$, where $v$ is the visual content and $q$ is the textual query. The model produces a multi-step, self-reflective reasoning trace $t$, culminating in a solution $y$.

The learning objective maximizes expected reward over multimodal reasoning tasks:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{y \sim \pi_\theta(y|x)} \big[ r(y, x) \big],$$

where $r(y, x) = 1$ for correct $y$ and $0$ otherwise.

The primary reinforcement mechanism is Group Relative Policy Optimization (GRPO), a group-wise policy optimization approach that evaluates model outputs within dynamically constructed sample groups, using normalized advantage as the learning signal. The advantage for a sample is:

$$\hat{A}(x, y_i) = \frac{r(x, y_i) - \text{mean}_j\, r(x, y_j)}{\text{std}_j\, r(x, y_j) + \epsilon}$$

This group-wise normalization ensures stable updates and robust policy improvement.
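
A minimal sketch of this group-wise normalization, assuming NumPy and binary rewards; the function name is illustrative and not taken from the ReVisual-R1 codebase:

```python
import numpy as np

def group_normalized_advantage(rewards, eps=1e-6):
    """GRPO-style advantages for one group of responses sampled for the same prompt.

    rewards: scalar rewards r(x, y_i) for the group's responses
             (binary 0/1 in ReVisual-R1's accuracy-based setting).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 3 of 8 sampled responses are correct.
print(group_normalized_advantage([1, 0, 0, 1, 0, 0, 1, 0]))
# An all-correct or all-incorrect group yields all-zero advantages:
# this is the "gradient stagnation" failure mode discussed below.
```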

2. Staged Reinforcement Optimization (SRO): A Three-Stage Pipeline

ReVisual-R1’s pipeline consists of three sequential phases designed to maximize both linguistic and multimodal reasoning ability:

Stage 1: Textual Cold Start Initialization

High-difficulty, text-only chain-of-thought (CoT) samples (~40k) are used for initial fine-tuning. This phase instills advanced sequential reasoning skills, establishing an abstract reasoning foundation before any multimodal exposure. Results show that such text initialization alone gives performance that surpasses many models trained exclusively on multimodal data, suggesting textual complexity as a robust driver of reasoning capability.
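
As a rough illustration of this cold-start phase, the sketch below shows a bare-bones supervised fine-tuning loop over text-only CoT strings, assuming a HuggingFace-style causal LM that returns a loss when given labels; the batch size, optimizer settings, and data handling are placeholders, not the paper's configuration:

```python
import torch

def cold_start_sft(model, tokenizer, cot_texts, lr=1e-5, max_len=4096):
    """Illustrative text-only cold-start fine-tuning pass (single epoch, batch size 1)."""
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for text in cot_texts:  # each item: prompt + chain-of-thought + final answer
        batch = tokenizer(text, return_tensors="pt",
                          truncation=True, max_length=max_len)
        out = model(**batch, labels=batch["input_ids"])  # standard next-token CE loss
        out.loss.backward()
        optim.step()
        optim.zero_grad()
```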

Stage 2: Multimodal Reinforcement Learning

A curated set (~26k) of multimodal reasoning samples (from the GRAMMAR dataset) is used for RL. In this phase, the model’s linguistic reasoning is grounded in visual stimuli. Training employs GRPO, grouping inputs by type or topic to compute normalized advantages and updating the policy according to the clipped surrogate objective:

$$\mathbb{E}_{x \sim \mathcal{G}_i}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ \min\!\left( \frac{\pi_\theta(y|x)}{\pi_{\theta_{\text{ref}}}(y|x)}\, \hat{A}(x, y),\; \operatorname{clip}\!\left( \frac{\pi_\theta(y|x)}{\pi_{\theta_{\text{ref}}}(y|x)},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}(x, y) \right) \right]$$

A crucial finding is that standard GRPO with sparse/binary rewards suffers from “gradient stagnation” (zero advantages in groups with all-correct or all-incorrect samples), sharply reducing effective batch size and learning efficiency.
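
A compact sketch of the clipped surrogate above, assuming PyTorch and sequence-level log-probabilities; the function name and clip value are illustrative:

```python
import torch

def grpo_clipped_surrogate(logp_new, logp_ref, advantages, clip_eps=0.2):
    """Clipped surrogate for one group; to be maximized (negate for a loss).

    logp_new:   log pi_theta(y|x) for each sampled response (current policy)
    logp_ref:   log pi_theta_ref(y|x) for the same responses (reference policy)
    advantages: group-normalized advantages A_hat(x, y)
    """
    ratio = torch.exp(logp_new - logp_ref)                        # pi_theta / pi_theta_ref
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.minimum(unclipped, clipped).mean()
```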

Stage 3: Textual Reinforcement Learning

Following multimodal RL, the model undergoes further RL on high-quality, text-only tasks (~30k) with the vision tower kept frozen. This restores abstract textual reasoning and fluency, which can decay during multimodal fine-tuning, a phenomenon observed empirically.
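
A hedged sketch of keeping the vision tower frozen during this stage; the parameter-name prefixes are assumptions and depend on the actual Qwen2.5-VL implementation in use:

```python
def freeze_vision_tower(model, vision_prefixes=("visual", "vision_tower")):
    """Freeze parameters that appear to belong to the vision encoder.

    The prefixes are guesses; check the actual parameter names of the
    Qwen2.5-VL checkpoint being used (e.g. via model.named_parameters()).
    """
    for name, param in model.named_parameters():
        if name.startswith(vision_prefixes):
            param.requires_grad = False
```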

3. Prioritized Advantage Distillation (PAD) and Gradient Stagnation

A central methodological advancement is the introduction of Prioritized Advantage Distillation (PAD) to address GRPO’s gradient stagnation.

  • Gradient Stagnation: In multimodal RL, groups with all-correct/all-incorrect outputs yield zero normalized advantage for every sample, providing no learning signal and wasting compute.
  • PAD Solution: Only samples with a substantial, nonzero advantage ($|\hat{A}_i|$ within preset thresholds) are retained for updates. From these, prioritized subsampling is applied:
$$\Pr(i \mid i \in \mathcal{E}) = \frac{\exp(\hat{A}_i/\tau)}{\sum_{j \in \mathcal{E}} \exp(\hat{A}_j/\tau)}$$
with temperature parameter $\tau$ controlling exploration. This focuses RL updates on the most informative examples and keeps the batch effective even as easy tasks saturate (see the sketch after this list).
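
A minimal sketch of PAD-style subsampling as described above, assuming NumPy; the function name, threshold band, and default values are illustrative rather than the paper's exact procedure:

```python
import numpy as np

def pad_subsample(advantages, k, tau=1.0, a_min=1e-3, a_max=None, rng=None):
    """PAD-style prioritized subsampling over one batch of advantages.

    Keeps only samples whose |A_hat| lies in the effective band, then draws k
    of them without replacement with probability proportional to exp(A_hat / tau).
    """
    rng = rng or np.random.default_rng()
    adv = np.asarray(advantages, dtype=np.float64)
    mask = np.abs(adv) >= a_min                     # drop zero-advantage samples
    if a_max is not None:
        mask &= np.abs(adv) <= a_max
    idx = np.flatnonzero(mask)                      # the effective set E
    if idx.size == 0:
        return idx                                  # nothing informative in this batch
    logits = adv[idx] / tau
    probs = np.exp(logits - logits.max())           # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(idx, size=min(k, idx.size), replace=False, p=probs)
```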

An additional length-based reward is defined to encourage concise yet adequately detailed responses:

$$R_{\text{raw}} = \alpha \, (L_{\text{budget}} - L_y) + \delta, \qquad R_{\text{len}} = \max\!\big(0,\, \min(1,\, R_{\text{raw}})\big)$$
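
The length reward is straightforward to compute; a small sketch with placeholder values for $\alpha$ and $\delta$:

```python
def length_reward(resp_len, budget_len, alpha=1e-3, delta=0.5):
    """R_raw = alpha * (L_budget - L_y) + delta, clipped to [0, 1].

    alpha and delta here are placeholder values, not the paper's settings.
    """
    r_raw = alpha * (budget_len - resp_len) + delta
    return max(0.0, min(1.0, r_raw))
```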

4. Performance and Reasoning Capabilities

ReVisual-R1 demonstrates state-of-the-art accuracy among open-source 7B MLLMs across challenging multimodal mathematical and logic reasoning benchmarks, including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and AIME2024/2025.

Notable results include:

  • Significant gains (e.g., +44.5% accuracy on AIME2024 compared to previous open-source bests).
  • Performance on par with or exceeding much larger (and/or closed-source) models.
  • Robustness across both visuo-mathematical and logic domains, reflecting balanced development of linguistic and multimodal faculties.
  • Ablation studies show that each SRO component is necessary: textual cold start for reasoning depth, multimodal RL for cross-modal grounding, and textual RL for maintaining abstraction and fluency.

These achievements are attributable to:

  • Curriculum learning: Sequential textual → multimodal → textual RL stages enable synergistic skill development.
  • Algorithmic stabilization: PAD ensures persistent, informative training signals, overcoming RL inefficiencies inherent in sparse-reward MLLM settings.
  • Reward shaping: Length rewards mitigate mode collapse, and group normalization ensures optimization focuses on nontrivial samples.

5. Broader Implications for Multimodal Reasoning Research

The findings and architecture of ReVisual-R1 suggest several research and practical implications:

  • Foundational Role of Textual Reasoning: High-difficulty, text-only reasoning skills should be established prior to introducing multimodal data—a textual cold start is foundational rather than auxiliary.
  • Curriculum and Stage Ordering: Effective multimodal reasoning in MLLMs depends critically on the sequence and coverage of RL stages. Post-multimodal textual RL is essential to prevent catastrophic forgetting of linguistic competencies.
  • Necessity of Algorithmic Stabilizers: Techniques like PAD or similar prioritization are essential for training MLLMs with RL in sparse-reward, high-variance settings, avoiding wasted computation and ensuring sample efficiency.
  • Open Research Areas:
    • Construction of richer multimodal reasoning datasets with fine-grained annotations and difficulty.
    • Advances in adaptive advantage estimation, sample selection, and automated curriculum scheduling.
    • Extension to higher complexity modalities, such as video, 3D data, or multilingual inputs.
    • Further analysis of the synergy and trade-off between visual grounding and abstract reasoning capacities.

6. Summary Table: Optimization Principles in ReVisual-R1

| Principle | Mathematical Expression / Mechanism |
|---|---|
| GRPO RL objective | Clipped surrogate over group-normalized advantages (see Section 2) |
| Group-wise advantage | $\hat{A}(x, y_i)$: reward minus group mean, divided by group std |
| PAD sampling probability | Softmax over advantages with temperature $\tau$ |
| Efficient-length reward | $R_{\text{len}}$: clipped linear reward that penalizes verbosity |
| Curriculum structure | Text cold start → Multimodal RL → Text RL |

7. Conclusions

ReVisual-R1 represents a principled, curriculum-driven, and algorithmically robust approach to multimodal reasoning in LLMs. Its success underscores that model capability is closely tied to systematic pre-training, RL pipeline design, and optimization strategies that explicitly mitigate RL bottlenecks such as gradient stagnation. This model and its training methodology set a reference for subsequent research targeting balanced and high-level multimodal reasoning within compute-constrained settings.