- The paper introduces OpenVLThinker, an LVLM that employs an iterative SFT and RL strategy to progressively enhance multimodal reasoning.
- It distills text-based reasoning from models like DeepSeek-R1 and refines it through repeated cycles of supervised fine-tuning and reinforcement learning.
- Experimental results on MathVista, MathVerse, and MathVision benchmarks demonstrate substantial improvements in complex visual reasoning tasks.
This work introduces OpenVLThinker, a Large Vision-Language Model (LVLM) developed to enhance complex multimodal reasoning capabilities through an iterative self-improvement training strategy (2503.17352). The research is motivated by the success of LLMs like DeepSeek-R1 in acquiring sophisticated reasoning skills (e.g., self-verification, self-correction) via Reinforcement Learning (RL) with verifiable rewards, particularly on challenging text-based tasks. This paper explores the transfer and enhancement of such reasoning abilities within the vision-language domain.
Methodology: Iterative SFT and RL
The core methodology revolves around an iterative cycle combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). This process aims to progressively enhance the model's multimodal reasoning skills.
- Initial Reasoning Distillation: The process begins by distilling reasoning capabilities from proficient text-only LLMs (specifically, models related to DeepSeek-R1). This is achieved by prompting these text models to generate reasoning steps based on high-quality captions derived from images sourced from diverse visual datasets; a minimal sketch of this step appears after this list. The result is an initial dataset pairing visual context (via captions) with structured textual reasoning paths.
- Supervised Fine-Tuning (SFT): The LVLM is initially fine-tuned on this distilled dataset. This SFT phase aims to imbue the base LVLM with the foundational reasoning patterns observed in the text-only teacher models, adapted to a multimodal context represented by image captions.
- Reinforcement Learning (RL) Enhancement: Following SFT, the model undergoes RL training. This phase uses techniques analogous to those employed for DeepSeek-R1, likely centered on RL with verifiable rewards adapted to multimodal tasks, where the reward signal comes from checkable outcomes (e.g., correctness of the final answer in a math problem) or, potentially, process supervision. The goal is to refine the model's ability to generate valid, coherent, and correct reasoning steps beyond what was learned through imitation in SFT.
- Iterative Refinement: The key innovation lies in the iterative application of these steps. The model improved by the RL phase at iteration i (M_{RL_i}) is used to generate a new, potentially higher-quality SFT dataset for iteration i+1 by producing refined reasoning chains over the training data; a concrete sketch of this data-generation step follows the pseudocode below. The next iteration then performs SFT on this new dataset (SFT_{i+1}), followed by another round of RL (RL_{i+1}). This cycle (SFT → RL → data generation) allows the model to progressively bootstrap its own reasoning capabilities.
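A minimal sketch of the distillation step described above, assuming a generic text-only reasoner behind a hypothetical `query_text_reasoner` callable; the prompt template and dataset fields are illustrative rather than the paper's exact ones:

```python
# Illustrative sketch of caption-based reasoning distillation; the prompt
# template, field names, and `query_text_reasoner` callable are assumptions.
from typing import Callable

PROMPT_TEMPLATE = (
    "You are given a description of an image and a question about it.\n"
    "Image description: {caption}\n"
    "Question: {question}\n"
    "Think step by step, then give the final answer inside \\boxed{{}}."
)

def distill_reasoning(
    examples: list[dict],                       # each: {"image", "caption", "question"}
    query_text_reasoner: Callable[[str], str],  # stand-in for a DeepSeek-R1-style model
) -> list[dict]:
    """Build an initial SFT dataset pairing images with text-derived reasoning traces."""
    sft_data = []
    for ex in examples:
        prompt = PROMPT_TEMPLATE.format(caption=ex["caption"], question=ex["question"])
        reasoning = query_text_reasoner(prompt)  # chain-of-thought plus final answer
        sft_data.append({
            "image": ex["image"],        # the LVLM is later tuned on the image itself
            "question": ex["question"],
            "target": reasoning,
        })
    return sft_data
```

Here each record retains the original image under the assumption that the LVLM is subsequently fine-tuned on image-question pairs with the distilled trace as the target.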
This iterative self-improvement loop can be represented as:
```
Initialize LVLM M_0
For iteration i = 1 to N:
    # Generate SFT data using previous model (M_{i-1}) or distilled data (for i=1)
    SFT_Data_i = GenerateReasoningData(M_{i-1}, ImageCaptions)
    # Supervised Fine-Tuning
    M_{SFT_i} = SFT(M_{i-1}, SFT_Data_i)
    # Reinforcement Learning
    M_{RL_i} = RL(M_{SFT_i}, RewardSignal)
    # Update model for next iteration
    M_i = M_{RL_i}
Output: OpenVLThinker = M_N
```
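One plausible concretization of `GenerateReasoningData` for iterations i > 1 is sketched below, under the assumption that self-generated traces are kept only when their final answer verifies against the ground truth; the filtering rule and the `model.generate`, `extract_answer`, and `answers_match` interfaces are illustrative, not the paper's exact procedure.

```python
# Hypothetical sketch of GenerateReasoningData for iterations i > 1: the current
# model re-answers the training questions, and only traces whose final answer
# verifies against the ground truth are kept as SFT targets for the next round.
def generate_reasoning_data(model, examples, extract_answer, answers_match):
    """Produce a refreshed SFT dataset from model M_{i-1}'s own reasoning traces."""
    sft_data = []
    for ex in examples:  # each example: {"image", "question", "answer"}
        trace = model.generate(image=ex["image"], question=ex["question"])
        predicted = extract_answer(trace)            # e.g., parse a \boxed{...} answer
        if answers_match(predicted, ex["answer"]):   # verifiable-correctness filter
            sft_data.append({
                "image": ex["image"],
                "question": ex["question"],
                "target": trace,
            })
    return sft_data
```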
Implementation and Training Details
While the specific base LVLM architecture is not detailed in the abstract, the process implies starting with a pre-trained LVLM. The training leverages diverse visual datasets for generating the initial captions used in the distillation phase. The iterative process relies on generating lightweight SFT datasets in each round, suggesting that the scale of generated data per iteration might be manageable. The RL phase requires a mechanism for providing verifiable rewards, which, in the context of tasks like visual math problems, could involve checking the final numerical answer or potentially comparing generated reasoning steps against ground truth solutions or preferences. The target benchmarks – MathVista, MathVerse, and MathVision – indicate a focus on quantitative and structured reasoning involving visual elements, necessitating capabilities beyond standard VQA or captioning.
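As an illustration of such a verifiable reward, the sketch below extracts a final \boxed{...} answer from a generated solution and scores it against the reference answer; the answer convention and numeric tolerance are assumptions rather than the paper's exact reward definition.

```python
import re

def extract_boxed_answer(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a generated solution, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

def verifiable_reward(generation: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches the reference, else 0.0."""
    predicted = extract_boxed_answer(generation)
    if predicted is None:
        return 0.0
    try:    # compare numerically when both sides parse as numbers
        return float(abs(float(predicted) - float(ground_truth)) < 1e-6)
    except ValueError:  # otherwise fall back to exact string match
        return float(predicted == ground_truth)
```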
Experimental Results and Evaluation
The primary outcome reported is the consistently improved reasoning performance of OpenVLThinker across the challenging multimodal reasoning benchmarks: MathVista, MathVerse, and MathVision. These benchmarks typically require multi-step logical deduction, mathematical calculation, and grounding of abstract concepts in visual information. The success on these benchmarks validates the effectiveness of the proposed iterative SFT+RL strategy for enhancing complex reasoning in LVLMs. The paper positions OpenVLThinker as demonstrating the potential of this self-improvement methodology for developing more robust vision-language reasoning systems. Specific scores and the magnitude of the improvements are not detailed in the abstract but are presumably available in the full paper.
Practical Implications and Availability
The practical implication of this research is a potential pathway for developing LVLMs with significantly enhanced abilities to tackle complex tasks requiring step-by-step reasoning grounded in visual input. This includes applications in:
- Educational Tools: Solving visual math or physics problems.
- Data Analysis: Interpreting charts and diagrams that require calculations or logical inference.
- Instruction Following: Executing complex instructions involving objects and relations depicted in an image.
- Accessibility: Describing complex visual scenes or documents in a structured, logical manner.
The iterative nature of the training, while powerful, likely entails significant computational cost, especially the RL phase and the repeated data generation/SFT cycles. Scaling this approach to larger models or datasets would require substantial resources. The reliance on high-quality captions for the initial distillation, and on well-defined reward functions for RL, is a practical consideration for implementation.
The authors have made the code, model, and data available, facilitating reproducibility and further research in this direction: https://github.com/yihedeng9/OpenVLThinker.
Conclusion
OpenVLThinker represents an investigation into enhancing complex reasoning in LVLMs by adapting and extending techniques proven successful in text-only LLMs. The proposed iterative self-improvement cycle, alternating between SFT on model-generated data and RL refinement using verifiable rewards, demonstrates a promising approach. The reported performance gains on mathematically oriented visual reasoning benchmarks suggest that this methodology can effectively instill more sophisticated, step-by-step reasoning capabilities into vision-language models.