
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement (2503.17352v1)

Published 21 Mar 2025 in cs.CV and cs.CL

Abstract: Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in LLMs, including sophisticated behaviors such as self-verification and self-correction, can be achieved through RL with verifiable rewards, significantly improving model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps using high-quality captions of the images sourced from diverse visual datasets. Subsequently, iterative RL training further enhances reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yielded OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model, and data are available at https://github.com/yihedeng9/OpenVLThinker.

Summary

  • The paper introduces OpenVLThinker, an LVLM that employs an iterative SFT and RL strategy to progressively enhance multimodal reasoning.
  • It distills text-based reasoning from models like DeepSeek-R1 and refines it through repeated cycles of supervised fine-tuning and reinforcement learning.
  • Experimental results on MathVista, MathVerse, and MathVision benchmarks demonstrate substantial improvements in complex visual reasoning tasks.

This work introduces OpenVLThinker, a Large Vision-Language Model (LVLM) developed to enhance complex multimodal reasoning capabilities through an iterative self-improvement training strategy (2503.17352). The research is motivated by the success of LLMs like DeepSeek-R1 in acquiring sophisticated reasoning skills (e.g., self-verification, self-correction) via Reinforcement Learning (RL) with verifiable rewards, particularly on challenging text-based tasks. This paper explores the transference and enhancement of such reasoning abilities within the vision-language domain.

Methodology: Iterative SFT and RL

The core methodology revolves around an iterative cycle combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). This process aims to progressively enhance the model's multimodal reasoning skills.

  1. Initial Reasoning Distillation: The process begins by distilling reasoning capabilities from proficient text-only LLMs (specifically, models related to DeepSeek-R1). This is achieved by prompting these text models to generate reasoning steps based on high-quality captions derived from images sourced from diverse visual datasets. This step effectively creates an initial dataset pairing visual context (via captions) with structured textual reasoning paths.
  2. Supervised Fine-Tuning (SFT): The LVLM is initially fine-tuned on this distilled dataset. This SFT phase aims to imbue the base LVLM with the foundational reasoning patterns observed in the text-only teacher models, adapted to a multimodal context represented by image captions.
  3. Reinforcement Learning (RL) Enhancement: Following SFT, the model undergoes RL training. This phase uses techniques analogous to those employed for DeepSeek-R1, i.e., RL with verifiable rewards adapted to multimodal tasks, where the reward signal can come from verifiable outcomes (e.g., correctness of the final answer in a math problem) or potentially from process supervision. The goal is to refine the model's ability to generate valid, coherent, and correct reasoning steps beyond what was learned through simple imitation in SFT.
  4. Iterative Refinement: The key innovation lies in the iterative application of these steps. The model improved by the RL phase, M_{RL_i}, is used to generate a new, potentially higher-quality SFT dataset for the subsequent iteration i+1. This involves using M_{RL_i} to generate refined reasoning chains for the training data. The next iteration then performs SFT on this new dataset (SFT_{i+1}) followed by another round of RL (RL_{i+1}). This cycle (SFT → RL → data generation) allows the model to progressively bootstrap its own reasoning capabilities.

This iterative self-improvement loop can be represented as:

Initialize LVLM M_0
For iteration i = 1 to N:
  # Generate SFT data using previous model (M_{i-1}) or distilled data (for i=1)
  SFT_Data_i = GenerateReasoningData(M_{i-1}, ImageCaptions)
  # Supervised Fine-Tuning
  M_{SFT_i} = SFT(M_{i-1}, SFT_Data_i)
  # Reinforcement Learning
  M_{RL_i} = RL(M_{SFT_i}, RewardSignal)
  # Update model for next iteration
  M_i = M_{RL_i}
Output: OpenVLThinker = M_N
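
As a concrete illustration of the GenerateReasoningData step, the following sketch shows one plausible way caption-grounded reasoning traces could be distilled from a text-only reasoning model and filtered by answer correctness. The function names, prompt format, and correctness filter are assumptions made for illustration, not the authors' exact pipeline.

# Hypothetical sketch of GenerateReasoningData (Python): prompt a text-only reasoner with
# image captions and keep only traces whose final answer matches the ground truth.
def generate_reasoning_data(text_reasoner, examples, extract_answer):
    """Build SFT records pairing each image/question with a distilled reasoning trace."""
    sft_records = []
    for ex in examples:
        prompt = (
            "Image description: " + ex["caption"] + "\n"
            "Question: " + ex["question"] + "\n"
            "Think step by step, then state the final answer."
        )
        trace = text_reasoner(prompt)              # chain-of-thought from the text-only model
        if extract_answer(trace) == ex["answer"]:  # verifiable filter: keep correct traces only
            sft_records.append({"image": ex["image"],
                                "question": ex["question"],
                                "target": trace})
    return sft_records

In later iterations, the same routine would use the previous round's RL-improved model M_{i-1} (reasoning over the images directly) in place of the text-only reasoner, as in the loop above.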

Implementation and Training Details

While the specific base LVLM architecture is not detailed in the abstract, the process implies starting with a pre-trained LVLM. The training leverages diverse visual datasets for generating the initial captions used in the distillation phase. The iterative process relies on generating lightweight SFT datasets in each round, suggesting that the scale of generated data per iteration might be manageable. The RL phase requires a mechanism for providing verifiable rewards, which, in the context of tasks like visual math problems, could involve checking the final numerical answer or potentially comparing generated reasoning steps against ground truth solutions or preferences. The target benchmarks – MathVista, MathVerse, and MathVision – indicate a focus on quantitative and structured reasoning involving visual elements, necessitating capabilities beyond standard VQA or captioning.
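
To make the reward mechanism concrete, below is a minimal sketch of a verifiable, answer-matching outcome reward of the kind described above. The parsing and normalization details are assumptions for illustration rather than the paper's implementation.

# Minimal sketch of a verifiable outcome reward for visual math problems (assumed details):
# reward 1.0 if the response's final answer matches the ground truth, else 0.0.
import re

def extract_final_answer(response: str) -> str:
    """Return the content of the last \\boxed{...}, or the last number-like token."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", response)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary outcome reward based on final-answer correctness."""
    pred = extract_final_answer(response)
    gold = ground_truth.strip()
    try:
        return 1.0 if float(pred) == float(gold) else 0.0   # numeric comparison when possible
    except ValueError:
        return 1.0 if pred.lower() == gold.lower() else 0.0  # otherwise exact string match

Such an outcome-level reward fits the benchmarks named above, where a final numerical or short-form answer can be checked automatically.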

Experimental Results and Evaluation

The primary outcome reported is the consistently improved reasoning performance of OpenVLThinker across the challenging multimodal reasoning benchmarks: MathVista, MathVerse, and MathVision. These benchmarks typically require multi-step logical deduction, mathematical calculation, and grounding of abstract concepts in visual information. The success on these benchmarks validates the effectiveness of the proposed iterative SFT+RL strategy for enhancing complex reasoning in LVLMs. The paper positions OpenVLThinker as demonstrating the potential of this self-improvement methodology for developing more robust vision-language reasoning systems. The magnitude of the improvements or specific scores on these benchmarks are not detailed in the abstract but are presumably available in the full paper.

Practical Implications and Availability

The practical implication of this research is a potential pathway for developing LVLMs with significantly enhanced abilities to tackle complex tasks requiring step-by-step reasoning grounded in visual input. This includes applications in:

  • Educational Tools: Solving visual math or physics problems.
  • Data Analysis: Interpreting charts and diagrams that require calculations or logical inference.
  • Instruction Following: Executing complex instructions involving objects and relations depicted in an image.
  • Accessibility: Describing complex visual scenes or documents in a structured, logical manner.

The iterative nature of the training, while powerful, likely entails significant computational cost, especially the RL phase and the repeated data-generation/SFT cycles. Scaling this approach to larger models or datasets would require substantial resources. Practical considerations for implementation include the reliance on high-quality captions for the initial distillation and on well-defined reward functions for RL.

The authors have made the code, model, and data available, facilitating reproducibility and further research in this direction: https://github.com/yihedeng9/OpenVLThinker.

Conclusion

OpenVLThinker represents an investigation into enhancing complex reasoning in LVLMs by adapting and extending techniques proven successful in text-only LLMs. The proposed iterative self-improvement cycle, alternating between SFT on model-generated data and RL refinement using verifiable rewards, demonstrates a promising approach. The reported performance gains on mathematically oriented visual reasoning benchmarks suggest that this methodology can effectively instill more sophisticated, step-by-step reasoning capabilities into vision-language models.