Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning (2505.14677v1)

Published 20 May 2025 in cs.CV

Abstract: Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in LLMs, such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM -- by prompting the model to produce a reasoning chain before providing an answer -- can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to encourage the model to interpret images prior to reasoning. Therefore, we train the model to adhere to a caption-reason-answer output format: initially generating a detailed caption for an image, followed by constructing an extensive reasoning chain. When trained on 273K CoT-free visual question-answer pairs and using only reinforcement learning, our model, named Visionary-R1, outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro, on multiple visual reasoning benchmarks.

Summary

Overview of Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

The paper "Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning" presents an approach to training visual language models (VLMs) for reasoning tasks using reinforcement learning (RL). The focus is on shortcut learning, which occurs when a model exploits surface patterns in easy questions instead of genuinely reasoning about the visual input. This failure mode undermines the model's ability to generalize to unseen data distributions, a significant obstacle for applications that require robust general-purpose reasoning.

Main Contributions

The paper identifies a common failure mode in VLMs trained with conventional RL techniques such as GRPO: the model learns to answer simple questions correctly while bypassing image analysis, which impairs its reasoning on harder questions that demand detailed visual understanding. To counteract this, the authors propose Visionary-R1, a reinforcement learning framework built around a structured caption-reason-answer output format. The format requires the model to first generate a comprehensive caption interpreting the image, then construct a reasoning chain, and only then produce the final answer. Forcing the model to engage with the image content before reasoning systematically alleviates shortcut learning; a sketch of how such a format constraint might be enforced follows.
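To make the constraint concrete, the sketch below checks whether a model completion follows a caption-reason-answer template, the kind of check a format reward could build on. The tag names (<caption>, <think>, <answer>) are illustrative assumptions, not necessarily the paper's exact delimiters.

```python
import re

# Hypothetical delimiters for the caption-reason-answer format; the paper
# prescribes this three-stage structure, but the exact tags are assumed here.
FORMAT_PATTERN = re.compile(
    r"^<caption>(.+?)</caption>\s*"
    r"<think>(.+?)</think>\s*"
    r"<answer>(.+?)</answer>$",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows caption -> reason -> answer, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

def split_sections(completion: str):
    """Extract (caption, reasoning, answer) strings, or None if malformed."""
    m = FORMAT_PATTERN.match(completion.strip())
    return m.groups() if m else None

out = ("<caption>A bar chart showing monthly sales.</caption>"
       "<think>The tallest bar corresponds to March.</think>"
       "<answer>March</answer>")
print(format_reward(out))  # 1.0
```

Because the reward is zero whenever any stage is missing, the model cannot collect reward by jumping straight to an answer, which is exactly the shortcut the format is designed to block.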

Methodology

Visionary-R1 employs a reward mechanism that combines three signals: an accuracy reward, a format reward, and a caption reward. The caption reward uses reinforcement learning from AI feedback, evaluating a caption's informativeness by checking whether the caption alone, without the image, provides enough information to answer the question. These rewards drive a policy-optimization process based on Group Relative Policy Optimization (GRPO), adapted to reinforce stepwise reasoning rather than shortcut solutions. A further ingredient is a cosine annealing schedule for the KL-divergence penalty, which stabilizes training while encouraging the generation of longer reasoning sequences; a sketch of these components appears below.
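The sketch below illustrates these moving parts under stated assumptions: the reward weights, the KL coefficient range, and the `judge` callable are placeholders rather than the paper's reported values; only the group-relative advantage normalization follows the standard GRPO recipe.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Standard GRPO step: normalize rewards within the group of completions
    sampled for the same prompt to obtain per-sample advantages."""
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]

def caption_reward(caption, question, judge):
    """RLAIF-style signal: a judge model scores whether the caption alone
    suffices to answer the question. `judge` is a hypothetical callable
    returning a score in [0, 1]."""
    return judge(caption=caption, question=question)

def total_reward(acc_r, fmt_r, cap_r, w_acc=1.0, w_fmt=0.5, w_cap=0.5):
    """Weighted sum of the three reward signals; the weights are illustrative."""
    return w_acc * acc_r + w_fmt * fmt_r + w_cap * cap_r

def kl_coef(step, total_steps, beta_max=0.04, beta_min=0.0):
    """Cosine-annealed KL penalty coefficient: decays from beta_max toward
    beta_min, gradually loosening the pull toward the reference policy so
    longer reasoning chains can emerge later in training (beta values assumed)."""
    progress = min(step / total_steps, 1.0)
    return beta_min + 0.5 * (beta_max - beta_min) * (1.0 + math.cos(math.pi * progress))
```

Decaying the KL coefficient is a natural fit here: a strong penalty early on keeps the policy close to the pre-trained reference while rewards are noisy, and relaxing it later gives the policy room to lengthen its captions and reasoning chains.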

Experimental Results

The authors evaluate Visionary-R1 on diverse visual reasoning benchmarks, including MathVista, MathVision, MMStar, and MMBench. The results indicate that Visionary-R1 surpasses supervised fine-tuned models and existing RL-trained alternatives, and on some benchmarks it also outperforms high-profile proprietary models such as GPT-4o and Claude3.5-Sonnet. These benchmarks span a range of visual formats, underscoring the benefit of caption-integrated RL training for generalizable multimodal reasoning.

Implications and Future Directions

The findings provide compelling evidence that enforcing structured image analysis before reasoning mitigates shortcut learning. Practically, this approach holds promise for applications such as automated scene understanding, scientific analysis, and document intelligence. Theoretically, Visionary-R1 invites discussion of how structured input processing can alleviate limitations inherent in RL frameworks for complex AI tasks.

Future research could scale the Visionary-R1 approach to larger models and datasets, potentially extending its effectiveness to a broader range of reasoning challenges. The work also opens avenues for tailoring RL reward design to other modalities, broadening the applicability of such reasoning frameworks.

In summary, Visionary-R1 advances visual reasoning by introducing a caption-driven reinforcement learning framework that mitigates shortcut learning and improves the generalization of multimodal reasoning across complex visual tasks.