
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models (2505.16854v2)

Published 22 May 2025 in cs.AI and cs.CV

Abstract: Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties under both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at https://github.com/kokolerk/TON.

Summary

  • The paper introduces TON, a two-stage framework that trains vision-language models to selectively engage in reasoning, leading to improved efficiency.
  • The SFT stage employs thought dropout to mimic human selective cognition by randomly replacing detailed reasoning traces with empty thoughts while preserving correct answers.
  • The RL stage, using GRPO, refines the model's policy by rewarding succinct responses, significantly reducing output length without compromising accuracy.

This paper, "Think or Not? Selective Reasoning via Reinforcement Learning for Vision-LLMs" (2505.16854), addresses the inefficiency of existing vision-LLMs (VLMs) that use reinforcement learning (RL) to enhance reasoning. Methods like Group Relative Policy Optimization (GRPO) often require generating full reasoning traces before answering, leading to high computational costs and long completion lengths, even for simple questions.

Inspired by human cognitive behavior where thinking effort is adjusted based on task difficulty, the authors propose to enable VLMs to decide when reasoning is necessary, not just how to reason. They present empirical evidence showing that a significant percentage of questions can be answered correctly even without explicit reasoning, yet standard prompting doesn't effectively teach models to skip thoughts.

To achieve this selective reasoning, the authors introduce TON (Think-or-Not), a two-stage training framework:

  1. Supervised Fine-Tuning (SFT) Stage with Thought Dropout: The initial stage involves fine-tuning the VLM on instruction data, typically formatted with explicit thinking steps (<think> reasoning </think><answer> answer </answer>). To introduce the concept of skipping thoughts, TON applies a "thought dropout" operation during SFT. This involves randomly replacing the high-quality reasoning traces with an empty thought placeholder (<think>\n\n</think>), while keeping the correct answer. This explicitly trains the model to follow a format that includes the possibility of an empty thought, serving as a cold start for selective reasoning. To obtain the high-quality thought data for SFT without relying on external models, the authors use a "reverse thinking" strategy, prompting the base VLM with the image, question, and ground-truth answer to generate the intermediate thought process.

import random

def thought_dropout(thought, dropout_prob):
    # Randomly replace the reasoning trace with an empty thought ("\n\n").
    if random.random() < dropout_prob:
        thought = "\n\n"
    return thought
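
To make the think-or-not format concrete, the following is a minimal sketch of how dropout might be applied when assembling SFT targets; the build_sft_target helper and its arguments are illustrative assumptions, not the paper's released code:

def build_sft_target(reasoning_trace, answer, dropout_prob=0.5):
    # Apply thought dropout so some targets keep the full trace and others
    # collapse to an empty thought, yielding the think-or-not format.
    thought = thought_dropout(reasoning_trace, dropout_prob)
    return f"<think>{thought}</think><answer>{answer}</answer>"

# Example (hypothetical sample): roughly half the targets become
# "<think>\n\n</think><answer>3</answer>", the rest keep the reasoning.
target = build_sft_target("There are two red cubes and one red sphere, so 3.", "3")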

  2. Reinforcement Learning (RL) Stage via GRPO: After the SFT stage familiarizes the model with the "think or not" format, the second stage uses GRPO to refine the model's policy. GRPO samples multiple responses for a given image and query, evaluates them based on a reward function (primarily task outcome, like correct answer or action), and updates the policy to favor responses with higher relative advantages within the sampled group. TON leverages the ability, learned in the SFT stage, to generate responses with empty thoughts. This significantly increases the diversity of sampled responses compared to vanilla GRPO (which always produces full thoughts), allowing the model to explore and learn through RL when skipping thoughts is beneficial (e.g., leads to a higher reward).
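
As a rough illustration of the group-relative update (a minimal sketch under the simplifying assumption that only scalar rewards are needed, not the authors' implementation), GRPO normalizes each sampled completion's reward against its own group's statistics to obtain an advantage:

import statistics

def group_relative_advantages(rewards, eps=1e-6):
    # Advantage of each sample relative to its group:
    # A_i = (r_i - mean(rewards)) / (std(rewards) + eps)
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Example: 4 sampled completions of one query. An empty-thought completion
# that still answers correctly earns the same outcome reward as a verbose one,
# so its shorter behavior can be reinforced.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 1.0])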

The reward function combines format rewards (checking that the <think> and <answer> or <action> tags are present) with outcome rewards (r_o). Outcome rewards depend on the task and can be:

  • Discrete Matching (r_d): Binary reward (1 for correct, 0 for incorrect) for tasks with deterministic outputs like counting or mathematical answers.
  • Continuous Matching (r_c): Binary reward (1 if a predicted point falls within a ground-truth box or within a distance threshold of a ground-truth point, 0 otherwise) for tasks like spatial grounding in GUI navigation. The total outcome reward is r_o = r_d + r_c; a minimal sketch of this reward combination follows the list.
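
The sketch below shows one way these pieces could fit together; the tag-checking regex, the point/box representation, and the helper names are assumptions for illustration rather than the paper's exact reward implementation:

import re

def format_reward(completion):
    # 1 if the completion follows the <think>...</think><answer>...</answer>
    # (or <action>...</action>) format, else 0.
    pattern = r"<think>.*?</think>\s*<(answer|action)>.*?</\1>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def discrete_reward(pred, gt):
    # r_d: exact match for deterministic outputs (counts, math answers).
    return 1.0 if str(pred).strip() == str(gt).strip() else 0.0

def continuous_reward(point, box):
    # r_c: 1 if the predicted (x, y) point lies inside the ground-truth box.
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def total_reward(completion, pred, gt, point=None, box=None):
    # Format reward plus the task-dependent outcome reward r_o = r_d + r_c.
    r_o = discrete_reward(pred, gt)
    if point is not None and box is not None:
        r_o += continuous_reward(point, box)
    return format_reward(completion) + r_o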

The authors evaluate TON using Qwen2.5-VL-Instruct models (3B and 7B sizes) on diverse vision-language tasks: CLEVR/Super-CLEVR (counting), GeoQA (math reasoning), and AITZ (mobile agent navigation).

Key experimental findings include:

  • Performance and Efficiency (Q1): TON achieves substantial reductions in average completion length compared to vanilla GRPO, up to 90% on GeoQA (7B model), while maintaining or even improving performance. For example, on GeoQA 7B, TON increased accuracy by 17% while reducing length by 90%. On AITZ, TON reduced task-level output length from over 3K tokens to around 0.9K (70% saving) with comparable task accuracy. The reduced length also leads to shorter RL training times.
  • Skip Thought Ratio Analysis (Q2): During TON training, the ratio of generated samples with skipped thoughts increases progressively as the training reward improves. This demonstrates that the model learns to adaptively bypass unnecessary reasoning. Ablations with different initial thought dropout probabilities during SFT (20%, 50%, 80%) show that while all settings exhibit increasing skip rates, the rate of increase varies, suggesting dynamic optimization potential; a sketch of how the skip ratio can be measured follows this list.
  • Significance of SFT (Q3): Attempting to achieve selective reasoning purely through prompting (a "hybrid-thought" prompt encouraging skipping) without the SFT thought dropout stage was ineffective. Models trained only with the hybrid prompt and GRPO rarely generated skip-thought outputs, defaulting to verbose reasoning. This highlights the critical role of the SFT stage in explicitly introducing and reinforcing the format-following behavior necessary for selective reasoning.
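
A minimal sketch of measuring the skip-thought ratio over a batch of sampled completions, assuming an empty thought is rendered as a <think> block containing only whitespace (the exact string form is an assumption):

import re

def skip_thought_ratio(completions):
    # Fraction of completions whose <think> block is empty (whitespace only).
    def is_skipped(c):
        m = re.search(r"<think>(.*?)</think>", c, re.DOTALL)
        return m is not None and m.group(1).strip() == ""
    return sum(is_skipped(c) for c in completions) / max(len(completions), 1)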

Qualitative examples further illustrate TON's adaptive behavior, showing it outputs empty thoughts for simple CLEVR counting questions but engages in detailed reasoning for more complex ones involving occlusion, unlike GRPO which generates verbose thoughts for both. Similarly, in AITZ, TON skips thought steps when unnecessary, while GRPO always generates them.

The paper concludes that TON successfully trains VLMs to selectively reason, significantly improving efficiency without sacrificing performance. This capability is learned by combining format-following introduced via thought dropout in SFT with reward-guided exploration in GRPO. The findings suggest that teaching models when to think is a distinct, trainable skill crucial for developing more efficient and human-like AI systems.

Limitations include the evaluation being restricted to smaller (3B and 7B), open-source VLMs.

The broader impact lies in promoting more efficient VLM reasoning through RL, suggesting a path for flexibly injecting prior knowledge (like output formats) through SFT before RL, which could inspire future research directions in both multimodal AI and RL.
