R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning (2508.21113v1)

Published 28 Aug 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Multimodal LLMs (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.

Summary

  • The paper introduces R-4B, which employs an adaptive auto-thinking mechanism that dynamically switches between reasoning and direct response modes to reduce computational overhead.
  • It uses bi-mode annealing and reinforcement learning to train on diverse multimodal data, achieving state-of-the-art results on benchmarks like MMMU, MMStar, and MathVerse-Vision.
  • Empirical evaluations show that R-4B-RL outperforms comparable 4B-scale models by balancing token efficiency with enhanced reasoning for complex visual and text tasks.

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforcement Learning

Introduction

The R-4B model addresses a critical challenge in Multimodal LLMs (MLLMs): the inefficiency of always-on step-by-step reasoning for queries that do not require complex deduction. While explicit reasoning blocks have improved performance on tasks demanding deep inference, they introduce unnecessary computational overhead for simple queries. R-4B introduces an adaptive auto-thinking mechanism, enabling the model to dynamically select between reasoning and direct response modes based on input complexity. This essay provides a technical analysis of the R-4B architecture, its bi-mode annealing and policy optimization strategies, and its empirical performance across multimodal benchmarks.

Model Architecture and Pre-Training

R-4B is constructed with a SigLIP2-So400m visual encoder, an MLP projector for modality alignment, and a Qwen3-4B LLM backbone. The pre-training pipeline consists of three stages:

  1. MLP Warmup: The MLP projector is trained on image-caption pairs with frozen ViT and LLM, establishing initial cross-modal alignment.
  2. Vision-Language Alignment: The ViT is unfrozen and trained with diverse multimodal data, enhancing visual domain generalization.
  3. Joint Multimodal Pre-Training: All components are jointly optimized on 145B tokens spanning OCR, grounding, mathematical reasoning, and structured data. Non-thinking loss masking is applied to preserve the LLM's reasoning capabilities.

This staged approach ensures robust multimodal understanding and efficient integration of visual and textual modalities.
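
To make the component layout concrete, the following PyTorch-style sketch composes a vision encoder, an MLP projector, and a language backbone, and toggles which parameters are trainable per stage. The class names, the two-layer projector, and the set_stage helper are illustrative assumptions, not the official R-4B implementation.

```python
# Hypothetical sketch of the R-4B component layout and staged training schedule.
# The encoder/LLM modules are passed in; names and dimensions are illustrative.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Small MLP that maps vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.net(vision_feats)

class AutoThinkingMLLM(nn.Module):
    """Vision encoder -> projector -> LLM backbone, mirroring the staged pipeline."""
    def __init__(self, vision_encoder: nn.Module, projector: MLPProjector, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm

    def set_stage(self, stage: int) -> None:
        # Stage 1: MLP warmup (ViT and LLM frozen); stage 2: ViT unfrozen;
        # stage 3: joint multimodal pre-training (all components trainable).
        for p in self.vision_encoder.parameters():
            p.requires_grad = stage >= 2
        for p in self.llm.parameters():
            p.requires_grad = stage >= 3
        for p in self.projector.parameters():
            p.requires_grad = True

# Example wiring with stand-in modules; real SigLIP2/Qwen3 weights would be loaded here.
model = AutoThinkingMLLM(nn.Identity(), MLPProjector(1152, 2560), nn.Identity())  # example dims
model.set_stage(1)
```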

Bi-Mode Annealing: Data Curation and Training

The bi-mode annealing process is central to R-4B's dual-mode capability. Data is partitioned into reasoning-intensive and non-reasoning samples using a heuristic-driven strategy:

  • Difficulty-Based Heuristic: For subjective queries, a strong MLLM annotator (Qwen2.5-VL-32B) is prompted to assess whether reasoning is required.
  • Performance-Based Heuristic: For objective queries, offline hard mining flags samples that the model consistently fails to answer correctly and labels them as reasoning-intensive.

Both data types are formatted with a unified instruction-following structure, using explicit thinking tags for structural consistency. The annealing stage mixes these datasets, producing R-4B-Base, which is proficient in both reasoning and direct answering across general domains.
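
As an illustration of how both modes can share one template, the sketch below formats a thinking and a non-thinking sample with an explicit thinking block; the <think> tag convention and the prompt template shown here are assumptions, and the exact format used by R-4B may differ.

```python
# Illustrative bi-mode sample formatting (assumed tag convention and template).
def format_sample(question: str, answer: str, reasoning: str | None = None) -> str:
    """Wrap a sample in a unified instruction-following template.

    Thinking-mode samples carry their reasoning inside <think>...</think>;
    non-thinking samples keep the block empty so both modes share one structure.
    """
    think_block = f"<think>\n{reasoning}\n</think>" if reasoning else "<think>\n\n</think>"
    return f"User: {question}\nAssistant: {think_block}\n{answer}"

# One reasoning-intensive and one direct-answer example.
print(format_sample("What is 17 * 24?", "408",
                    reasoning="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408"))
print(format_sample("What color is the sky in the photo?", "Blue."))
```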

Bi-Mode Policy Optimization (BPO): Reinforcement Learning for Auto-Thinking

Despite dual-mode training, R-4B-Base exhibits a bias toward non-thinking responses even for complex queries, a phenomenon termed "thinking atrophy." To address this, R-4B employs Bi-mode Policy Optimization (BPO), a reinforcement learning stage built on three components:

  • Bi-Mode Rollouts: For each query, the policy is forced to generate both a thinking and a non-thinking response, conditioned on special mode tokens.
  • Reward Signal: A simple, rule-based reward derived from mathematical tasks is used, avoiding complex reward engineering and manual annotation.
  • Policy Objective: The BPO objective maximizes the expected advantage of the better-performing mode for each query, with KL regularization to maintain policy stability (a sketch of this objective appears below).
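
A rough sketch of this objective, under the assumption that BPO retains the standard GRPO-style clipped surrogate and simply averages it over rollout groups forced from both modes, is:

$$
\mathcal{J}_{\mathrm{BPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D}}\, \mathbb{E}_{m \in \{\text{think},\, \text{non-think}\}}\, \mathbb{E}_{\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q, m)} \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \min\!\Big( r_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(r_i(\theta), 1-\epsilon, 1+\epsilon\big)\,\hat{A}_i \Big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big) \Bigg],
$$

where $r_i(\theta) = \pi_\theta(o_i \mid q, m) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q, m)$ and $\hat{A}_i$ is the group-normalized advantage. The outer average over the two forced modes is the assumed bi-mode ingredient; the clipping range, group size $G$, and KL coefficient $\beta$ follow the usual GRPO conventions.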

This approach prevents mode collapse and enables the model to learn an adaptive policy for mode selection, resulting in R-4B-RL with enhanced auto-thinking capabilities.
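
To make the rollout scheme concrete, the following Python sketch shows one way bi-mode rollouts and a rule-based reward could be wired together. The policy.generate interface, the "Answer:" extraction convention, and the helper names are illustrative assumptions, not the paper's actual training code.

```python
# Illustrative bi-mode rollout loop for BPO-style training.
# Hypothetical interfaces: `policy.generate(query, mode=...)` and the
# "Answer:" extraction convention are assumptions, not the R-4B codebase.
from dataclasses import dataclass

@dataclass
class Rollout:
    mode: str        # "think" or "non-think"
    response: str
    reward: float

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Simple verifiable reward: 1.0 if the extracted final answer matches, else 0.0."""
    predicted = response.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

def bi_mode_rollouts(policy, query: str, gold_answer: str, samples_per_mode: int = 4):
    """Force the policy to answer the same query in both modes, so neither mode's
    rollouts are missing and the policy cannot collapse to a single mode."""
    rollouts = []
    for mode in ("think", "non-think"):
        for _ in range(samples_per_mode):
            response = policy.generate(query, mode=mode)  # assumed interface
            rollouts.append(Rollout(mode, response, rule_based_reward(response, gold_answer)))
    return rollouts

# Example: a trivial stand-in policy to exercise the loop.
class EchoPolicy:
    def generate(self, query: str, mode: str) -> str:
        prefix = "<think>reasoning...</think>\n" if mode == "think" else ""
        return prefix + "Answer: 42"

print(bi_mode_rollouts(EchoPolicy(), "What is 6 * 7?", "42", samples_per_mode=1))
```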

Empirical Evaluation

R-4B-RL is evaluated on 25 multimodal benchmarks, including MMMU, MMStar, MathVerse-Vision, LogicVista, and CharXIV. Key findings include:

  • General Visual QA: R-4B-RL achieves state-of-the-art scores on MMMU-val (68.1%) and MMStar (73.1%), outperforming comparable models and matching larger models like Kimi-VL-A3B-Thinking-2506 (16B).
  • Document and Chart Understanding: R-4B-RL leads on AI2D (86.2%) and CharXIV-RQ (56.8%), demonstrating superior reasoning over structured visual content.
  • Visual Perception and Counting: R-4B-Base sets the highest score on CountBench (92.6%), and R-4B-RL matches top performance on BLINK-val (56.3%).
  • Complex Reasoning: R-4B-RL leads on MathVerse-Vision (64.9%), OlympiadBench (49.6%), LogicVista (59.1%), and DynaMath (39.5%), consistently outperforming other 4B-scale models.

Token Efficiency

Auto-thinking mode in R-4B-RL achieves a favorable trade-off between performance and computational cost. For simple tasks (e.g., OCRBench), token output is comparable to non-thinking mode, while for complex reasoning tasks (e.g., MathVista), token output approaches that of full thinking mode, with corresponding performance gains.

Ablation and Analysis

Ablation studies confirm that mixed-data bi-mode annealing yields superior generalization and prevents catastrophic forgetting. BPO's learning dynamics show rapid adaptation of thinking trigger rates on reasoning benchmarks, with minimal increase on non-reasoning tasks. RL consistently improves both direct response and reasoning capabilities, with R-4B-RL outperforming R-4B-Base in all modes.

Vanilla GRPO suffers from a "thinking preference dilemma," where the policy collapses to a single mode. BPO's deterministic bi-mode rollouts and balanced reward structure effectively mitigate this issue, ensuring robust adaptive reasoning.

Practical and Theoretical Implications

R-4B demonstrates that content-aware auto-thinking can be efficiently realized in MLLMs without complex reward engineering or manual data annotation. The bi-mode annealing and BPO framework generalize across domains, enabling resource-efficient deployment in real-world applications where query complexity varies. The model's ability to match or surpass larger models in reasoning tasks with lower computational cost has direct implications for scalable multimodal AI systems.

Theoretically, R-4B's approach suggests that explicit dual-mode training combined with balanced policy optimization is sufficient to induce adaptive reasoning, challenging the necessity of intricate reward functions or domain-specific heuristics. The universality of the BPO reward signal across non-mathematical domains warrants further investigation into minimal supervision strategies for adaptive reasoning.

Future Directions

Potential future developments include:

  • Extending BPO to more granular reasoning modes (e.g., multi-step, chain-of-thought, or reflection).
  • Investigating transferability of the auto-thinking policy to unseen modalities and tasks.
  • Exploring lightweight reward signals for other forms of adaptive behavior (e.g., auto-grounding, auto-explanation).
  • Scaling the bi-mode annealing and BPO framework to larger models and more diverse datasets.

Conclusion

R-4B introduces a principled and efficient solution to the auto-thinking challenge in MLLMs, leveraging bi-mode annealing and reinforcement learning to achieve adaptive reasoning with minimal computational overhead. Its empirical performance establishes new standards for 4B-scale models, and its methodological innovations provide a foundation for future research in content-aware multimodal reasoning.
