RewardMap: Multi-Stage RL for Visual Reasoning
- RewardMap is a multi-stage reinforcement learning framework that overcomes sparse rewards by integrating dense VQA tasks, difficulty-aware rewards, and a structured curriculum.
- It leverages the ReasonMap-Plus dataset with high-resolution transit maps and detailed annotations to provide granular supervision and improve visual understanding.
- Empirical evaluations show significant gains, including accuracy improvements of up to 31.7% on ReasonMap-Plus, alongside consistent gains on spatial and visual reasoning benchmarks.
RewardMap is a multi-stage reinforcement learning framework designed to address the problem of sparse rewards in fine-grained visual reasoning tasks for multimodal LLMs (MLLMs), particularly in challenging spatial and structured environments such as high-resolution transit maps. Its key innovations include the integration of a difficulty-aware reward function and a curriculum-based multi-stage RL scheme, supported by the ReasonMap-Plus dataset that provides dense Visual Question Answering (VQA) supervision for effective cold-start training. Empirical evaluations demonstrate consistent and significant improvements in visual understanding and reasoning across a wide set of benchmarks.
1. Motivation and Problem Setting
Fine-grained visual reasoning on structured visual artifacts (e.g., transit maps) remains challenging for advanced MLLMs. When reinforcement learning is applied to such tasks, reward signals are typically sparse: feedback is usually provided only upon successful completion of a multi-hop visual reasoning chain, which results in unstable optimization, hindered exploration, and slow or unreliable skill acquisition. The standard Supervised Fine-Tuning (SFT) pipeline, while providing dense supervision, falls short when scaled to complex, long-chain reasoning, leading to poor performance on spatial reasoning and decision-making that require detailed perception of visual structure.
RewardMap is designed to resolve these limitations by (1) augmenting the available supervision signal through dense, detail-level VQA tasks; (2) introducing a structured, difficulty-aware reward function that delivers partial credit for correct sub-decisions; and (3) organizing the training into a curriculum spanning from easy perceptual to difficult planning tasks.
2. ReasonMap-Plus: Enabling Dense Supervision
RewardMap is enabled by ReasonMap-Plus, an extension of the ReasonMap benchmark, which systematically increases the density and granularity of reward signals available during model training:
- Data Construction: ReasonMap-Plus uses high-resolution transit maps from 30 cities across 13 countries, inheriting ReasonMap's detailed line and stop annotations.
- Question–Answer Generation: Beyond the complex planning questions of ReasonMap, it adds five types of VQA tasks, including Global Counting, Local Counting, and True/False queries, generated via rule-based templates over the map annotations; answers are derived automatically from the underlying structured data.
- Quality Assurance: All auto-generated QA pairs undergo human review to confirm diversity and difficulty calibration, with each map labeled as easy, medium, or hard.
This design yields a data continuum spanning from simple perception to multi-hop planning, providing denser and more informative rewards that are especially critical during the initial RL cold-start phase.
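To make the rule-based generation concrete, the following is a minimal sketch of how counting and True/False questions could be templated from structured map annotations. The annotation schema (a map as a mapping from line names to ordered stop lists) and the helper names are illustrative assumptions, not the released ReasonMap-Plus tooling.

```python
import random

# Illustrative annotation schema: line name -> ordered list of stops (assumed, for demonstration only).
transit_map = {
    "Line 1": ["Central", "Museum", "Riverside", "Airport"],
    "Line 2": ["Harbor", "Museum", "Stadium"],
}

def global_counting_question(tmap):
    """Global Counting: a property of the whole map."""
    return "How many transit lines does this map contain?", str(len(tmap))

def local_counting_question(tmap):
    """Local Counting: a property of a single line."""
    line, stops = random.choice(list(tmap.items()))
    return f"How many stops are on {line}?", str(len(stops))

def true_false_question(tmap):
    """True/False: verify a (stop, line) membership claim."""
    line, stops = random.choice(list(tmap.items()))
    other_stops = sorted({s for st in tmap.values() for s in st} - set(stops))
    stop = random.choice(stops if random.random() < 0.5 or not other_stops else other_stops)
    return f"Is {stop} a stop on {line}?", "True" if stop in stops else "False"

for generate in (global_counting_question, local_counting_question, true_false_question):
    question, answer = generate(transit_map)
    print(f"Q: {question}\nA: {answer}")
```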
3. Difficulty-Aware Reward Function
RewardMap’s reward function is explicitly constructed to address the sparsity and granularity of feedback:
- Reward Terms:
- $R_{\text{format}}$: rewards correct answer formatting.
- $R_{\text{correct}}$: rewards exact correctness of the final answer (e.g., exact match).
- $R_{\text{detail}}$: rewards fine-grained partial successes, such as correct identification of the origin, destination, line names, route segments, or intermediate stops.
- Difficulty Scaling: Rewards are scaled by a composite difficulty weight $w_{\text{diff}}$ that accounts for both map complexity and question/task difficulty. Concretely:

$$R = w_{\text{diff}} \cdot \left( R_{\text{format}} + R_{\text{correct}} + \alpha \, R_{\text{detail}} \right),$$

where $\alpha$ controls the emphasis on partial credit (set to a fixed value in the experiments).
This structure enables the reward signal to provide rich, graded support for partial progress, directly mitigating sparsity and stabilizing RL optimization.
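Below is a minimal sketch of a difficulty-scaled reward in this spirit. The additive form of the difficulty weight, the difficulty encoding, and the value of `alpha` are assumptions for illustration; only the qualitative structure is specified above.

```python
def difficulty_aware_reward(r_format, r_correct, detail_hits, detail_total,
                            map_difficulty, question_difficulty, alpha=0.5):
    """Sketch: combine format, correctness, and detail rewards under a difficulty weight.

    r_format, r_correct: 0/1 indicators for well-formed output and an exact-match answer.
    detail_hits, detail_total: how many fine-grained sub-decisions (origin, destination,
        line names, route segments, intermediate stops) match the ground truth.
    map_difficulty, question_difficulty: per-sample difficulty levels, e.g. 0/1/2 for
        easy/medium/hard (illustrative encoding).
    alpha: weight on partial credit (illustrative value, not the paper's setting).
    """
    r_detail = detail_hits / max(detail_total, 1)          # graded partial credit in [0, 1]
    w_diff = 1.0 + map_difficulty + question_difficulty    # assumed additive composite weight
    return w_diff * (r_format + r_correct + alpha * r_detail)

# Example: hard map, medium question, well-formatted output, wrong final answer,
# but 3 of 5 sub-decisions identified correctly: 4.0 * (1 + 0 + 0.5 * 0.6) = 5.2
print(difficulty_aware_reward(1, 0, 3, 5, map_difficulty=2, question_difficulty=1))
```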
4. Multi-Stage Reinforcement Learning and Curriculum
The training methodology of RewardMap is a multi-stage reinforcement learning pipeline with curriculum progression:
- Stage Segmentation: Training begins with dense, low-difficulty tasks from ReasonMap-Plus (perception, counting, binary decisions). As performance stabilizes, more challenging planning tasks from the original ReasonMap are gradually introduced.
- Local Stochasticity: Within each stage, a local data shuffle randomizes sample order to avoid curriculum overfitting.
- Policy Optimization: Uses Group Relative Policy Optimization (GRPO). For an input $x$ and a group of $G$ sampled answers $\{y_i\}_{i=1}^{G}$ with corresponding rewards $\{r_i\}_{i=1}^{G}$, the group-centered advantage is:

$$A_i = \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)},$$

and the RL loss is the clipped surrogate objective:

$$\mathcal{L}(\theta) = -\,\frac{1}{G} \sum_{i=1}^{G} \min\!\left( \rho_i A_i,\ \operatorname{clip}\!\left(\rho_i,\, 1-\epsilon,\, 1+\epsilon\right) A_i \right), \qquad \rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}.$$
This variance-reduced advantage encoding helps stabilize learning even under reward sparsity.
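The update can be sketched with PyTorch tensors as follows. This is a generic GRPO step for a single prompt, not the authors' training code; the clipping threshold is a standard PPO-style ingredient included here as an assumption.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Generic GRPO objective for one prompt with a group of G sampled responses.

    logp_new, logp_old: summed log-probabilities of each response under the current
        and behavior policies, shape (G,).
    rewards: scalar reward per response (e.g., the difficulty-aware reward), shape (G,).
    """
    # Group-centered, variance-normalized advantages.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate with per-response importance ratios.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv, torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    return -surrogate.mean()

# Example: a group of 4 responses, one with a high partial-credit reward.
rewards = torch.tensor([5.2, 0.0, 1.0, 0.0])
logp_old = torch.tensor([-12.0, -15.0, -13.5, -14.0])
logp_new = logp_old + torch.tensor([0.10, -0.05, 0.02, 0.00])   # small policy shift
print(grpo_loss(logp_new, logp_old, rewards))
```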
The multi-stage curriculum ensures that models acquire basic perceptual skills under dense supervision before being confronted with the sparse and high-complexity rewards characteristic of structured planning tasks.
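A stage scheduler with local shuffling, in the spirit of the curriculum described above, could look like the sketch below. The task names, difficulty labels, and stage boundaries are illustrative assumptions rather than the paper's exact schedule.

```python
import random

def curriculum_stream(reasonmap_plus, reasonmap, seed=0):
    """Yield training samples stage by stage: dense, easy VQA first, sparse planning last.

    reasonmap_plus: list of dicts with a "task" key (e.g. "true_false", "local_counting",
        "global_counting") and ReasonMap-Plus annotations.
    reasonmap: list of planning samples from the original benchmark, each with a
        "difficulty" label ("easy" / "medium" / "hard").
    """
    rng = random.Random(seed)
    stages = [
        [s for s in reasonmap_plus if s["task"] in ("true_false", "local_counting")],  # perception
        [s for s in reasonmap_plus if s["task"] == "global_counting"],                 # harder VQA
        [s for s in reasonmap if s["difficulty"] in ("easy", "medium")],               # planning
        [s for s in reasonmap if s["difficulty"] == "hard"],                           # hard planning
    ]
    for stage in stages:
        rng.shuffle(stage)   # local shuffle within each stage to avoid curriculum overfitting
        yield from stage
```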
5. Empirical Evaluation and Results
RewardMap is empirically evaluated on ReasonMap, ReasonMap-Plus, and six additional fine-grained spatial and visual reasoning benchmarks.
- ReasonMap: RewardMap-trained models outperform open-source competitors (including Qwen2.5-VL-72B-Instruct) and approach the performance of closed-source Seed1.5-VL.
- ReasonMap-Plus: Models achieve weighted-accuracy improvements of up to 31.7%, with the detail reward and curriculum-based RL components contributing significant gains over various SFT+RL baselines.
- Generalization: Across six external benchmarks (SEED-Bench-2-Plus, SpatialEval, V*Bench, HRBench, ChartQA, MMStar), RewardMap yields an average improvement of 3.47%; the largest margin is observed on SpatialEval (13.51%).
- Ablation: Each component—detail reward, curriculum, and dense RL at cold-start—contributes to improved stability and accuracy.
These results validate both the dense, detail-aware reward and the curriculum design: RL trajectories in RewardMap maintain higher, lower-variance advantage signals even in late training stages, when only sparse rewards are available.
6. Implications and Broader Context
RewardMap’s approach demonstrates that fine-grained visual reasoning with MLLMs can be substantially improved by engineering the reward structure and training schedule. In particular, using partial-credit and difficulty-scaled rewards, combined with curriculum learning, leads to more consistent skill acquisition, especially in domains where dense supervision is rarely available for the most complex tasks.
A plausible implication is that this framework can serve as a blueprint for future MLLM training paradigms in other structured visual or multimodal domains—such as chart parsing, document layout understanding, or complex navigational planning—where sparse high-level rewards and complex reasoning chains are endemic.
The success of RewardMap further supports a broader shift in multimodal RL research: significant performance gains in complex visual reasoning can be achieved not only through model size or backbone innovations, but critically through careful reward engineering and staged learning protocols.
7. Conclusions
RewardMap is a principled multi-stage reinforcement learning framework that effectively addresses sparse reward challenges in fine-grained visual reasoning for MLLMs. By integrating a difficulty-aware, detail-focused reward design with curriculum-based multi-stage RL—leveraging the ReasonMap-Plus dataset—it enables improved learning of both basic perceptual and advanced reasoning capabilities. Demonstrated improvements generalize across a spectrum of visual benchmarks, indicating its robustness and broad applicability in research and real-world multimodal reasoning tasks (Feng et al., 2 Oct 2025).