ReasonMap-Plus: Visual Reasoning Benchmark
- ReasonMap-Plus is a comprehensive benchmark for fine-grained visual reasoning, built on a multimodal transit-map dataset with graded difficulty levels.
- It encompasses 4,018 annotated questions from high-resolution transit maps across 30 cities, supporting tasks from basic counting to multi-hop route planning.
- It underpins RewardMap, a multi-stage, curriculum-driven reinforcement learning framework with dense reward signals, yielding measurable improvements in spatial reasoning performance.
ReasonMap-Plus is an extended benchmark and data resource designed to advance fine-grained visual reasoning research, particularly in the context of multimodal LLMs (MLLMs) operating on structured transit maps. It builds on the original ReasonMap benchmark, offering a richer and denser set of supervision signals tailored to enable effective reinforcement learning and cold-start training for visual understanding and spatial reasoning in realistic, information-rich environments.
1. Dataset Construction and Organization
ReasonMap-Plus comprises high-resolution transit maps sourced from 30 cities across 13 countries, yielding a dataset of 4,018 questions that probe various facets of visual perception and spatial reasoning. Each map is manually annotated with one of three difficulty levels: easy, medium, or hard. Questions are generated to align with these difficulty strata, ensuring a graded challenge from basic perception to multi-hop route planning.
The dataset introduces five Visual Question Answering (VQA) question types, grouped into three categories:
- Global Counting: Counts total metro lines on a map.
- Local Counting: (i) Counts intermediate stops between two reference stops, (ii) Counts lines passing through a given stop.
- True/False: (i) Verifies spatial relations between stops, (ii) Determines if a stop is present on a given line.
Dense reward signals are achieved through an automatic question generation pipeline that utilizes underlying annotated “Metro Data.” This enables each question to yield intermediate correctness signals (format, route details, partial correctness) rather than only sparse binary success indications.
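To make the generation pipeline concrete, here is a minimal sketch of how typed questions might be derived from annotated metro data. The `MetroData` schema, field names, and question templates are illustrative assumptions, not the authors' actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Line:
    name: str
    stops: list[str]  # ordered stops along the line

@dataclass
class MetroData:
    city: str
    lines: list[Line]

def generate_questions(metro: MetroData) -> list[dict]:
    """Derive typed VQA questions directly from annotated metro data.

    Because answers are computed from ground-truth annotations, every
    question carries an exact label that can later back a dense reward.
    """
    questions = []
    # Global counting: total number of lines on the map.
    questions.append({
        "type": "global_counting",
        "text": f"How many metro lines are shown on the {metro.city} map?",
        "answer": len(metro.lines),
    })
    for line in metro.lines:
        # Local counting: intermediate stops between the two termini.
        questions.append({
            "type": "local_counting",
            "text": (f"How many intermediate stops lie between "
                     f"{line.stops[0]} and {line.stops[-1]} on {line.name}?"),
            "answer": max(len(line.stops) - 2, 0),
        })
        # True/False: membership of a stop on a given line.
        questions.append({
            "type": "true_false",
            "text": f"Is {line.stops[0]} a stop on {line.name}?",
            "answer": True,
        })
    return questions
```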
2. Dense Reward Signal Design
A distinguishing aspect of ReasonMap-Plus is its incorporation of dense supervision. Unlike traditional sparse-reward settings, where an agent receives feedback only on the final answer, ReasonMap-Plus employs incremental signals in VQA tasks. These signals span correctness of the output format (e.g., proper enclosure of the final answer in the required answer tags), correctness of the numerical or categorical answer, and partial correctness of individual route details (such as intermediate stops, transfers, and segment associations).
This approach enables intermediate rewards for each step or partial answer generated along a reasoning chain, which is essential for effective reinforcement learning in tasks that require multi-stage inference and planning.
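As an illustration of how such incremental signals can be computed for a single counting question, the following sketch scores format compliance and answer correctness separately; the `<answer>` tag convention and unit component values are assumptions made for this example.

```python
import re

def dense_reward(response: str, gold_answer: int) -> dict:
    """Score one model response on several axes rather than a single 0/1.

    Assumes (for illustration) that the final answer must be enclosed in
    <answer>...</answer> tags; the component values are placeholders.
    """
    rewards = {"format": 0.0, "answer": 0.0}
    match = re.search(r"<answer>\s*(-?\d+)\s*</answer>", response)
    if match:
        rewards["format"] = 1.0          # output follows the required format
        if int(match.group(1)) == gold_answer:
            rewards["answer"] = 1.0      # the extracted answer is correct
    return rewards

# Even a wrong answer in the right format earns partial signal:
print(dense_reward("I count them... <answer>12</answer>", gold_answer=14))
# {'format': 1.0, 'answer': 0.0}
```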
3. RewardMap: Multi-Stage Reinforcement Learning
RewardMap is a multi-stage reinforcement learning framework built on ReasonMap-Plus to mitigate sparse reward issues and enhance visual reasoning for MLLMs.
Difficulty-Aware Reward Function
The reward is structured as

$$R = R_{\text{format}} + w \cdot \left( R_{\text{correct}} + \alpha \cdot R_{\text{detail}} \right)$$

where:
- $R_{\text{format}}$: reward for correct output syntax.
- $R_{\text{correct}}$: reward for answer accuracy.
- $R_{\text{detail}}$: partial credit for route segments, stop identification, and transfers.
- $\alpha$: relative weighting of the detail reward (experimentally set to 0.5).
- $w$: difficulty-dependent weight, specified piecewise from map difficulty and the number of transfers.
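A minimal sketch of this composition, assuming the difficulty weight is looked up piecewise from map difficulty and transfer count; the specific weight values below are placeholders rather than the paper's:

```python
def difficulty_weight(map_difficulty: str, num_transfers: int) -> float:
    """Piecewise weight from map difficulty and transfers (placeholder values)."""
    base = {"easy": 1.0, "medium": 1.5, "hard": 2.0}[map_difficulty]
    return base + 0.5 * min(num_transfers, 2)  # harder routes weigh more

def total_reward(r_format: float, r_correct: float, r_detail: float,
                 map_difficulty: str, num_transfers: int,
                 alpha: float = 0.5) -> float:
    """Compose the difficulty-aware reward R = R_format + w*(R_correct + alpha*R_detail)."""
    w = difficulty_weight(map_difficulty, num_transfers)
    return r_format + w * (r_correct + alpha * r_detail)
```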
Multi-Stage Curriculum Design
Training is sequenced from simpler tasks (binary judgments, counting) to complex multi-step reasoning (route planning), driven by two principles (see the scheduling sketch after this list):
- Global Curriculum: Stages ordered by increasing complexity.
- Local Stochasticity: Tasks within a stage are randomly shuffled to reduce curriculum memorization.
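These two principles can be realized by a scheduler along the following lines, where stage ordering is fixed but examples within a stage are reshuffled; the stage names and grouping are illustrative:

```python
import random

# Global curriculum: stages in order of increasing reasoning complexity.
STAGES = ["true_false", "counting", "route_planning"]

def curriculum_batches(dataset_by_stage: dict[str, list], seed: int = 0):
    """Yield training examples stage by stage, shuffling within each stage.

    Global ordering enforces easy-to-hard progression; local shuffling
    prevents the model from memorizing a fixed task sequence.
    """
    rng = random.Random(seed)
    for stage in STAGES:                      # global curriculum
        examples = list(dataset_by_stage[stage])
        rng.shuffle(examples)                 # local stochasticity
        yield from examples
```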
Reinforcement learning optimization utilizes Group Relative Policy Optimization (GRPO). For a group of $G$ outputs $\{o_i\}_{i=1}^{G}$ with rewards $\{r_i\}_{i=1}^{G}$, the group-centered advantage is

$$\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)},$$

and the policy is updated by maximizing the clipped surrogate objective

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\left(\rho_i \hat{A}_i,\; \operatorname{clip}\left(\rho_i,\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right], \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}.$$

This group-relative normalization stabilizes learning in the presence of reward sparsity.
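A sketch of the group-centered advantage computation; the small epsilon is a common numerical safeguard in implementations, not part of the formula above:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Center and scale rewards within a group of sampled outputs.

    Each output's advantage is its reward relative to the group mean,
    normalized by the group standard deviation, so the policy gradient
    carries signal even when absolute rewards are sparse or skewed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to the same question, with dense rewards.
print(grpo_advantages(np.array([0.5, 2.0, 0.0, 3.5])))
```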
4. Detailed Reward Algorithm for Route Planning
ReasonMap-Plus employs a detail reward algorithm (Algorithm 1), which assigns partial credit for subtasks:
- +2 for correct departure or arrival stop.
- +4 for correct route name (zero transfers).
- +1 for correct segment transitions (arrival and departure matching).
- –5 if transfer count exceeds expected value.
The cumulative score is capped (typically at 10), providing a graded reward signal for partial correctness. This dense feedback is critical for training agents capable of multi-step spatial reasoning.
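Following the point values above, here is a minimal sketch of the partial-credit scorer; the route representation (a list of `(line, board_stop, alight_stop)` legs) is an assumption for illustration, not the paper's exact Algorithm 1:

```python
def detail_reward(pred_legs, gold_legs, cap: int = 10) -> int:
    """Partial credit for a predicted route, scored leg by leg.

    Each leg is a (line_name, board_stop, alight_stop) tuple. Point
    values follow the scheme described above; the cap bounds the total.
    """
    score = 0
    if pred_legs and gold_legs:
        if pred_legs[0][1] == gold_legs[0][1]:
            score += 2                       # correct departure stop
        if pred_legs[-1][2] == gold_legs[-1][2]:
            score += 2                       # correct arrival stop
        if len(gold_legs) == 1 and pred_legs[0][0] == gold_legs[0][0]:
            score += 4                       # correct line name, zero transfers
        # Segment transitions: the alight stop of one leg must match the
        # boarding stop of the next leg.
        for prev, nxt in zip(pred_legs, pred_legs[1:]):
            if prev[2] == nxt[1]:
                score += 1
        if len(pred_legs) - 1 > len(gold_legs) - 1:
            score -= 5                       # more transfers than expected
    return min(score, cap)
```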
5. Contributions to Fine-Grained Visual Reasoning
The integration of ReasonMap-Plus’s dense rewards with RewardMap’s curriculum-driven multi-stage RL yields consistent improvements in MLLM performance. The training regime ensures that models acquire basic visual perception skills before progressing to route planning and multi-hop reasoning.
Empirical results show average gains of 3.47% across six benchmarks, including SEED-Bench-2-Plus, SpatialEval, V*Bench, HRBench, ChartQA, and MMStar. These improvements reflect enhanced capabilities in visual understanding and complex reasoning beyond the original ReasonMap benchmark.
6. Benchmarking and Broader Impact
The ReasonMap-Plus framework has been evaluated on both its own expanded dataset and a range of external benchmarks focused on spatial and visual reasoning. Its curriculum learning strategy and adaptive reward design enable more stable, high-signal optimization compared to conventional supervised fine-tuning.
A plausible implication is that dense supervision and curriculum-driven RL frameworks represent essential future directions for enhancing MLLM performance in environments characterized by long-chain reasoning and sparse reward signals. The systematic approach in ReasonMap-Plus informs both dataset construction and algorithmic design for related domains in multimodal understanding and spatial cognition.
7. Relation to Other Approaches
The framework’s use of dense reward signals and structured RL schedules distinguishes it from previous work reliant solely on sparse final-answer feedback. By leveraging automatic question generation from underlying annotated data, ReasonMap-Plus ensures extensibility and robustness in training.
In summary, ReasonMap-Plus advances the state of fine-grained visual reasoning by constructing a structured, difficulty-aware dataset and by integrating a multi-stage RL framework that exploits dense reward signals. The observable improvements in benchmark performance and the detailed reward feedback mechanisms confirm its suitability for research and practical development in multimodal spatial cognition (Feng et al., 2 Oct 2025).