
M2-Reasoning-7B: Multimodal Abstract & Spatial Reasoning

Updated 18 July 2025
  • M2-Reasoning-7B is a multimodal language model designed to integrate text, image, and video data for state-of-the-art abstract and spatial reasoning.
  • It employs a two-phase training strategy with a curated dataset and dynamic multi-task optimization to enhance chain-of-thought synthesis and spatial understanding.
  • Benchmark results on tasks like CV-Bench and VSI-Bench demonstrate significant improvements in both logical deduction and spatial evaluation capabilities.

M2-Reasoning-7B is a multimodal LLM (MLLM) engineered to achieve state-of-the-art (SOTA) performance in both general abstract reasoning and dynamic spatial reasoning tasks. It is distinguished by its integration of a rigorously curated dataset combining text, image, and video-based reasoning trajectories, and by employing a dynamic multi-task optimization strategy that tailors rewards and training order to the requirements of diverse reasoning domains. The model advances capabilities in mathematical, logical, and perceptual reasoning, addressing longstanding challenges in the field by unifying chain-of-thought synthesis, structural answer validation, and granular spatial understanding within a single 7B-parameter framework (AI et al., 11 Jul 2025).

1. Architecture and Integration of Modalities

M2-Reasoning-7B is based on the Qwen2.5-7B transformer LLM, augmented with a native-resolution vision encoder. Unlike conventional designs, it omits the standard MLP projector between the vision encoder and the LLM, reducing integration overhead and promoting more direct fusion of visual and linguistic representations. The multimodal backbone allows for both text and high-resolution visual input (still images and video frames), enabling a broad range of reasoning types.

The architecture is optimized for joint training and inference over multimodal data. It is designed to process and reason about both static spatial scenes and dynamic object interactions, with special provisions for encoding and utilizing instance segmentation, scene depth, and spatiotemporal relationships extracted from video segments.
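As a rough illustration of what "omitting the MLP projector" means in practice, the sketch below assumes the vision encoder already emits features in the LLM's hidden dimension, so fusion reduces to concatenating visual tokens with text embeddings along the sequence axis. The hidden size and token counts are illustrative, not taken from the paper:

```python
import numpy as np

HIDDEN = 3584  # Qwen2.5-7B hidden size (illustrative)

def fuse_inputs(vision_feats: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Prepend visual tokens to the text sequence with no MLP projector:
    the vision encoder is assumed to output features already in the LLM's
    hidden dimension, so fusion is plain sequence-axis concatenation."""
    assert vision_feats.shape[-1] == text_embeds.shape[-1] == HIDDEN
    return np.concatenate([vision_feats, text_embeds], axis=0)

# 256 visual tokens followed by 32 text tokens -> one joint sequence
seq = fuse_inputs(np.zeros((256, HIDDEN)), np.zeros((32, HIDDEN)))
print(seq.shape)  # (288, 3584)
```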

2. Data Pipeline and Quality Assurance

A central innovation of M2-Reasoning-7B is its two-phase data curation and synthesis pipeline producing 294.2K high-quality samples:

  • Cold-start Fine-Tuning Data (168K samples): Initial training data consist of high-quality, multimodal chain-of-thought (CoT) reasoning samples spanning general and mathematical reasoning. These samples are synthesized via existing vision-language reasoning models and subsequently assessed for accuracy and logical coherence.
  • RLVR (Reinforcement Learning with Verifiable Rewards) Data (126.2K samples): A secondary set targets challenging reasoning scenarios, containing 100K general reasoning tasks and 26.2K spatial reasoning tasks (further split into 18.7K image-based and 7.5K video-based cases). These are carefully annotated through processes including depth estimation, instance segmentation, and explicit 3D geometric labeling.

The pipeline includes both free-form reasoning and spatially referenced question–answer pairs. Automated model-based and human assessments validate the correctness, structural organization, and verification richness of each reasoning trajectory. For spatial reasoning tasks, question–answer pairs are generated to probe capabilities such as object counting, relative distance inference, and appearance ordering.
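The dataset composition above decomposes cleanly; the following snippet just checks the reported counts against each other (values in thousands of samples, as stated in the text):

```python
# Dataset composition (in thousands of samples), from the figures above
cold_start = 168.0          # cold-start fine-tuning data
rlvr_general = 100.0        # RLVR general reasoning tasks
rlvr_spatial_image = 18.7   # RLVR image-based spatial tasks
rlvr_spatial_video = 7.5    # RLVR video-based spatial tasks

rlvr_spatial = rlvr_spatial_image + rlvr_spatial_video  # 26.2K spatial
rlvr_total = rlvr_general + rlvr_spatial                # 126.2K RLVR
total = cold_start + rlvr_total                         # 294.2K overall
print(rlvr_spatial, rlvr_total, total)
```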

3. Training Strategy and Dynamic Multi-Task Optimization

M2-Reasoning-7B introduces a dynamic multi-task training methodology with two primary stages:

  • Stage 1 (Supervised Cold-Start Phase): The model is first fine-tuned using the cold-start data to activate foundational reasoning abilities and to standardize the response format.
  • Stage 2 (RLVR Phase): Reinforcement learning using a form of generalized reward policy optimization (GRPO) further refines the model. Curriculum sampling is employed, ordering samples from simple to complex by a prompt difficulty score, and dynamic adjustment of hyper-parameters provides stability across diverse task types.
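Curriculum sampling in Stage 2 can be sketched as a simple-to-complex ordering over prompts by a precomputed difficulty score. How the paper computes that score is not detailed here, so the scalar `difficulty` field below is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Prompt:
    text: str
    difficulty: float  # assumed precomputed per-prompt difficulty score

def curriculum_order(prompts):
    """Curriculum sampling sketch: present easier prompts first by sorting
    on the difficulty score (the scoring method itself is an assumption)."""
    return sorted(prompts, key=lambda p: p.difficulty)

batch = [Prompt("multi-step proof", 0.9),
         Prompt("count objects", 0.2),
         Prompt("estimate distance", 0.5)]
print([p.text for p in curriculum_order(batch)])
# ['count objects', 'estimate distance', 'multi-step proof']
```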

Rewards are specifically formulated for each task domain. For general abstract reasoning, an accuracy reward is calculated by exact matching (or by using automated tools such as Math-Verify). For spatial reasoning, a continuous-valued Exponential Decay Numeric Matching (EDNM) reward is used:

R^{\text{EDNM}}(x) = \gamma \cdot \exp\left(-\lambda \cdot \frac{|x - x_{\text{gt}}|}{|x_{\text{gt}}| + \epsilon}\right)

where \gamma and \lambda regulate the reward decay, x is the predicted value, and x_{\text{gt}} is the ground truth.
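A direct implementation of the EDNM formula is straightforward; the default values of gamma, lambda, and epsilon below are illustrative rather than the paper's settings:

```python
import math

def ednm_reward(x: float, x_gt: float, gamma: float = 1.0,
                lam: float = 1.0, eps: float = 1e-6) -> float:
    """Exponential Decay Numeric Matching: reward decays smoothly with the
    relative error between prediction x and ground truth x_gt.
    gamma/lam/eps defaults are illustrative, not the paper's values."""
    rel_err = abs(x - x_gt) / (abs(x_gt) + eps)
    return gamma * math.exp(-lam * rel_err)

print(ednm_reward(10.0, 10.0))  # 1.0 (exact prediction gets full reward)
print(ednm_reward(12.0, 10.0))  # ~0.82 (20% relative error, partially rewarded)
```

Unlike exact matching, a near-miss numeric estimate still earns partial credit, which gives the policy a smoother gradient signal on perceptual quantities such as distances.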

Dynamic multi-tasking is realized not only through the separate reward formulations but also by curriculum-driven data ordering, tailored advantage normalization,

\hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_i\}_{i=1}^{G})}{\mathrm{std}(\{R_i\}_{i=1}^{G})}

and cosine-annealed KL-penalty schedules to promote robust learning on difficult cases.
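These two mechanisms can be sketched together: z-scoring each reward within its rollout group of size G, and annealing the KL-penalty coefficient on a cosine schedule. The small epsilon in the denominator and the schedule endpoints (beta_max, beta_min) are assumptions for numerical safety and illustration:

```python
import math

def group_advantages(rewards):
    """GRPO-style advantage: z-score each reward within its rollout group.
    A small epsilon guards against zero variance (an implementation assumption)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def cosine_kl_coef(step, total_steps, beta_max=0.1, beta_min=0.0):
    """Cosine-annealed KL-penalty coefficient (endpoint values are illustrative)."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return beta_min + (beta_max - beta_min) * cos

print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[1.0, -1.0, 1.0, -1.0]
print(cosine_kl_coef(0, 100), cosine_kl_coef(100, 100))  # 0.1 then 0.0
```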

4. Logically Coherent Reasoning Trajectories

A defining feature of M2-Reasoning-7B is the generation and rigorous assessment of logically coherent chain-of-thought trajectories. For each reasoning instance, the model is trained and validated not only on final answer correctness but also on the organization, step-by-step validation frequency, and the richness of verification strategies within the solution trajectory.

In general reasoning tasks, this approach ensures multi-step explanations exhibit structural quality and manage cognitive load—a critical factor for advanced tasks such as mathematical theorem proving or multi-stage logical deduction. For spatial reasoning, synthesized tasks demand detailed consideration of object relations, relative positioning, and event ordering over time.

5. Task-Specific Rewards and Fine-Grained Supervision

Reward functions are adjusted to the properties of individual task families. For general reasoning challenges:

R^{\text{gen}}_i = \mathbb{I}(o_i, \text{ground truth})

using an indicator function for exact match evaluation.

For tasks requiring numeric prediction (e.g., spatial distance), the EDNM function (see above) replaces discrete matching. The RLVR data regime utilizes a blend of automated rule-based scoring (for multiple-choice or symbolic tasks) and continuous decay for perceptual estimates.
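The blend of rule-based exact matching and continuous decay amounts to routing each sample to the reward suited to its task family. The task-type labels and the lambda value in this sketch are assumptions; only the two reward shapes come from the text:

```python
import math

def indicator_reward(output: str, ground_truth: str) -> float:
    """Exact-match indicator reward for symbolic or multiple-choice answers."""
    return 1.0 if output.strip() == ground_truth.strip() else 0.0

def numeric_reward(pred: float, gt: float, lam: float = 1.0) -> float:
    """EDNM-style continuous reward for numeric estimates (lam is illustrative)."""
    return math.exp(-lam * abs(pred - gt) / (abs(gt) + 1e-6))

def route_reward(task_type: str, pred: str, gt: str) -> float:
    """Per-task reward routing sketch; the task-type taxonomy is an assumption."""
    if task_type in ("multiple_choice", "symbolic"):
        return indicator_reward(pred, gt)
    return numeric_reward(float(pred), float(gt))

print(route_reward("multiple_choice", "B", "B"))  # 1.0
print(route_reward("distance", "3.3", "3.0"))     # close to 1, partial credit
```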

These tailored rewards, coupled with dynamic curriculum strategies and stepwise optimization, are critical to resolving the conflicts between abstract and spatial data distributions—one of the principal hurdles in unified MLLM training.

6. Benchmark Results and Comparative Performance

M2-Reasoning-7B establishes new SOTA results across eight major benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista (general reasoning), and CV-Bench and VSI-Bench (spatial reasoning). Notable quantitative outcomes include:

  • An average general reasoning score of 45.0, an improvement of +9.5 points over the base model.
  • A CV-Bench spatial evaluation score of 82.3, surpassing prior leading models.
  • A VSI-Bench average score of 42.3, with record scores on Room Size estimation (55.4) and Relative Direction (47.3).

These results highlight not only general improvements in multimodal question answering and chain-of-thought reasoning but also significant gains in spatial reasoning tasks requiring the understanding of visual relationships in static and dynamic contexts.

7. Applications and Theoretical Implications

The unified approach of M2-Reasoning-7B extends the applicability of MLLMs to areas such as robotics (autonomous navigation, manipulation), architectural and urban simulation (where dynamic spatial comprehension is critical), and advanced GUI agents. Enhanced spatial reasoning supports more reliable and interpretable responses in real-world, visually grounded environments.

A plausible implication is that the integration of logically validated reasoning trajectories with dynamic multi-task reinforcement strategies can enable MLLMs not only to excel in existing abstract reasoning tasks but also to generalize to novel spatially complex problems. This multi-domain unification, achieved through joint optimization of heterogeneous data and task-specific rewards, provides a template for future models seeking robust multimodal reasoning capacity.

In summary, M2-Reasoning-7B delivers an advanced solution for unified general and spatial reasoning through architectural innovations, a meticulously engineered data pipeline, and adaptive training methodology. Its documented benchmark gains and robust performance across both language and spatial domains mark it as a notable milestone in multimodal model research (AI et al., 11 Jul 2025).
