M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

Published 11 May 2026 in cs.AI | (2605.09879v1)

Abstract: While reasoning has become a central capability of LLMs, the reasoning patterns required for different scenarios are often misaligned. Mathematical reasoning typically relies on intrinsic logic to solve closed-world problems in a single response, whereas agentic reasoning requires not only internal reasoning but also multi-turn interaction with external environments, interleaving thought and action. This misalignment prevents mathematical and agentic reasoning from effectively benefiting from each other, often yielding unstable reasoning behavior and only limited performance gains under multi-task learning. In this paper, we propose M2A, a novel paradigm that synergizes mathematical and agentic reasoning via model merging. To avoid overfitting to superficial reasoning patterns under joint training, M2A operates directly in parameter space: it identifies the feature subspace critical for agent behavior, and merges the mathematical reasoning task vector only along its null space, thereby injecting reasoning capability along directions that do not perturb agent behavior. Unlike SFT or RL, M2A requires no additional gradient-update and exposes the merging coefficient as a simple knob for controlling reasoning length. Experiments in a challenging real-world coding agent setting show that our method effectively extends agentic reasoning depth and delivers substantial performance improvements. Applied to a fine-tuned Qwen3-8B, M2A improves its SWE-Bench Verified resolved rate from 44.0% to 51.2% without retraining the model. Code is available at https://github.com/laplucky/M2A.git.


Authors (4)

Summary

The paper introduces a novel training-free paradigm that synergizes mathematical and agentic reasoning using behavior-preserving null-space merging.
It achieves a 7.2 percentage-point improvement on SWE-Bench Verified (44.0% to 51.2% resolved) by controlling reasoning depth while preserving the interactive agent behavior.
The method provides fine-grained control over reasoning and action balance, ensuring enhanced stability, interpretability, and computational efficiency.

Synergizing Mathematical and Agentic Reasoning with M2A

Motivation and Problem Formulation

Recent advances in LLMs have led to significant progress in both mathematical reasoning (e.g., chain-of-thought on closed-domain problems) and agentic reasoning (multi-turn, interaction-driven problem solving in open environments). However, these two forms of reasoning follow fundamentally different behavior patterns: mathematical reasoning typically builds an internal chain of logic within a single turn, while agentic reasoning alternates between internal thought and external action over multi-turn environment interaction. According to the paper, naive multi-task training or model merging fails to synergize the two, leading to unstable and often degraded agentic behavior.

The paper "M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models" (2605.09879) addresses this lack of synergy. M2A introduces a training-free paradigm: it merges mathematical reasoning into agentic LLMs by restricting the merged parameter updates to directions that preserve agent behavior. The core design objective is to enhance internal reasoning while maintaining the agent-specific think–act–observe interaction loop, avoiding the conflicts that afflict standard joint training or brute-force weight merging.

M2A Approach: Behavior-Preserving Model Merging

Failure Modes of Existing Approaches

Multi-task supervised fine-tuning (SFT) on a mixture of mathematical and agentic datasets superficially transfers the pattern of longer reasoning chains to agent models but fails to improve agentic task completion. In practice, it can even degrade agent performance by disrupting the agent’s interaction frequency.

Naive model merging—directly averaging or linearly combining parameter deltas from reasoning and agent models—does not respect the behavioral distinctions between the two domains. This often results in models that ‘overthink’ (i.e., generate lengthy internal reasoning without acting) or fail to interact effectively with the environment.
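For concreteness, naive merging of this kind can be sketched as a linear combination of task vectors. The sketch below is illustrative only: it assumes both fine-tuned models share one base checkpoint, and the names `naive_merge`, `alpha`, and `beta` are hypothetical rather than taken from the paper.

```python
import numpy as np

def naive_merge(base, agent, reasoner, alpha=1.0, beta=1.0):
    """Linearly combine two task vectors on top of shared base weights."""
    tau_agent = agent - base        # parameter delta learned for agentic tasks
    tau_math = reasoner - base      # parameter delta learned for math reasoning
    # Every direction of tau_math is added, including directions that the
    # agent relies on for its interaction behavior.
    return base + alpha * tau_agent + beta * tau_math

# Toy 2x2 "layer" standing in for one weight matrix.
base = np.zeros((2, 2))
agent = np.array([[1.0, 0.0], [0.0, 1.0]])
reasoner = np.array([[0.0, 2.0], [2.0, 0.0]])

merged = naive_merge(base, agent, reasoner, alpha=1.0, beta=0.5)
print(merged)
```

Nothing in this combination distinguishes update directions that matter for agent behavior from those that do not, which is the gap the behavior-aware approach is meant to close.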

Behaviorally-Informed Null-Space Merging

M2A reframes the integration challenge as a problem of preserving the agent’s behavior-critical parameter subspace. The solution comprises three main components:

Agent-Critical Subspace Calibration: M2A identifies the directions in hidden state space that encode agentic behavior by extracting features around behavioral transition markers (e.g., special tokens for thinking, function calls, and their boundaries) from agent model activations.
Null-Space Projection: Before merging, the parameter updates responsible for mathematical reasoning are projected, layer by layer, onto the orthogonal complement (null space) of the agent-critical subspace. This ensures that reasoning injection does not perturb the think–act–observe dynamics of the agent.
Adaptive Layer-Wise Merging: The magnitude and locus of reasoning integration are adaptively controlled. Each layer’s merge coefficient is scaled to normalize the norm mismatch between agent and reasoning parameter deltas, and only layers exhibiting sufficient update alignment (as measured by cosine similarity) are merged.
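The first two components can be illustrated with a minimal NumPy sketch for a single linear layer. It is not the authors' implementation: it assumes the agent-critical subspace is estimated from the top singular directions of calibration features, that the layer computes `x @ W.T`, and that the projector is applied on the input side of the update. The function names, the rank, and the synthetic features are all hypothetical.

```python
import numpy as np

def agent_critical_basis(features, rank):
    """Orthonormal basis (d_in x rank) for the dominant directions of features (n x d_in)."""
    # Right singular vectors span the directions the calibration inputs occupy.
    _, _, vt = np.linalg.svd(features, full_matrices=False)
    return vt[:rank].T

def project_to_null_space(delta, basis):
    """Remove from delta (d_out x d_in) every component that acts on span(basis)."""
    projector = np.eye(basis.shape[0]) - basis @ basis.T
    return delta @ projector

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 8, 4

# Synthetic stand-in for agent hidden states collected around behavioral
# transition markers; constructed to lie exactly in a rank-4 subspace.
agent_features = rng.normal(size=(200, rank)) @ rng.normal(size=(rank, d_in))

w_agent = rng.normal(size=(d_out, d_in))      # agent layer weights
delta_math = rng.normal(size=(d_out, d_in))   # stand-in for the math task vector

basis = agent_critical_basis(agent_features, rank)
delta_safe = project_to_null_space(delta_math, basis)

beta = 0.5  # merge strength
w_merged = w_agent + beta * delta_safe
w_naive = w_agent + beta * delta_math

# Largest change in layer output on the agent-critical inputs.
drift = np.abs(agent_features @ (w_merged - w_agent).T).max()
naive_drift = np.abs(agent_features @ (w_naive - w_agent).T).max()
print(f"projected drift: {drift:.2e}, naive drift: {naive_drift:.2e}")
```

In this toy setting the projected update leaves the layer's output on the calibration features unchanged up to floating-point error, while the unprojected update shifts it noticeably. Real hidden states are not exactly low-rank, so choosing the rank becomes a design decision.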

By these mechanisms, M2A injects mathematical reasoning ability while strictly prohibiting update directions that would interfere with agentic action patterns.
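The adaptive layer-wise step can be sketched in the same spirit. The text does not give the exact scaling or masking rule, so the version below is an assumption: the coefficient is `beta` times the ratio of delta norms, and a layer is skipped when the cosine similarity between the two deltas falls below a hypothetical threshold `min_cosine`. The reasoning deltas passed in are assumed to be already projected.

```python
import numpy as np

def layer_coefficient(delta_agent, delta_math, beta, min_cosine):
    """Per-layer merge coefficient: norm-matched and masked by alignment."""
    a, m = delta_agent.ravel(), delta_math.ravel()
    norm_a, norm_m = np.linalg.norm(a), np.linalg.norm(m)
    if norm_a == 0.0 or norm_m == 0.0:
        return 0.0                      # nothing to align against, skip layer
    cosine = float(a @ m) / (norm_a * norm_m)
    if cosine < min_cosine:
        return 0.0                      # poorly aligned layer is masked out
    return beta * norm_a / norm_m       # rescale math delta to the agent delta's norm

def merge_layers(agent_weights, agent_deltas, math_deltas, beta=0.5, min_cosine=0.0):
    """Apply the per-layer coefficient to every named weight matrix."""
    merged = {}
    for name, weight in agent_weights.items():
        coeff = layer_coefficient(agent_deltas[name], math_deltas[name], beta, min_cosine)
        merged[name] = weight + coeff * math_deltas[name]
    return merged

# Toy two-layer model; layer names are placeholders.
agent_weights = {"attn": np.ones((2, 2)), "mlp": np.ones((2, 2))}
agent_deltas = {"attn": np.eye(2), "mlp": np.eye(2)}
math_deltas = {"attn": 4.0 * np.eye(2), "mlp": -np.eye(2)}

merged = merge_layers(agent_weights, agent_deltas, math_deltas, beta=0.5, min_cosine=0.0)
print(merged["attn"])   # aligned layer: math delta shrunk to the agent delta's scale
print(merged["mlp"])    # opposed layer: left untouched
```

Here the aligned `attn` layer receives the reasoning delta rescaled by a factor of 0.125, while the `mlp` layer, whose deltas point in opposite directions, is left at the agent's weights.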

Empirical Results and Analysis

M2A is evaluated on SWE-Bench Verified, a challenging real-world coding-agent benchmark based on resolving GitHub issues. Critical observations regarding performance and behavioral traits are as follows.

Figure 1: (a) M2A demonstrates simultaneous improvements on mathematical and agentic reasoning benchmarks; (b) Component ablation shows that null-space projection, layer-wise normalization, and layer masking are crucial for M2A's superior agentic performance.

Strong numerical gains: On fine-tuned Qwen3-8B, M2A raises the resolved rate on SWE-Bench Verified from 44.0% to 51.2%, a gain of 7.2 percentage points without any additional model training.
Stability and interpretability: Average per-step reasoning length increases (from 253.3 to 327.4 tokens), while average interaction steps remain high (178.0), indicating that enhanced reasoning does not come at the cost of reduced environment interaction.
Component necessity: Ablation reveals that removal of null-space projection or adaptive merging mechanisms sharply diminishes performance and stability, establishing the necessity of each pipeline stage.
Control knob for reasoning behavior: The merge strength parameter, β, reliably modulates reasoning length nearly linearly in the performant regime, allowing practitioners fine control over the tradeoff between reasoning intensity and interaction step count. This stands in contrast to standard baselines, where increased blending often precipitates abrupt behavioral collapse.
Figure 2: Increasing merge strength β in M2A yields smooth, predictable growth in reasoning depth and maintains robust performance up to a well-behaved capacity limit; termination shift analysis indicates the method's regularization effect on failure modes.
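To put the reported figures in perspective, the short calculation below separates the absolute gain (percentage points) from the relative gain, and converts the rates into issue counts. The only outside input is the size of SWE-Bench Verified (500 task instances), which is a property of the benchmark and is not stated in the text above.

```python
# Figures reported for fine-tuned Qwen3-8B before and after M2A.
resolved_before, resolved_after = 44.0, 51.2    # resolved rate, percent
tokens_before, tokens_after = 253.3, 327.4      # average per-step reasoning length

# SWE-Bench Verified contains 500 task instances (benchmark fact, not from the text).
NUM_INSTANCES = 500

gain = resolved_after - resolved_before
absolute_gain = round(gain, 1)                                  # percentage points
relative_gain = round(100 * gain / resolved_before, 1)          # percent of baseline
extra_resolved = round(NUM_INSTANCES * gain / 100)              # additional issues
token_growth = round(100 * (tokens_after - tokens_before) / tokens_before, 1)

print(f"absolute gain:   {absolute_gain} points")
print(f"relative gain:   {relative_gain}%")
print(f"extra resolved:  {extra_resolved} issues")
print(f"token growth:    {token_growth}%")
```

The improvement is therefore 7.2 percentage points, or about 16.4% relative to the baseline, corresponding to 36 additional resolved issues, alongside per-step reasoning that is roughly 29.3% longer.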

Theoretical and Practical Implications

M2A demonstrates that classical multi-task generalization failures in LLMs, caused by interference between behaviorally distinct domains, can be circumvented by geometric control in parameter space. By explicitly identifying and protecting behavioral subspaces specific to agentic operation, reasoning enhancements become synergistic rather than disruptive.

Behaviorally-informed merging provides several advantages:

Training-free upskilling: M2A achieves substantial capability transfer without gradient-based adaptation, making it computationally attractive for large-scale or resource-constrained deployments.
Behavioral stability: Unlike naive merging, which can induce pathological activity (e.g., infinite internal monologues), M2A ensures that agents act and observe at the correct junctures, leading to higher productivity and fewer context-exhaustion failures.
Fine-grained control: Practitioners can explicitly tune reasoning depth for domain- or task-specific requirements by adjusting merge strength and layer selection.

Empirical Insights on Trajectory Shift

Qualitative analysis confirms that M2A induces a measurable shift from trial-and-error editing toward evidence-grounded action. Agents using the merged model gather more evidence, commit fewer but more targeted edits, and maintain robust performance in the presence of longer reasoning traces. Critically, these advances do not degrade general instruction-following ability, maintaining versatility beyond the agentic context.

Future Directions

This paradigm opens avenues for further exploration:

Extending behavior-critical subspace calibration to more diverse agentic behaviors and multimodal domains
Automated discovery of optimal merge strength and protected subspaces for arbitrary task pairs
Exploiting behavior-preserving merging for continual, lifelong agent training without catastrophic forgetting

The fine-grained behavioral control and stability conferred by M2A could lead to more interpretable, reliable, and customizable LLM agents—both in software engineering and broader autonomous interaction scenarios.

Conclusion

M2A offers a principled, training-free method to synergize mathematical and agentic reasoning in LLMs through behavior-preserving null-space model merging. The result is a substantial and controllable performance increase on agentic tasks, without sacrificing stability or generality. This work establishes a technical foundation for future LLM systems that integrate heterogeneous reasoning modalities without behavioral interference.
