Multi-Path Perception Policy Optimization (M3PO)
- The paper introduces M3PO, a reinforcement learning framework that integrates multiple parallel reasoning rollouts with cross-path collaborative attention.
- It employs on-policy gradient updates with normalized group rewards to promote diverse, robust reasoning trajectories in large language models.
- Empirical results demonstrate significant improvements on knowledge-intensive and STEM reasoning benchmarks, showing the method's advantage over prior policy-optimization baselines such as GRPO and HRPO.
Multi-Path Perception Policy Optimization (M3PO) is a reinforcement learning framework for LLMs, engineered to induce robust, multi-hypothesis reasoning through coordinated exploration and collaborative inference. Unlike conventional Chain-of-Thought (CoT) decoding, which generates a single deterministic sequence, or soft-token (continuous mixture) approaches that aggregate semantic alternatives in embedding space, M3PO executes multiple reasoning rollouts in parallel and integrates their intermediate states at each step via a dedicated cross-path mechanism. The policy is then updated using normalized group rewards, promoting learning from collective insight. M3PO has demonstrated state-of-the-art performance on diverse knowledge and STEM reasoning benchmarks, establishing structured multi-path collaboration as an effective inductive bias for complex reasoning in autoregressive LLMs (Lv et al., 1 Dec 2025).
1. Theoretical Objective and Distinction from Prior Methods
Conventional Chain-of-Thought decoding generates a single discrete token sequence, inherently deterministic and limited to exploring one reasoning trajectory per query. Soft-token strategies enable continuous mixtures of token embeddings at each decision step, facilitating gradient-based updates in the pretrained embedding space. However, they reinforce dominant semantic directions without enabling genuine trajectory-level diversity, and remain subject to the isolation imposed by greedy autoregressive decoding.
M3PO formulates its objective as the maximization of the expected group-normalized cumulative reward over $N$ parallel trajectories, with explicit cross-path interactions infused at each reasoning step:

$$\mathcal{J}(\theta) = \mathbb{E}_{q,\ \{\tau_i\}_{i=1}^{N} \sim \pi_\theta(\cdot \mid q)}\!\left[\frac{1}{N}\sum_{i=1}^{N} \frac{r(\tau_i) - \mathrm{mean}\!\left(\{r(\tau_j)\}_{j=1}^{N}\right)}{\mathrm{std}\!\left(\{r(\tau_j)\}_{j=1}^{N}\right)}\, \log \pi_\theta(\tau_i \mid q)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$

where $\tau_i$ is the $i$-th rollout, $r(\tau_i)$ is a trajectory reward (binary or scalar, e.g., answer correctness), and the KL term with coefficient $\beta$ stabilizes learning with respect to a frozen reference policy $\pi_{\mathrm{ref}}$. This explicit objective supports simultaneous trajectory exploration and reward normalization across concurrent reasoning paths (Lv et al., 1 Dec 2025).
2. Rollout Generation and Policy Structure
The vocabulary and embedding matrix are initialized from pretrained parameters. Given an input question $q$, the policy $\pi_\theta$ generates $N$ parallel rollouts, each as a token embedding sequence:

$$\tau_i = \left(\tilde{e}_{i,1}, \tilde{e}_{i,2}, \ldots, \tilde{e}_{i,T}, a_i\right), \qquad i = 1, \ldots, N,$$

where $T$ is the number of reasoning steps, $a_i$ is the final answer, and $\tilde{e}_{i,t}$ is a "hybrid" embedding integrating both local and peer trajectory context at step $t$. Each rollout thus forms an independent hypothesis trajectory, but intermediate state evolution is influenced by collaborative information exchange.

At step $t$, each rollout $i$ samples a candidate token embedding $e_{i,t}$ from its policy distribution $\pi_\theta(\cdot \mid q, \tilde{e}_{i,<t})$. This design allows for simultaneous, intertwined yet diverse exploration (Lv et al., 1 Dec 2025).
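As a concrete illustration of the per-step sampling just described, the following minimal PyTorch sketch draws one candidate token embedding per rollout. It is not the authors' implementation; `policy_logits` and `embedding_matrix` are hypothetical stand-ins for the pretrained LM head outputs and embedding table.

```python
# Minimal sketch of one decoding step across N parallel rollouts (illustrative only).
import torch

def sample_parallel_step(policy_logits: torch.Tensor,
                         embedding_matrix: torch.Tensor) -> torch.Tensor:
    """Sample one candidate token embedding per rollout.

    policy_logits:    [N, V] next-token logits, one row per rollout.
    embedding_matrix: [V, d] pretrained token embedding table.
    Returns:          [N, d] local embeddings e_{i,t}.
    """
    probs = torch.softmax(policy_logits, dim=-1)         # per-rollout next-token distributions
    token_ids = torch.multinomial(probs, num_samples=1)  # [N, 1] sampled token indices
    return embedding_matrix[token_ids.squeeze(-1)]       # [N, d] local embeddings
```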
3. Loss Function and Policy Update
Trajectory rewards $r(\tau_i)$ are computed per rollout. The group-relative advantage is calculated as follows:

$$A_i = \frac{r(\tau_i) - \mathrm{mean}\!\left(\{r(\tau_j)\}_{j=1}^{N}\right)}{\mathrm{std}\!\left(\{r(\tau_j)\}_{j=1}^{N}\right)}.$$

The M3PO gradient updates employ on-policy weighting with advantage normalization and KL regularization:

$$\nabla_\theta \mathcal{L}(\theta) = -\,\frac{1}{N}\sum_{i=1}^{N} A_i\, \nabla_\theta \log \pi_\theta(\tau_i \mid q) + \beta\, \nabla_\theta D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right).$$
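For concreteness, consider a hypothetical group of $N = 4$ rollouts with binary correctness rewards (an illustrative example, not taken from the paper):

$$r = (1, 0, 0, 1), \qquad \mathrm{mean}(r) = 0.5, \qquad \mathrm{std}(r) = 0.5, \qquad A = (+1, -1, -1, +1),$$

so correct trajectories receive positive advantage and are reinforced relative to their incorrect peers within the same group.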
In contrast to Proximal Policy Optimization (PPO), M3PO operates strictly on-policy, using raw log-probabilities without likelihood ratios or clipping. This preserves compatibility with the collaborative trajectory design and supports stable, interpretable learning dynamics (Lv et al., 1 Dec 2025).
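A minimal sketch of this update, assuming PyTorch and per-trajectory summed log-probabilities; `logp` and `logp_ref` are hypothetical names, and the KL penalty toward the frozen reference policy is approximated by a simple sample-based surrogate rather than any specific form from the paper.

```python
# Group-relative advantages and an advantage-weighted on-policy loss with KL regularization.
# Raw log-probabilities are used directly (no PPO likelihood ratio or clipping).
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: [N] trajectory rewards -> [N] group-normalized advantages A_i."""
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)

def m3po_loss(logp: torch.Tensor,       # [N] sum of log pi_theta over each rollout
              logp_ref: torch.Tensor,   # [N] same sums under the frozen reference policy
              rewards: torch.Tensor,    # [N] trajectory rewards r(tau_i)
              beta: float = 0.01) -> torch.Tensor:
    adv = group_advantages(rewards).detach()   # baseline carries no gradient
    policy_term = -(adv * logp).mean()         # advantage-weighted on-policy term
    kl_term = (logp - logp_ref).mean()         # sample-based KL(pi_theta || pi_ref) surrogate
    return policy_term + beta * kl_term
```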
4. Cross-Path Collaborative Reasoning Mechanism
At each reasoning step $t$, M3PO computes a hybrid embedding $\tilde{e}_{i,t}$ for each rollout $i$ via cross-path attention:

$$\tilde{e}_{i,t} = (1 - \lambda)\, e_{i,t} + \lambda\, c_{i,t},$$

where $e_{i,t}$ is the local embedding and $c_{i,t}$ is the cross-path contextual embedding. The procedure comprises:
- Similarity Matrix: $S_{ij} = \mathrm{sim}\!\left(p_{i,t}, p_{j,t}\right)$ over the rollouts' next-token distributions, with $S_{ii} = -\infty$ to exclude self-interaction.
- Attention Weights: $\alpha_{ij} = \mathrm{softmax}_j\!\left(S_{ij} / \tau_{\mathrm{attn}}\right)$, introducing a temperature $\tau_{\mathrm{attn}}$ for selective attention.
- Peer Embedding Fusion: $c_{i,t} = \sum_{j \neq i} \alpha_{ij}\, e_{j,t}$.
This design facilitates information exchange among rollouts with similar next-token distributions, enabling each trajectory to correct local biases through peer input while maintaining individual reasoning strands. The parameter $\lambda$ modulates the proportion of peer feedback integrated at each step, with optimal performance reported at an intermediate value (Lv et al., 1 Dec 2025).
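A minimal sketch of the fusion step under the assumptions above: dot-product similarity over next-token distributions, a temperature softmax with the diagonal masked out, and a convex combination of local and peer embeddings. All variable names and default values (`lam`, `temp`) are illustrative, not taken from the paper.

```python
# Cross-path collaborative fusion for one reasoning step (illustrative sketch).
import torch

def cross_path_fuse(probs: torch.Tensor,   # [N, V] next-token distributions p_{i,t}
                    e_t: torch.Tensor,     # [N, d] local embeddings e_{i,t}
                    lam: float = 0.3,      # peer-fusion weight lambda (assumed value)
                    temp: float = 1.0) -> torch.Tensor:
    """Return hybrid embeddings ~e_{i,t} = (1 - lam) * e_{i,t} + lam * c_{i,t}."""
    sim = probs @ probs.t()                                    # [N, N] similarity matrix
    mask = torch.eye(probs.size(0), dtype=torch.bool, device=probs.device)
    sim = sim.masked_fill(mask, float("-inf"))                 # exclude self-interaction
    attn = torch.softmax(sim / temp, dim=-1)                   # [N, N] peer attention weights
    c_t = attn @ e_t                                           # [N, d] peer fusion c_{i,t}
    return (1.0 - lam) * e_t + lam * c_t                       # hybrid embeddings
```

Masking the diagonal ensures a rollout never attends to itself, so peer feedback always comes from other hypothesis trajectories and individual strands remain distinct.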
5. Learning Algorithm Implementation
The M3PO training loop operates as follows:
- Initialize model parameters $\theta$, reference policy $\pi_{\mathrm{ref}}$, and hyperparameters ($N$, $\lambda$, $\tau_{\mathrm{attn}}$, $\beta$).
- For each question $q$ in a training batch:
  - Generate $N$ parallel rollouts, each producing a sequence of embeddings.
  - At each reasoning step, compute distributions $\pi_\theta(\cdot \mid q, \tilde{e}_{i,<t})$, peer similarities $S_{ij}$, cross-path fusions $c_{i,t}$, and hybrid embeddings $\tilde{e}_{i,t}$.
  - Complete the reasoning chain with a terminal answer token.
  - Evaluate rewards, compute group-normalized advantages, and update $\theta$ via the advantage-weighted on-policy gradient and KL regularizer.
This workflow allows M3PO to jointly optimize trajectory diversity, peer alignment, and policy stability, as formalized in the paper's pseudocode (Lv et al., 1 Dec 2025). A compact end-to-end sketch follows.
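The sketch below strings the pieces together for one training step, reusing `m3po_loss` from the earlier snippet. `model`, `ref_model`, `reward_fn`, and `generate_parallel_rollouts` are hypothetical stand-ins; a real rollout loop would apply the cross-path fusion (e.g., `cross_path_fuse`) at every reasoning step before appending the hybrid embedding.

```python
# Compact end-to-end sketch of one M3PO-style training step (illustrative only).
import torch

def m3po_train_step(model, ref_model, optimizer, question, reward_fn,
                    n_rollouts: int = 8, beta: float = 0.01):
    # 1. Generate N parallel rollouts with cross-path fusion at each step,
    #    returning per-trajectory log-prob sums under policy and reference.
    rollouts, logp, logp_ref = generate_parallel_rollouts(
        model, ref_model, question, n_rollouts)            # logp, logp_ref: [N]

    # 2. Score each trajectory, e.g., exact-match answer correctness.
    rewards = torch.tensor([reward_fn(r) for r in rollouts], dtype=torch.float)

    # 3. Group-normalized advantages + on-policy loss with KL regularization.
    loss = m3po_loss(logp, logp_ref, rewards, beta=beta)

    # 4. Standard gradient step.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```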
6. Empirical Evaluation and Ablation Studies
M3PO has been evaluated on comprehensive knowledge-intensive (NQ, TriviaQA, HotpotQA, 2WikiMQA, Bamboogle) and STEM (GSM8k, MATH, MATH500, MMLU-STEM, ARC-C) benchmarks. Notable empirical findings include:
- On Qwen-1.5B, M3PO achieved a 35.6% average Exact Match (EM), exceeding GRPO by 9.5 percentage points (pp) and HRPO by 1.9 pp.
- On Qwen-3B, M3PO reached 40.2% EM, surpassing GRPO by 3.5 pp and outperforming a 7B RAG baseline by 6.7 pp.
- In STEM reasoning, Qwen-3B + M3PO delivered 70.5% average accuracy, exceeding GRPO by 1.4 pp and HRPO by 1.8 pp, matching or surpassing 7B-scale few-shot systems. On the MATH dataset, M3PO (3B) achieved 60.7% versus 49.8% for a 7B CoT baseline (+10.9 pp).
Ablations reveal that latent reasoning variants such as "Hidden States" collapse to zero reward and that "Soft Thinking" converges slowly; HRPO improves upon GRPO but remains inferior to M3PO. Modifying the fusion strategy by using a uniform peer mean, or eliminating peer fusion entirely ($\lambda = 0$), degrades both convergence speed and maximum reward. Performance peaks at intermediate settings of the fusion weight $\lambda$ and the attention temperature; larger values of either parameter diminish collaborative selectivity and final accuracy.
Qualitatively, M3PO reasoning chains are shown to be logically coherent and free of formatting noise or looping degeneracy, in contrast to other collaborative or latent reasoning frameworks (Lv et al., 1 Dec 2025).
7. Significance and Implications
M3PO demonstrates that structured, on-policy, multi-path collaboration provides an effective inductive bias for enabling LLMs to internalize robust, generalizable reasoning strategies. By treating each rollout as an independent hypothesis and systematically integrating peer feedback based on distributional similarity and normalized collective reward, M3PO cultivates reliable multi-step reasoning without increasing model parameter count or sacrificing inference efficiency.
This framework establishes a new paradigm for collaborative reasoning in autoregressive models, illustrating that cross-path information exchange can overcome the limitations of deterministic or purely "soft" decoding methods and achieve high accuracy on both knowledge-intensive and formal reasoning benchmarks (Lv et al., 1 Dec 2025).