
Multi-Path Perception Policy Optimization (M3PO)

Updated 8 December 2025
  • The paper introduces M3PO, a reinforcement learning framework that integrates multiple parallel reasoning rollouts with cross-path collaborative attention.
  • It employs on-policy gradient updates with normalized group rewards to promote diverse, robust reasoning trajectories in large language models.
  • Empirical results demonstrate significant improvements in knowledge and STEM benchmarks, showcasing the method’s effectiveness over traditional approaches.

Multi-Path Perception Policy Optimization (M3PO) is a reinforcement learning framework for LLMs, engineered to induce robust, multi-hypothesis reasoning through coordinated exploration and collaborative inference. Unlike conventional Chain-of-Thought (CoT) decoding, which generates a single deterministic sequence, or soft-token (continuous mixture) approaches that aggregate semantic alternatives in embedding space, M3PO executes multiple reasoning rollouts in parallel and integrates their intermediate states at each step via a dedicated cross-path mechanism. The policy is then updated using normalized group rewards, promoting learning from collective insight. M3PO has demonstrated state-of-the-art performance on diverse knowledge and STEM reasoning benchmarks, establishing structured multi-path collaboration as an effective inductive bias for complex reasoning in autoregressive LLMs (Lv et al., 1 Dec 2025).

1. Theoretical Objective and Distinction from Prior Methods

Conventional Chain-of-Thought decoding generates a single discrete token sequence, inherently deterministic and limited to exploring one reasoning trajectory per query. Soft-token strategies enable continuous mixtures of token embeddings at each decision step, facilitating gradient-based updates in the pretrained embedding space. However, they reinforce dominant semantic directions without enabling genuine trajectory-level diversity, and remain subject to the isolation imposed by greedy autoregressive decoding.

M3PO formulates its objective as the maximization of the expected group-normalized cumulative reward over $N$ parallel trajectories, with explicit cross-path interactions introduced at each reasoning step:

$$\max_\theta J_{\mathrm{M3PO}}(\theta) = \mathbb{E}_{\{\tau_i\} \sim \pi_\theta} \left[ \frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \right] - \beta\, \mathrm{KL}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]$$

where $\tau_i$ is the $i$th rollout, $R(\tau_i)$ is a trajectory reward (binary or scalar, e.g., answer correctness), and the KL term stabilizes learning with respect to a frozen reference policy $\pi_{\mathrm{ref}}$. This explicit objective supports simultaneous trajectory exploration and reward normalization across concurrent reasoning paths (Lv et al., 1 Dec 2025).
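
As a concrete illustration, the following minimal PyTorch sketch estimates this objective for a single group of $N$ sampled trajectories; the function name and the simple per-token log-ratio used for the KL term are assumptions made for exposition, not details from the paper.

```python
import torch

def m3po_objective_estimate(rewards, logp_theta, logp_ref, beta):
    """Monte-Carlo estimate of the M3PO objective for one question (sketch).

    rewards:    (N,)   trajectory rewards R(tau_i), e.g. 1.0 for a correct answer.
    logp_theta: (N, L) per-step log-probs of the sampled steps under pi_theta.
    logp_ref:   (N, L) per-step log-probs of the same steps under frozen pi_ref.
    beta:       KL regularization coefficient.
    """
    mean_reward = rewards.mean()                              # (1/N) * sum_i R(tau_i)
    # Crude sampled estimate of KL[pi_theta || pi_ref] over the generated steps.
    kl_estimate = (logp_theta - logp_ref).sum(dim=-1).mean()
    return mean_reward - beta * kl_estimate
```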

2. Rollout Generation and Policy Structure

The vocabulary $V$ and embedding matrix $E \in \mathbb{R}^{|V| \times d}$ are initialized from pretrained parameters. Given an input question $x$, the policy $\pi_\theta$ generates $N$ parallel rollouts, each as a token embedding sequence:

$$\tau_i = \left[E(x),\ \bar{h}_i^{(1)},\ \bar{h}_i^{(2)},\ \dots,\ \bar{h}_i^{(L)},\ E(a_i)\right]$$

where $L$ is the number of reasoning steps, $a_i$ is the final answer, and $\bar{h}_i^{(l)}$ is a "hybrid" embedding integrating both local and peer trajectory context at step $l$. Each rollout thus forms an independent hypothesis trajectory, but intermediate state evolution is influenced by collaborative information exchange.

At step $l$, each rollout $i$ samples a candidate token embedding $e_i^{(l)}$ from its policy distribution $p_i^{(l)} = \pi_\theta(\cdot \mid \tau_i^{(<l)})$. This design allows for simultaneous, intertwined yet diverse exploration (Lv et al., 1 Dec 2025).
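
A minimal sketch of this rollout procedure is given below, assuming a `model` callable that maps an embedding prefix to next-token logits, an `embed` lookup for the pretrained matrix $E$, and a `fuse_fn` implementing the cross-path fusion of Section 4; these names and interfaces are illustrative, not the authors' implementation.

```python
import torch

def generate_parallel_rollouts(model, embed, question_ids, N, L, fuse_fn):
    """Generate N parallel rollouts whose intermediate steps are hybrid embeddings.

    model:   callable mapping an embedding prefix (1, T, d) to vocab logits (1, T, |V|)  [assumed]
    embed:   pretrained embedding lookup E                                               [assumed]
    fuse_fn: cross-path fusion, (local embeddings, step distributions) -> hybrids (Section 4)
    """
    x = embed(question_ids)                                   # E(x): (len_x, d)
    prefixes = [x.clone() for _ in range(N)]                  # one embedding prefix per rollout
    for _ in range(L):
        probs, cands = [], []
        for i in range(N):
            logits = model(prefixes[i].unsqueeze(0))[0, -1]   # next-token logits at the last position
            p = torch.softmax(logits, dim=-1)                 # p_i^(l)
            tok = torch.multinomial(p, num_samples=1)         # sample a candidate token
            probs.append(p)
            cands.append(embed(tok).squeeze(0))               # local embedding e_i^(l)
        hybrids = fuse_fn(cands, probs)                       # hybrid embeddings \bar{h}_i^(l)
        prefixes = [torch.cat([prefixes[i], hybrids[i].unsqueeze(0)], dim=0)
                    for i in range(N)]
    return prefixes                                           # each prefix: [E(x), hybrid steps 1..L]
```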

3. Loss Function and Policy Update

Trajectory rewards $R(\tau_i)$ are computed per rollout. The group-relative advantage is calculated as follows:

$$\mu = \frac{1}{N} \sum_{j=1}^N R(\tau_j), \quad \sigma^2 = \frac{1}{N} \sum_{j=1}^N \left(R(\tau_j) - \mu\right)^2, \quad A(\tau_i) = \frac{R(\tau_i) - \mu}{\sigma}$$

The M3PO gradient updates employ on-policy weighting with advantage normalization and KL regularization:

$$\nabla_\theta J_{\mathrm{M3PO}}(\theta) = \mathbb{E} \left[ \frac{1}{N} \sum_{i=1}^N \left( \sum_{t=1}^{L} \nabla_\theta \log \pi_\theta\!\left(e_i^{(t)} \mid x, \bar{h}_i^{(<t)} \right) \right) \cdot A(\tau_i) \right] - \beta\, \nabla_\theta \mathrm{KL}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]$$

In contrast to Proximal Policy Optimization (PPO), M3PO operates strictly on-policy, using raw log-probabilities without likelihood ratios or clipping. This preserves compatibility with the collaborative trajectory design and supports stable, interpretable learning dynamics (Lv et al., 1 Dec 2025).
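
A compact PyTorch sketch of this update is shown below, assuming per-step log-probabilities of the sampled embeddings have already been gathered; the small `eps` added to the denominator and the sampled per-token KL estimate are assumptions for numerical convenience, not choices documented in the paper.

```python
import torch

def m3po_loss(logp_theta, logp_ref, rewards, beta, eps=1e-6):
    """On-policy M3PO surrogate loss for one group of N rollouts (sketch).

    logp_theta: (N, L) log pi_theta(e_i^(t) | x, hybrid prefix), with gradients.
    logp_ref:   (N, L) log-probs of the same steps under the frozen reference policy.
    rewards:    (N,)   trajectory rewards R(tau_i).
    """
    # Group-relative advantage: standardize rewards within the group of N rollouts.
    mu = rewards.mean()
    sigma = rewards.std(unbiased=False)
    adv = (rewards - mu) / (sigma + eps)              # eps added for stability (assumption)

    # REINFORCE-style term: per-trajectory sum of log-probs, weighted by A(tau_i).
    pg_loss = -(logp_theta.sum(dim=-1) * adv).mean()

    # Simple sampled estimate of KL[pi_theta || pi_ref] over the generated steps.
    kl = (logp_theta - logp_ref.detach()).sum(dim=-1).mean()

    # Minimizing this loss performs gradient ascent on the M3PO objective.
    return pg_loss + beta * kl
```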

4. Cross-Path Collaborative Reasoning Mechanism

At each reasoning step $l$, M3PO computes a hybrid embedding for each rollout via cross-path attention:

$$\bar{h}_i^{(l)} = (1 - \lambda)\, e_i^{(l)} + \lambda\, c_i^{(l)}, \qquad 0 \leq \lambda \leq 1$$

where $e_i^{(l)}$ is the local embedding and $c_i^{(l)}$ is the cross-path contextual embedding. The procedure comprises:

  1. Similarity Matrix: $S_{ij}^{(l)} = \mathrm{cosine}\!\left(p_i^{(l)}, p_j^{(l)}\right)$, with $S_{ii}^{(l)} = 0$ to exclude self-interaction.
  2. Attention Weights: $A_{ij}^{(l)} = \exp\!\left(S_{ij}^{(l)} / T\right) / \sum_{k \neq i} \exp\!\left(S_{ik}^{(l)} / T\right)$, introducing a temperature $T$ for selective attention.
  3. Peer Embedding Fusion: $c_i^{(l)} = \sum_{j \neq i} A_{ij}^{(l)} e_j^{(l)}$.

This design facilitates information exchange among rollouts with similar next-token distributions, enabling each trajectory to correct local biases through peer input while maintaining individual reasoning strands. The $\lambda$ parameter modulates the proportion of peer feedback integrated at each step, with optimal performance observed near $\lambda = 0.1$ (Lv et al., 1 Dec 2025).
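
The fusion step for one reasoning position can be sketched as follows, assuming the $N$ local embeddings and next-token distributions are available as tensors; masking the diagonal with $-\infty$ before the softmax reproduces the exclusion of self-interaction in the peer-normalized attention.

```python
import torch
import torch.nn.functional as F

def cross_path_fuse(cand_embeds, step_probs, lam=0.1, temp=0.1):
    """Cross-path collaborative fusion for one reasoning step (sketch).

    cand_embeds: list of N local token embeddings e_i^(l), each of shape (d,).
    step_probs:  list of N next-token distributions p_i^(l), each of shape (|V|,).
    Returns the N hybrid embeddings as a (N, d) tensor.
    """
    E = torch.stack(cand_embeds)                    # (N, d)
    P = torch.stack(step_probs)                     # (N, |V|)

    # 1. Pairwise cosine similarity of next-token distributions.
    Pn = F.normalize(P, dim=-1)
    S = Pn @ Pn.T                                   # S_ij^(l)
    S.fill_diagonal_(float("-inf"))                 # drop self-interaction before the softmax

    # 2. Temperature-scaled attention weights over peers j != i.
    A = torch.softmax(S / temp, dim=-1)             # each row sums to 1 over peers

    # 3. Peer embedding fusion, then hybrid embedding.
    C = A @ E                                       # c_i^(l)
    return (1.0 - lam) * E + lam * C                # (1 - lambda) e_i^(l) + lambda c_i^(l)
```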

5. Learning Algorithm Implementation

The M3PO training loop operates as follows:

  • Initialize model parameters $\theta$, reference policy $\pi_{\mathrm{ref}}$, and hyperparameters $(N, \lambda, T, \beta)$.
  • For each question $x$ in a training batch:
    • Generate $N$ parallel rollouts, each producing a sequence of embeddings.
    • At each reasoning step, compute distributions $p_i^{(l)}$, peer similarities $S_{ij}^{(l)}$, cross-path fusions $c_i^{(l)}$, and hybrid embeddings $\bar{h}_i^{(l)}$.
    • Complete the reasoning chain with a terminal answer token.
    • Evaluate rewards, compute group-normalized advantages, and update $\theta$ via the advantage-weighted on-policy gradient and KL regularizer.

This workflow allows M3PO to jointly optimize trajectory diversity, peer alignment, and policy stability, as formalized in the pseudocode outlined above (Lv et al., 1 Dec 2025).
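
Under the same assumptions as the sketches above, a minimal end-to-end loop might look as follows; `decode_answer` and `score_rollouts` are hypothetical helpers (answer extraction and per-step log-probability scoring are not specified here), and the hyperparameter defaults are purely illustrative.

```python
import torch

def train_m3po(model, ref_model, embed, optimizer, dataset,
               N=8, L=32, lam=0.1, temp=0.1, beta=0.01):
    """One pass of M3PO training (illustrative sketch, not the authors' code)."""
    fuse = lambda cands, probs: cross_path_fuse(cands, probs, lam, temp)
    for question_ids, gold_answer in dataset:
        # 1. Generate N parallel rollouts with cross-path fusion at every step.
        rollouts = generate_parallel_rollouts(model, embed, question_ids, N, L, fuse)

        # 2. Decode a terminal answer per rollout and score it (binary reward here).
        answers = [decode_answer(model, r) for r in rollouts]          # hypothetical helper
        rewards = torch.tensor([float(a == gold_answer) for a in answers])

        # 3. Per-step log-probs of the sampled steps under current and reference policies.
        logp_theta = score_rollouts(model, rollouts)                   # (N, L), hypothetical helper
        with torch.no_grad():
            logp_ref = score_rollouts(ref_model, rollouts)             # (N, L)

        # 4. Advantage-weighted on-policy update with KL regularization.
        loss = m3po_loss(logp_theta, logp_ref, rewards, beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```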

6. Empirical Evaluation and Ablation Studies

M3PO has been evaluated on comprehensive knowledge-intensive (NQ, TriviaQA, HotpotQA, 2WikiMQA, Bamboogle) and STEM (GSM8k, MATH, MATH500, MMLU-STEM, ARC-C) benchmarks. Notable empirical findings include:

  • On Qwen-1.5B, M3PO achieved a 35.6% average Exact Match (EM), exceeding GRPO by 9.5 percentage points (pp) and HRPO by 1.9 pp.
  • On Qwen-3B, M3PO reached 40.2% EM, surpassing GRPO by 3.5 pp and outperforming a 7B RAG baseline by 6.7 pp.
  • In STEM reasoning, Qwen-3B + M3PO delivered 70.5% average accuracy, exceeding GRPO by 1.4 pp and HRPO by 1.8 pp, matching or surpassing 7B-scale few-shot systems. On the MATH dataset, M3PO (3B) achieved 60.7% versus 49.8% for a 7B CoT baseline (+10.9 pp).

Ablations reveal that latent reasoning variants such as "Hidden States" collapse to zero reward and that "Soft Thinking" converges slowly; HRPO improves upon GRPO but remains inferior to M3PO. Replacing the fusion strategy with a uniform peer mean, or eliminating fusion entirely ($\lambda = 0$), degrades both convergence speed and maximum reward. Performance is optimal for $\lambda \approx 0.1$ and temperature $T \approx 0.1$; larger values of either parameter diminish collaborative selectivity and final accuracy.

Qualitatively, M3PO reasoning chains are shown to be logically coherent and free of formatting noise or looping degeneracy, in contrast to other collaborative or latent reasoning frameworks (Lv et al., 1 Dec 2025).

7. Significance and Implications

M3PO demonstrates that structured, on-policy, multi-path collaboration provides an effective inductive bias for enabling LLMs to internalize robust, generalizable reasoning strategies. By viewing each rollout as an independent hypothesis and systematically integrating peer feedback based on distributional similarity and normalized collective reward, M3PO cultivates reliable multi-step reasoning without increasing model parameter count or incurring inference inefficiency.

This framework establishes a new paradigm for collaborative reasoning in autoregressive models, illustrating that cross-path information exchange can overcome the limitations of deterministic or purely "soft" decoding methods and achieve high accuracy on both knowledge-intensive and formal reasoning benchmarks (Lv et al., 1 Dec 2025).

