CAMPO: Context-Aware Multi-Stage Policy Optimization

Updated 11 June 2026
  • CAMPO is a reinforcement learning framework that integrates sequential curriculum, context-sensitive rewards, and multi-stage policy updates to optimize complex tasks.
  • It employs length-progressive training and adaptive, context-dependent penalties to handle high-variance, multi-phase challenges in tasks like mathematical reasoning.
  • Empirical results demonstrate that CAMPO improves token efficiency and accuracy across diverse applications, including open-source language models and multi-modal tasks.

Context-Aware Multi-Stage Policy Optimization (CAMPO) is a reinforcement learning (RL) methodology that integrates sequential curriculum, context-sensitive reward shaping, and multi-stage policy update schedules to optimize complex, context-dependent tasks. The approach is characterized by explicit multi-phase training, context-dependent regularization or penalties, and advanced group-normalized policy optimization techniques. CAMPO was introduced and formalized in the context of mathematical reasoning for open-source reasoning LLMs by the MiroMind-M1 project (Li et al., 19 Jul 2025), and closely related multi-stage or context-aware RL approaches are documented in question answering (Zhu et al., 16 Dec 2025), in-context visual localization (Karim et al., 29 May 2026), and multi-agent architectures (Liu et al., 9 Jan 2026).

1. Core Objective and Theoretical Foundations

CAMPO generalizes standard actor-critic and policy-gradient RL objectives such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) to address high-variance settings and complex reward structures. The canonical CAMPO objective incorporates:

  • Multi-stage length-aware clipping schedules: At each stage $s$, distinct clipping bounds $(\varphi_{\text{low}}(s), \varphi_{\text{high}}(s))$ control the per-token surrogate policy ratio, decoupling gradient updates across context-length regimes.
  • Per-token normalization: The sum over all sampled output tokens, $\sum_{i}|o_i|$, is used in the denominator to eliminate bias towards longer or shorter generations; this ensures balanced credit assignment across outputs of varying lengths.
  • Adaptive, context-dependent penalties: Distinct from generic entropy or KL penalties, CAMPO integrates penalties derived from sequence properties (e.g., repetition) or task-specific failures.
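The effect of per-token normalization can be illustrated with a small sketch. It compares the weight a single token receives when each output's loss is averaged separately and then averaged over the group (sample-level) with the weight it receives when one denominator $\sum_i |o_i|$ is shared by all tokens (token-level). The function names and the example lengths are illustrative assumptions, not taken from the source.

```python
def token_weights_sample_level(lengths):
    """Weight of one token in each output when every output is
    averaged on its own and the group is then averaged."""
    group_size = len(lengths)
    return [1.0 / (group_size * n) for n in lengths]


def token_weights_token_level(lengths):
    """Weight of one token in each output when all tokens share
    the single denominator sum_i |o_i|."""
    total = sum(lengths)
    return [1.0 / total for _ in lengths]


lengths = [2, 8]  # hypothetical output lengths in one sampled group
print(token_weights_sample_level(lengths))  # tokens of the short output weigh more
print(token_weights_token_level(lengths))   # every token weighs the same
```

Under sample-level averaging a token in the 2-token output carries four times the weight of a token in the 8-token output; with the shared denominator the weights are equal.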

The stage-specific surrogate objective is:

$$J_{\text{CAMPO}}(\pi_\theta) = \mathbb{E}_{(q, a), \{o_i\} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{\sum_i |o_i|} \sum_i \sum_{t=1}^{|o_i|} \min\left( r_{i,t}(\theta) \cdot \widehat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta), 1 - \varphi_{\text{low}}, 1 + \varphi_{\text{high}}\big) \cdot \widehat{A}_{i,t} \right) \right]$$

with group-normalized advantage with adaptive penalty:

$$\widehat{A}_{i,t} = \frac{[r(o_i,a) - f(o_i)] - \mu_r}{\sigma_r}$$

Here, $f(o_i)$ is a context-induced penalty (e.g., triggered by the first detected repetition loop in the output), $\mu_r$ and $\sigma_r$ denote the mean and standard deviation across the batch, and $r_{i,t}(\theta)$ is the likelihood ratio for the $t$-th token of $o_i$ under the new vs. old policy. Only "partially solved" examples, for which the sampled outputs are neither all correct nor all incorrect, are retained to focus optimization on ambiguous regimes (Li et al., 19 Jul 2025).
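The objective and advantage above can be written out as a minimal pure-Python sketch. Several details are assumptions made for illustration: the statistics $\mu_r$ and $\sigma_r$ are taken over the penalty-adjusted rewards of one sampled group, a small `eps` guards against division by zero, and the clip bounds and log-probabilities are made-up values. It is not the authors' implementation.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, penalties, eps=1e-6):
    """((r - f) - mu) / sigma, one scalar per sampled output.
    Assumption: mu and sigma are computed on the penalty-adjusted rewards."""
    shaped = [r - f for r, f in zip(rewards, penalties)]
    mu = mean(shaped)
    sigma = pstdev(shaped)
    return [(x - mu) / (sigma + eps) for x in shaped]


def campo_surrogate(logp_new, logp_old, advantages, phi_low, phi_high):
    """Clipped surrogate normalized by the total token count.
    logp_new / logp_old: one list of per-token log-probs per output.
    advantages: one scalar per output, shared by all of its tokens."""
    total_tokens = sum(len(seq) for seq in logp_new)
    acc = 0.0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)  # r_{i,t}(theta)
            clipped = min(max(ratio, 1.0 - phi_low), 1.0 + phi_high)
            acc += min(ratio * adv, clipped * adv)
    return acc / total_tokens


# Hypothetical group: one correct and one incorrect output, no penalties.
adv = group_advantages(rewards=[1.0, 0.0], penalties=[0.0, 0.0])
logp = [[-0.5, -0.7], [-0.2, -0.3, -0.4, -0.6]]
print(campo_surrogate(logp, logp, adv, phi_low=0.2, phi_high=0.3))
```

When the new and old policies coincide, every ratio equals 1 and the value reduces to the length-weighted mean of the advantages, which makes the role of the shared denominator easy to check by hand.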

2. Length-Progressive Multi-Stage Training

CAMPO implements a structured curriculum by progressively increasing the maximum allowed generation length across successive training stages. For each curriculum stage:

  • Rollout length cap: a stage-specific maximum length is enforced; any sampled sequence exceeding it is automatically assigned zero reward, treating verbosity as a failure.
  • Advancement criteria: Early-stage progression is either step-count-based (advance after a fixed number of steps) or conditioned on the average model output length saturating on a held-out set.
  • Benefits: Early, short-length training stages yield faster rollouts, reduce GPU idle time, and prioritize efficient, concise reasoning patterns. Later stages enable the model to generalize to arbitrarily long, multistep solutions, crucial for higher-order reasoning (Li et al., 19 Jul 2025).

Curriculum progression schedules in MiroMind-M1, for instance, use two stages for 7B models and three for 32B, empirically accelerating convergence and improving short-budget accuracy.
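The length cap and the two advancement criteria described in this section can be sketched as follows. The `Stage` fields, the numeric values, and in particular the saturation test (relative change in held-out mean length below a tolerance over a short window) are hypothetical choices for illustration; the source does not specify how saturation is measured.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    max_len: int    # rollout length cap for this stage (hypothetical value)
    max_steps: int  # step budget before advancing (hypothetical value)


def capped_reward(is_correct, num_tokens, max_len):
    """Binary correctness reward; any output over the cap gets zero."""
    if num_tokens > max_len:
        return 0.0
    return 1.0 if is_correct else 0.0


def should_advance(step, stage, mean_len_history, window=3, tol=0.01):
    """Advance when the step budget is spent, or when the held-out mean
    output length has stopped changing (assumed saturation test)."""
    if step >= stage.max_steps:
        return True
    if len(mean_len_history) < window + 1:
        return False
    recent = mean_len_history[-(window + 1):]
    growth = (recent[-1] - recent[0]) / max(recent[0], 1e-9)
    return abs(growth) < tol


stage = Stage(max_len=1000, max_steps=100)
print(capped_reward(True, 1200, stage.max_len))          # correct but too long
print(should_advance(5, stage, [500, 501, 500, 502]))    # lengths have plateaued
```

Treating over-length outputs as failures keeps the reward binary, so the same advantage computation applies unchanged in every stage.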

3. Adaptive Repetition and Degeneracy Penalties

To prevent pathological behaviors such as infinite loops or excessive output repetition, a known failure mode in long-form sequence modeling, CAMPO introduces an explicit, context-sensitive penalty $f(o_i)$, defined in terms of the total length $|o_i|$ of output $o_i$ (Li et al., 19 Jul 2025). Upon detecting the earliest repeating sub-sequence inside the generation, $f(o_i)$ is computed and subtracted from the binary correctness reward prior to normalization.
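One way to picture this step is the sketch below, which locates the first repeated n-gram in a token sequence and turns its position into a penalty. Both the detector (exact n-gram match with a fixed `n`) and the penalty shape (the fraction of the output that follows the start of the repetition) are stand-ins chosen for illustration; they are not the formula used in MiroMind-M1.

```python
def first_repetition_index(tokens, n=4):
    """Start index of the first n-gram that already occurred earlier
    in the sequence, or None if no n-gram repeats."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return i
        seen.add(gram)
    return None


def repetition_penalty(tokens, n=4):
    """Hypothetical penalty in [0, 1]: share of the output that lies at or
    after the first repetition, so earlier loops are penalized more."""
    idx = first_repetition_index(tokens, n)
    if idx is None:
        return 0.0
    return (len(tokens) - idx) / len(tokens)


tokens = list("abcdxyabcdzz")            # "abcd" reappears at index 6
reward = 1.0                             # binary correctness reward
shaped = reward - repetition_penalty(tokens)
print(first_repetition_index(tokens), shaped)
```

The shaped value is what would enter the group normalization in place of the raw reward.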

Ablations on MATH and AIME tasks demonstrate that omitting this term leads to abrupt training divergence, whereas its inclusion yields stable and efficient learning curves, especially in the absence of KL regularization (Li et al., 19 Jul 2025).

4. Implementation and Algorithmic Structure

The end-to-end CAMPO pipeline comprises:

  1. Supervised Fine-Tuning (SFT): The base model is first trained on large, human-verified datasets with chain-of-thought (CoT) trajectories.
    • Hyperparameters: Peak learning rate with cosine decay, batch size 128, 3 epochs on verified math CoT pairs, no-packing preferred.
  2. Multi-stage Reinforcement Learning:
    • For each stage $s$:
      • Cap rollout length at the stage's maximum generation length.
      • Set decoupled clip bounds $(\varphi_{\text{low}}(s), \varphi_{\text{high}}(s))$.
      • Sample batches, generate a group of outputs per instance, filter for partially solved examples.
      • Compute per-token clipped surrogates and group-normalized advantages with penalty.
      • Inner PPO-style optimization over the mini-batches.
    • Hyperparameters: Constant RL learning rate, multiple rollouts per sample, batch size 32, temperature 1.0, no KL penalty.
  3. Pseudocode: The complete update algorithm and batch filtering protocol is formalized (see (Li et al., 19 Jul 2025), Algorithm 1).
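The batch-filtering step in the RL loop can be sketched in a few lines. The dictionary layout with `prompt` and `rewards` keys is a hypothetical data format, and the sketch assumes binary rewards; the actual protocol is the one given in Algorithm 1 of the cited paper.

```python
def is_partially_solved(rewards):
    """True when the sampled group mixes correct and incorrect outputs."""
    num_correct = sum(1 for r in rewards if r > 0)
    return 0 < num_correct < len(rewards)


def filter_batch(batch):
    """Keep only prompts whose group is neither all-correct nor all-incorrect.
    batch: list of dicts with a 'rewards' list (hypothetical layout)."""
    return [ex for ex in batch if is_partially_solved(ex["rewards"])]


batch = [
    {"prompt": "q1", "rewards": [1, 1, 1, 1]},  # all correct: dropped
    {"prompt": "q2", "rewards": [0, 0, 0, 0]},  # all incorrect: dropped
    {"prompt": "q3", "rewards": [1, 0, 0, 1]},  # mixed: kept
]
print([ex["prompt"] for ex in filter_batch(batch)])
```

With binary rewards and no penalty, a group whose outputs all share the same reward has zero variance, so its group-normalized advantages vanish and it contributes no gradient; filtering such groups avoids spending updates on them.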

5. Empirical Benchmarking and Token Efficiency

CAMPO-trained models (MiroMind-M1-RL-7B/32B) achieve state-of-the-art or highly competitive accuracy and improved token efficiency among Qwen-2.5-based open-source reasoning LLMs:

| Model | AIME24 | AIME25 | MATH500 |
|---|---|---|---|
| MiroMind-SFT-7B | 60.4 | 45.0 | 94.6 |
| MiroMind-M1-RL-7B | 73.4 | 57.8 | 96.7 |
| DeepSeek-R1-Qwen-32B | 70.8 | 52.1 | 95.8 |
| Skywork-OR1-32B-Prev | 77.1 | 68.2 | 97.5 |
| MiroMind-M1-RL-32B | 77.5 | 65.6 | 96.4 |
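As a quick arithmetic check on the 7B rows listed above, the following sketch computes the per-benchmark difference between the RL model and the SFT model. It assumes, as the pipeline description suggests, that MiroMind-SFT-7B is the starting point for MiroMind-M1-RL-7B; the helper name `rl_gain` is illustrative.

```python
# Scores copied from the 7B rows of the benchmark listing.
scores = {
    "MiroMind-SFT-7B": {"AIME24": 60.4, "AIME25": 45.0, "MATH500": 94.6},
    "MiroMind-M1-RL-7B": {"AIME24": 73.4, "AIME25": 57.8, "MATH500": 96.7},
}


def rl_gain(before, after):
    """Per-benchmark difference in points, rounded to one decimal place."""
    return {bench: round(after[bench] - before[bench], 1) for bench in before}


gains = rl_gain(scores["MiroMind-SFT-7B"], scores["MiroMind-M1-RL-7B"])
print(gains)
```

The differences come to 13.0 points on AIME24, 12.8 on AIME25, and 2.1 on MATH500.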

Token efficiency metrics establish that, at identical rollout caps, CAMPO-optimized models produce shorter outputs with equal or better accuracy compared to leading alternatives (Li et al., 19 Jul 2025).

6. Generalization Across Modalities and Tasks

The essential CAMPO paradigm, namely staged context progression, context-sensitive penalties, and adaptive group-advantage policy optimization, generalizes across domains:

  • Long-context QA: In "Context-Picker," a two-stage curriculum (recall-oriented followed by precision-oriented) applies staged redundancy penalties and uses GRPO for stable RL; this delivers superior evidence selection and answer accuracy across five benchmarks (Zhu et al., 16 Dec 2025).
  • Vision-Language Localization: The "FOCUS" approach introduces a first-stage attention-constrained objective to ensure visual grounding, followed by RL refinement (GRPO) on IoU-based rewards, enforcing context-sensitive localization that outperforms pure scaling (Karim et al., 29 May 2026).
  • Multi-agent Multi-hop QA: In "PRISMA," two-stage GRPO first specializes planning and solving, then a context-aware inspector agent applies observation-augmented residual policy optimization (OARPO) to perform targeted recovery, achieving modularity and state-of-the-art generalization (Liu et al., 9 Jan 2026).

A plausible implication is that the distinctive combination of staged objectives and context-aware regularization in CAMPO drives both robustness and efficiency across highly structured generative or reasoning tasks.

7. Ablations, Limitations, and Open Directions

Ablations in MiroMind-M1 and related systems show:

  • No-packing vs. packing: No-packing yields an accuracy increase in SFT.
  • Stage curriculum: Multi-stage scheduling trains faster and marginally outperforms single-stage models for short budget settings.
  • Repetition penalty: Omitting it causes training to diverge; including it is necessary for stability.
  • Verifier quality: Cascade verifiers reduce noisy reward flips and yield more concise, correct completions.

Current limitations include reliance on consistent, high-quality verifiers, susceptibility to credit assignment issues for highly ambiguous cases despite group normalization, and the need to hand-craft context-sensitive penalties for each modality. Open research problems include formal analysis of convergence under non-stationary stage transitions and extension to unbounded curriculum or dynamic stage adaptation (Li et al., 19 Jul 2025).

CAMPO thus represents a modular, extensible framework for robust, efficient, and context-sensitive policy optimization at scale.
