CAMPO: Context-Aware Multi-Stage Policy Optimization

Updated 11 June 2026

CAMPO is a reinforcement learning framework that integrates sequential curriculum, context-sensitive rewards, and multi-stage policy updates to optimize complex tasks.
It employs length-progressive training and adaptive, context-dependent penalties to handle high-variance, multi-phase challenges in tasks like mathematical reasoning.
Empirical results demonstrate that CAMPO improves token efficiency and accuracy across diverse applications, including open-source language models and multi-modal tasks.

Context-Aware Multi-Stage Policy Optimization (CAMPO) is a reinforcement learning (RL) methodology that integrates sequential curriculum, context-sensitive reward shaping, and multi-stage policy update schedules to optimize complex, context-dependent tasks. The approach is characterized by explicit multi-phase training, context-dependent regularization or penalties, and advanced group-normalized policy optimization techniques. CAMPO was introduced and formalized in the context of mathematical reasoning for open-source reasoning LLMs by the MiroMind-M1 project (Li et al., 19 Jul 2025), and closely related multi-stage or context-aware RL approaches are documented in question answering (Zhu et al., 16 Dec 2025), in-context visual localization (Karim et al., 29 May 2026), and multi-agent architectures (Liu et al., 9 Jan 2026).

1. Core Objective and Theoretical Foundations

CAMPO generalizes standard actor-critic and policy-gradient RL objectives such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) to address high-variance settings and complex reward structures. The canonical CAMPO objective incorporates:

Multi-stage length-aware clipping schedules: At each stage $s$, distinct clipping bounds $(\varphi_{\text{low}}(s), \varphi_{\text{high}}(s))$ control the per-token surrogate policy ratio, decoupling gradient updates per context-length regime.
Per-token normalization: The conventional sum over all sampled output tokens $\sum_{i}|o_i|$ is used in the denominator to eliminate bias towards longer/shorter generations; this ensures balanced credit assignment across outputs of varying lengths.
Adaptive, context-dependent penalties: Distinct from generic entropy or KL penalties, CAMPO integrates penalties derived from sequence properties (e.g., repetition) or task-specific failures.

The stage-specific surrogate objective is:

$J_{\text{CAMPO}}(\pi_\theta) = \mathbb{E}_{(q, a), \{o_i\} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{\sum_i |o_i|} \sum_i \sum_{t=1}^{|o_i|} \min\left( r_{i,t}(\theta) \cdot \widehat{A}_{i,t}, \operatorname{clip}\big(r_{i,t}(\theta), 1 - \varphi_{\text{low}}, 1 + \varphi_{\text{high}}\big) \cdot \widehat{A}_{i,t} \right) \right]$

with the group-normalized, penalty-adjusted advantage:

$\widehat{A}_{i,t} = \frac{[r(o_i,a) - f(o_i)] - \mu_r}{\sigma_r}$

Here, $f(o_i)$ is a context-induced penalty (e.g., triggered by the first detected repetition loop in the output), $\mu_r$ and $\sigma_r$ denote the mean and standard deviation of the penalized rewards across the batch, and $r_{i,t}(\theta)$ is the likelihood ratio for the $t$-th token of $o_i$ under the new versus the old policy. Only "partially solved" examples, those whose sampled outputs are neither all correct nor all incorrect, are retained to focus optimization on ambiguous regimes (Li et al., 19 Jul 2025).
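The objective and advantage above can be sketched numerically as follows. This is a minimal NumPy illustration, not the MiroMind-M1 implementation: the function names, the `eps` stabilizer, and the clip values used in the usage lines are assumptions made for the example, and statistics are taken over whatever set of rewards is passed in.

```python
import numpy as np

def group_advantages(rewards, penalties, eps=1e-8):
    """Normalized advantage ((r - f) - mean) / std, one scalar per output.

    eps is an assumed numerical stabilizer, not part of the stated formula.
    """
    shaped = np.asarray(rewards, dtype=float) - np.asarray(penalties, dtype=float)
    return (shaped - shaped.mean()) / (shaped.std() + eps)

def campo_surrogate(logp_new, logp_old, advantages, phi_low, phi_high):
    """Token-level clipped surrogate divided by the total token count.

    logp_new / logp_old: lists of 1-D arrays holding per-token
    log-probabilities of each sampled output under the new / old policy.
    advantages: one scalar per output, shared by all of its tokens.
    """
    total_tokens = sum(len(lp) for lp in logp_new)
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = np.exp(np.asarray(lp_new) - np.asarray(lp_old))
        clipped = np.clip(ratio, 1.0 - phi_low, 1.0 + phi_high)
        total += np.minimum(ratio * adv, clipped * adv).sum()
    return total / total_tokens

# Usage: two outputs of different length, one correct and one incorrect.
adv = group_advantages(rewards=[1.0, 0.0], penalties=[0.0, 0.0])
logp = [np.zeros(2), np.zeros(3)]  # identical policies, so every ratio is 1
value = campo_surrogate(logp, logp, adv, phi_low=0.2, phi_high=0.28)
```

Dividing by the total token count, rather than averaging per output first, gives every token the same weight regardless of which output it belongs to.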

2. Length-Progressive Multi-Stage Training

CAMPO implements a structured curriculum by progressively increasing the maximum allowed generation length across successive training stages. For each curriculum stage:

Rollout length cap: A stage-specific maximum length is enforced; any sampled sequence exceeding it is automatically assigned zero reward, treating verbosity as a failure.
Advancement criteria: Early-stage progression is either step-count-based (advancing after a fixed number of training steps) or conditioned on saturation of the average model output length on a held-out set.
Benefits: Early, short-length training stages yield faster rollouts, reduce GPU idle time, and prioritize efficient, concise reasoning patterns. Later stages enable the model to generalize to arbitrarily long, multistep solutions, crucial for higher-order reasoning (Li et al., 19 Jul 2025).

Curriculum progression schedules in MiroMind-M1, for instance, use two stages for 7B models and three for 32B, empirically accelerating convergence and improving short-budget accuracy.
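The length cap and the two advancement criteria can be illustrated with a short sketch. Everything named here is hypothetical: the `Stage` fields, the `window` and `tol` parameters, and in particular the reading of "saturation" as a plateau in held-out mean output length are assumptions for the example rather than details confirmed by the source.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    max_len: int    # rollout length cap for this stage, in tokens
    max_steps: int  # step budget after which the stage always ends

def capped_reward(correct: bool, num_tokens: int, max_len: int) -> float:
    """Binary correctness reward; rollouts over the cap receive zero."""
    if num_tokens > max_len:
        return 0.0
    return 1.0 if correct else 0.0

def should_advance(step, stage, mean_lengths, window=3, tol=0.01):
    """Advance on the step budget, or when held-out mean length plateaus.

    Plateau test (assumed): the last `window` held-out means differ by at
    most `tol * max_len` tokens.
    """
    if step >= stage.max_steps:
        return True
    if len(mean_lengths) < window:
        return False
    recent = mean_lengths[-window:]
    return (max(recent) - min(recent)) <= tol * stage.max_len

# Usage: a correct but over-length rollout earns nothing at this stage.
stage = Stage(max_len=1000, max_steps=50)
reward = capped_reward(correct=True, num_tokens=1200, max_len=stage.max_len)
advance = should_advance(step=10, stage=stage, mean_lengths=[500, 503, 498])
```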

3. Adaptive Repetition and Degeneracy Penalties

To prevent pathological behaviors such as infinite loops or excessive output repetition, a known failure mode in long-form sequence modeling, CAMPO introduces an explicit, context-sensitive penalty $f(o_i)$. The penalty is expressed in terms of the total length $|o_i|$ of $o_i$; the exact expression is given in (Li et al., 19 Jul 2025). Upon detecting the earliest repeating sub-sequence inside the generation, $f(o_i)$ is computed and subtracted from the binary correctness reward prior to normalization.

Ablations on MATH and AIME tasks demonstrate that omitting this term leads to abrupt training divergence, whereas its inclusion yields stable and efficient learning curves, especially in the absence of KL regularization (Li et al., 19 Jul 2025).
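As an illustration of the detection step, the sketch below finds the earliest repeated n-gram in a token sequence and turns it into a penalty. The n-gram criterion, the window size `n`, and the penalty form (the fraction of the output lying at or after the first repeat) are stand-ins chosen for the example; they are not the penalty defined in the MiroMind-M1 paper.

```python
def first_repetition_index(tokens, n=4):
    """Start index of the first n-gram that duplicates an earlier one, else None."""
    seen = set()
    for start in range(len(tokens) - n + 1):
        gram = tuple(tokens[start:start + n])
        if gram in seen:
            return start
        seen.add(gram)
    return None

def repetition_penalty(tokens, n=4):
    """Hypothetical penalty in [0, 1]: share of tokens from the first repeat onward."""
    idx = first_repetition_index(tokens, n)
    if idx is None:
        return 0.0
    return (len(tokens) - idx) / len(tokens)

def shaped_reward(correct, tokens, n=4):
    """Binary correctness reward minus the penalty, computed before normalization."""
    return (1.0 if correct else 0.0) - repetition_penalty(tokens, n)

# Usage: the 4-gram (1, 2, 3, 4) reappears starting at index 5.
looping = [1, 2, 3, 4, 5, 1, 2, 3, 4, 9]
penalty = repetition_penalty(looping)
```

Under this stand-in, a loop that starts earlier costs more than one that starts near the end, so a correct answer followed by a long loop still scores below a clean correct answer.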

4. Implementation and Algorithmic Structure

The end-to-end CAMPO pipeline comprises:

Supervised Fine-Tuning (SFT): The base model is first trained on large, human-verified datasets with chain-of-thought (CoT) trajectories.
- Hyperparameters: Cosine-decayed learning rate schedule, batch size 128, 3 epochs on verified math CoT pairs, no-packing preferred.
Multi-stage Reinforcement Learning:
- For each stage $s$:
  - Cap rollout length at the stage's maximum generation length.
  - Set decoupled clip bounds $(\varphi_{\text{low}}(s), \varphi_{\text{high}}(s))$.
  - Sample batches, generate multiple outputs per instance, and filter for partially solved examples.
  - Compute per-token clipped surrogates and group-normalized advantages with penalty.
  - Run inner PPO-style optimization epochs on each mini-batch.
- Hyperparameters: Constant RL learning rate, multiple rollouts per sample, batch size 32, temperature 1.0, no KL penalty.
Pseudocode: The complete update algorithm and batch filtering protocol are formalized in Algorithm 1 of (Li et al., 19 Jul 2025).
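The control flow of the RL phase, including the partially-solved filter, can be sketched as below. This is an outline under stated assumptions, not Algorithm 1 from the paper: `sample_group` and `update_policy` are hypothetical callables supplied by the caller, and the stage dictionary keys (`max_len`, `phi_low`, `phi_high`) are names invented for the example.

```python
def is_partially_solved(correct_flags):
    """True when a group contains both correct and incorrect outputs."""
    return any(correct_flags) and not all(correct_flags)

def filter_groups(groups):
    """Drop prompts whose sampled outputs are all correct or all incorrect."""
    return [g for g in groups if is_partially_solved(g["correct"])]

def run_stages(stages, sample_group, update_policy, prompts, steps_per_stage):
    """Outer loop: one pass per curriculum stage, filtering before each update.

    sample_group(prompt, max_len) -> dict with a "correct" list of booleans.
    update_policy(batch, phi_low, phi_high) performs the clipped update.
    Returns the number of retained groups at every step, for inspection.
    """
    kept_counts = []
    for stage in stages:
        for _ in range(steps_per_stage):
            groups = [sample_group(p, stage["max_len"]) for p in prompts]
            batch = filter_groups(groups)
            if batch:  # skip the update when nothing informative remains
                update_policy(batch, stage["phi_low"], stage["phi_high"])
            kept_counts.append(len(batch))
    return kept_counts
```

When all outputs in a group receive the same reward, the normalized advantage carries no signal for that prompt, which is one way to see why such groups are filtered out.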

5. Empirical Benchmarking and Token Efficiency

CAMPO-trained models (MiroMind-M1-RL-7B/32B) achieve state-of-the-art or highly competitive accuracy and improved token efficiency among open-source reasoning LLMs based on Qwen-2.5:

Model	AIME24	AIME25	MATH500
MiroMind-SFT-7B	60.4	45.0	94.6
MiroMind-M1-RL-7B	73.4	57.8	96.7
DeepSeek-R1-Qwen-32B	70.8	52.1	95.8
Skywork-OR1-32B-Prev	77.1	68.2	97.5
MiroMind-M1-RL-32B	77.5	65.6	96.4

Token efficiency metrics establish that, at identical rollout caps, CAMPO-optimized models produce shorter outputs with equal or better accuracy compared to leading alternatives (Li et al., 19 Jul 2025).

6. Generalization Across Modalities and Tasks

The essential CAMPO paradigm, namely staged context progression, context-sensitive penalties, and adaptive group-advantage policy optimization, generalizes across domains:

Long-context QA: In "Context-Picker," a two-stage curriculum (recall-oriented followed by precision-oriented) applies staged redundancy penalties and uses GRPO for stable RL; this delivers superior evidence selection and answer accuracy across five benchmarks (Zhu et al., 16 Dec 2025).
Vision-Language Localization: The "FOCUS" approach introduces a first-stage attention-constrained objective to ensure visual grounding, followed by RL refinement (GRPO) on IoU-based rewards, enforcing context-sensitive localization that outperforms pure scaling (Karim et al., 29 May 2026).
Multi-agent Multi-hop QA: In "PRISMA," two-stage GRPO first specializes planning and solving, then a context-aware inspector agent applies observation-augmented residual policy optimization (OARPO) to perform targeted recovery, achieving modularity and state-of-the-art generalization (Liu et al., 9 Jan 2026).

A plausible implication is that the distinctive combination of staged objectives and context-aware regularization in CAMPO drives both robustness and efficiency across highly structured generative or reasoning tasks.

7. Ablations, Limitations, and Open Directions

Ablations in MiroMind-M1 and related systems show:

No-packing vs. packing: No-packing yields higher accuracy than packing in SFT.
Stage curriculum: Multi-stage scheduling trains faster and marginally outperforms single-stage models for short budget settings.
Repetition penalty: Omitting it leads to convergence failure; its inclusion is necessary for stability.
Verifier quality: Cascade verifiers reduce noisy reward flips and yield more concise, correct completions.

Current limitations include reliance on consistent, high-quality verifiers, susceptibility to credit assignment issues for highly ambiguous cases despite group normalization, and the need to hand-craft context-sensitive penalties for each modality. Open research problems include formal analysis of convergence under non-stationary stage transitions and extension to unbounded curriculum or dynamic stage adaptation (Li et al., 19 Jul 2025).

CAMPO thus represents a modular, extensible framework for robust, efficient, and context-sensitive policy optimization at scale.