Conditional Listwise Policy Optimization
- Conditional Listwise Policy Optimization (CLPO) is a reinforcement learning objective that separates decision selection from rationale generation to enable clear credit assignment.
- It combines policy gradients for discrete choice with a listwise ranking loss for justifications, ensuring robust performance in complex reward settings.
- Empirical results show that CLPO improves performance on tasks like GSM8K and AMC in multi-agent LLM systems, demonstrating enhanced alignment and convergence.
Conditional Listwise Policy Optimization (CLPO) is a reinforcement learning objective designed to align the policy of LLMs and multi-agent LLM systems with complex reward structures, particularly where model decisions and justifications must be separated for robust credit assignment. CLPO is notably introduced as the synthesis phase optimization in the Maestro multi-agent collaboration framework and has also been instantiated for LLM alignment in the LiPO framework. Central to CLPO is its decoupling of the strategic selection (“which output to choose”) from the tactical rationale generation (“why this output is preferable”), accomplished via a combination of discrete decision-focused policy gradients and listwise learning-to-rank losses for rationales.
1. Formal Definition and Mathematical Structure
CLPO for multi-agent LLM evaluation, as in Maestro (Yang et al., 8 Nov 2025), is formulated over slates of candidate solutions. Let $s^i = (q^i, C^i)$, where $q^i$ is a prompt and $C^i = \{c^i_1, \dots, c^i_m\}$ is a candidate slate. For each candidate $c^i_k$:
- $k$: index of the chosen candidate.
- $\pi_\theta(k \mid s^i)$: policy for selection.
- $y_{k,1:T_k}$: generated tokens for the justification of candidate $k$.
- $r^i_k$: sequence-level reward for $c^i_k$.
- $\bar{r}^i = \frac{1}{m}\sum_{k=1}^{m} r^i_k$: slate mean reward.
- $A^i_k = r^i_k - \bar{r}^i$: advantage for $c^i_k$.
- $s^i_k = \frac{1}{T_k}\sum_{\tau=1}^{T_k} \log \pi_\theta(y_{k,\tau} \mid y_{k,1:\tau-1}, s^i)$: length-normalized log-probability of the justification for $c^i_k$.
The CLPO objective combines two principal losses plus regularizers:
- Strategic Decision Loss ($\mathcal{L}_{\text{choice}}$) — focuses the policy gradient on the discrete selection:
$$\mathcal{L}^i_{\text{choice}} = -\sum_{k=1}^{m} A^i_k \,\log \pi_\theta(k \mid s^i)$$
- Tactical Argumentation Loss ($\mathcal{L}_{\text{rank}}$) — enforces a listwise ranking of rationales using the Plackett–Luce loss:
$$\mathcal{L}^i_{\text{rank}} = -\sum_{j=1}^{m} \log \frac{\exp\!\big(s^i_{\sigma_j}\big)}{\sum_{l=j}^{m} \exp\!\big(s^i_{\sigma_l}\big)}$$
where $\sigma^i$ orders candidates by descending reward ($r^i_{\sigma_1} \ge r^i_{\sigma_2} \ge \cdots \ge r^i_{\sigma_m}$).
With KL regularization to a reference policy $\pi_{\text{ref}}$ (weight $\beta_{\text{KL}}$) and entropy regularization (weight $\beta_{\text{ent}}$), the total objective is:
$$\mathcal{L}_{\text{CLPO}} = \mathbb{E}_i\big[\mathcal{L}^i_{\text{choice}}\big] + \lambda_{\text{rank}}\,\mathbb{E}_i\big[\mathcal{L}^i_{\text{rank}}\big] + \beta_{\text{KL}}\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) - \beta_{\text{ent}}\, \mathcal{H}\big(\pi_\theta\big)$$
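The following is a minimal PyTorch sketch of this composite objective for a single slate, assuming per-candidate rewards, selection logits, and length-normalized justification log-probabilities are already computed. The function name `clpo_loss`, the default weights, and the choice to compute the KL and entropy terms on the selection distribution only are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def clpo_loss(choice_logits, ref_choice_logits, just_logprobs, rewards,
              lambda_rank=0.5, beta_kl=0.01, beta_ent=0.01):
    """Illustrative CLPO objective for one slate of m candidates.

    choice_logits     : (m,) selection logits of the current policy
    ref_choice_logits : (m,) selection logits of the frozen reference policy
    just_logprobs     : (m,) length-normalized log-probs s_k of each justification
    rewards           : (m,) sequence-level rewards r_k
    The default weights are placeholders, not published values.
    """
    # Within-slate baseline: A_k = r_k - mean(r)
    advantages = rewards - rewards.mean()

    # Strategic decision loss: policy gradient on the discrete choice
    log_pi = F.log_softmax(choice_logits, dim=-1)
    l_choice = -(advantages.detach() * log_pi).sum()

    # Tactical argumentation loss: Plackett-Luce over justifications,
    # with candidates ordered by descending reward
    order = torch.argsort(rewards, descending=True)
    s_sorted = just_logprobs[order]
    # log of the suffix denominators sum_{l >= j} exp(s_sigma_l)
    suffix_logsumexp = torch.logcumsumexp(s_sorted.flip(0), dim=0).flip(0)
    l_rank = -(s_sorted - suffix_logsumexp).sum()

    # Regularizers, here applied to the selection distribution
    pi = log_pi.exp()
    log_pi_ref = F.log_softmax(ref_choice_logits, dim=-1)
    kl = (pi * (log_pi - log_pi_ref)).sum()   # KL(pi_theta || pi_ref)
    entropy = -(pi * log_pi).sum()            # H(pi_theta)

    return l_choice + lambda_rank * l_rank + beta_kl * kl - beta_ent * entropy

# Toy usage on a slate of 4 candidates with hypothetical rewards
logits = torch.randn(4, requires_grad=True)
just_lp = torch.randn(4, requires_grad=True)
loss = clpo_loss(logits, torch.randn(4), just_lp,
                 torch.tensor([1.0, 0.0, 0.5, 0.0]))
loss.backward()
```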
2. Derivation and Theoretical Motivation
CLPO emerges from the decomposition of a canonical RL objective that entangles the correctness of answer selection ($\log \pi_\theta(k \mid s^i)$) and the comparative quality of the rationale ($\log \pi_\theta(y_k \mid k, s^i)$) in a single reward-weighted term:
$$\mathcal{L}_{\text{RL}} = -\,\mathbb{E}_i\Big[\sum_{k=1}^{m} r^i_k \,\log \pi_\theta(k, y_k \mid s^i)\Big], \qquad \log \pi_\theta(k, y_k \mid s^i) = \log \pi_\theta(k \mid s^i) + \log \pi_\theta(y_k \mid k, s^i)$$
The credit assignment issue is addressed by splitting this objective into two separate updates:
- The choice update uses a within-slate normalized advantage $A^i_k = r^i_k - \bar{r}^i$, focusing learning on relative, not absolute, correctness.
- The listwise ranking loss on justification log-probs ensures that rationales attached to better candidates are more likely under the model.
The Plackett–Luce loss factorizes the likelihood of the full ranking into sequential choices over ranked suffixes, making the supervision signal more robust for sequence-level outputs than pointwise or pairwise variants.
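As a concrete illustration, take a toy slate of $m = 3$ candidates whose rewards induce the ordering $\sigma = (2, 1, 3)$, and write $s_k$ for the justification scores defined above (slate index suppressed). The Plackett–Luce likelihood and the corresponding loss are:

$$P(\sigma \mid s) = \frac{e^{s_2}}{e^{s_1} + e^{s_2} + e^{s_3}} \cdot \frac{e^{s_1}}{e^{s_1} + e^{s_3}} \cdot \frac{e^{s_3}}{e^{s_3}}, \qquad \mathcal{L}_{\text{rank}} = -\log P(\sigma \mid s).$$

Minimizing the loss raises $s_2$ relative to $\{s_1, s_3\}$ and $s_1$ relative to $s_3$, which is exactly the ordering implied by the rewards.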
KL and entropy regularization prevent policy drift and mode collapse in rationale generation. Variance reduction is achieved via within-slate baselining.
A comparable formalism appears in LM alignment via LiPO (Liu et al., 2 Feb 2024), with the conditional listwise structure enforced through listwise rank losses (e.g., list-MLE, ListNet, or λ-weighted pairwise logistic) over model log-probs conditioned on a prompt.
3. Training Algorithm and Pseudocode
CLPO training involves generating candidate slates, computing rewards, and updating model parameters via the composite loss. The pseudocode below summarizes the Maestro CLPO update cycle:
```
for each q^i in batch:
    # Step 1: Candidate generation
    C^i = {c^i_1, ..., c^i_m}            # Execution Agents sample N·K candidates

    # Step 2: Candidate rewards and advantages
    for k in 1..m:
        r^i_k = reward(c^i_k)            # correctness + rationale heuristic
    mean_r = sum_k r^i_k / m
    A^i_k = r^i_k - mean_r

    # Step 3: Decision loss
    L_choice^i = -sum_k A^i_k * log pi_theta(k | s^i)

    # Step 4: Justification generation and ranking
    s^i_k = mean_tau log pi_theta(y_{k,tau} | y_{k,1:tau-1}, s^i)
    order sigma^i by descending r^i_k
    L_reason_rank^i = -sum_j log( exp(s^i_{sigma_j}) / sum_{l=j}^m exp(s^i_{sigma_l}) )

L_total = mean_i L_choice^i + lambda_rank * mean_i L_reason_rank^i + ...   # plus KL and entropy regularizers
update theta with Adam or a related optimizer
```
Hyperparameters such as batch size, optimizer, KL/entropy weights, LoRA rank, and backbone model selection follow the published guidelines (Section 4).
4. Implementation Parameters and Workflow
CLPO has been primarily deployed atop LLaMA and Qwen LLM backbones using parameter-efficient fine-tuning (LoRA: rank 16, alpha 32, dropout 0.05), executed at a batch size of 32 across 4×A100 (80GB) GPUs with mixed precision (bfloat16). Training uses the Adam optimizer with a cosine learning-rate schedule and gradient-norm clipping at 1.0. The weights on the ranking, KL, and entropy terms ($\lambda_{\text{rank}}$, $\beta_{\text{KL}}$, $\beta_{\text{ent}}$) follow the published defaults.
Candidate slates are produced by $N$ execution agents each generating $K$ samples (slate size $m = N \cdot K$) in the default configuration, with up to a fixed number of consensus rounds. Decoding for decision tokens is deterministic (temperature 0.0), while rationale generation uses nucleus sampling.
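A configuration sketch consistent with these published settings is shown below. It assumes the Hugging Face `peft` and `transformers` libraries; the nucleus-sampling probability, which is not reproduced in this summary, appears only as a clearly marked placeholder.

```python
import torch
from peft import LoraConfig
from transformers import GenerationConfig

# Parameter-efficient fine-tuning setup (LoRA values as published)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Training settings as reported: batch size 32, bfloat16, Adam with a
# cosine learning-rate schedule, gradient-norm clipping at 1.0.
train_config = {
    "per_device_batch_size": 32,    # run across 4x A100 80GB
    "dtype": torch.bfloat16,
    "optimizer": "adam",
    "lr_schedule": "cosine",
    "max_grad_norm": 1.0,
}

# Decoding: deterministic for the decision token, sampled for rationales.
decision_decoding = GenerationConfig(do_sample=False)   # temperature 0.0
rationale_decoding = GenerationConfig(
    do_sample=True,
    top_p=0.95,   # placeholder: the published nucleus-sampling p is not given here
)
```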
5. Empirical Results and Comparative Analysis
CLPO-equipped Maestro achieves notable improvements over prior multi-agent and single-agent LLM collaboration schemes. On LLaMA-8B, Maestro with CLPO attains 89.33% on GSM8K (vs. 88.67% with GRPO and 87.69% with SFT) and 28.52% on AMC (vs. 26.30% with GRPO). Across a suite of reasoning and factual tasks (including MATH, AIME, MMLU, and HumanEval), CLPO delivers average absolute gains of approximately 6%, and up to 10% in some backbone/benchmark settings. Substituting GRPO or SFT for CLPO yields consistent degradation of 1–2% absolute.
Ablation studies show that both $\mathcal{L}_{\text{choice}}$ and $\mathcal{L}_{\text{rank}}$ are essential. Omitting either sharply degrades final performance: rationale-only ranking damages decision accuracy, while decision-only learning yields weak explanations. The optimal ranking weight $\lambda_{\text{rank}}$ lies in [0.5, 0.8]; lower or higher values reduce accuracy and rationale discrimination. Zero-shot prompting on GPT-4o-mini also benefits, with GSM8K improving from 94.89% (AgentPrune) to 95.60%.
Table: Benchmark Results — Maestro + CLPO, Multi-Agent Baselines, and Ablation (Yang et al., 8 Nov 2025)
| System | GSM8K (%) | AMC (%) | HumanEval Pass@1 |
|---|---|---|---|
| Maestro + CLPO | 89.33 | 28.52 | see text |
| Maestro + GRPO (ablation) | 88.67 | 26.30 | — |
| Maestro + SFT (ablation) | 87.69 | — | — |
| Best prior (avg.) | — | — | ≈6% lower |
6. Relationship to Listwise LM Alignment Approaches
CLPO's listwise structure closely parallels the LiPO framework for LLM alignment (Liu et al., 2 Feb 2024). In LiPO, listwise ranks (normalized preference labels) are attached per prompt-response to induce a variety of ranking losses, including pairwise logistic (as in DPO_BT), pairwise hinge (SLiC_norm), list-MLE (Plackett–Luce), and a λ-weighted listwise loss that exploits both label magnitude and model-predicted rank. The list-MLE loss in LiPO,
$$\mathcal{L}_{\text{list-MLE}} = -\sum_{j=1}^{K} \log \frac{\exp(s_{\tau_j})}{\sum_{l=j}^{K} \exp(s_{\tau_l})},$$
where $\tau$ is the permutation that sorts the preference labels in descending order, is mathematically identical to the $\mathcal{L}_{\text{rank}}$ component of Maestro CLPO.
LiPO-λ, which weights pairwise losses by gain and rank-discount, yields consistent empirical improvements as list size increases, compared with DPO_PL or SLiC_norm. Notably, only the λ-weighted loss (as opposed to constant or partial weights) improves monotonically with larger list sizes and outperforms prior pairwise and pointwise alignment methods in both reward-model and human evaluations. Results on Reddit TL;DR and AnthropicHH confirm that list-MLE and other listwise losses enhance learning from comparative feedback.
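The sketch below illustrates a λ-weighted pairwise logistic loss of this kind. It assumes DCG-style gains $2^{\psi} - 1$ and $1/\log_2(1 + \text{rank})$ discounts computed from the model-predicted ranking, following the LambdaRank recipe the paper builds on; it should not be read as the exact LiPO-λ implementation.

```python
import torch
import torch.nn.functional as F

def lambda_weighted_pairwise_loss(scores, labels):
    """Illustrative lambda-weighted pairwise logistic loss for one list.

    scores : (K,) model scores (e.g., policy log-prob margins per response)
    labels : (K,) float listwise preference labels psi (higher = better)
    """
    # DCG-style gain per item and discount at the model-predicted rank
    gains = 2.0 ** labels - 1.0
    ranks = torch.argsort(torch.argsort(scores, descending=True)) + 1   # 1-based
    discounts = 1.0 / torch.log2(ranks.float() + 1.0)

    # Pairwise differences of scores and labels
    score_diff = scores.unsqueeze(1) - scores.unsqueeze(0)    # s_i - s_j
    label_diff = labels.unsqueeze(1) - labels.unsqueeze(0)    # psi_i - psi_j

    # Lambda weight: |delta gain| * |delta discount|, applied where i is preferred
    delta = (gains.unsqueeze(1) - gains.unsqueeze(0)).abs() * \
            (discounts.unsqueeze(1) - discounts.unsqueeze(0)).abs()
    preferred = (label_diff > 0).float()

    # Logistic (Bradley-Terry style) pairwise loss, weighted and averaged
    pair_loss = F.softplus(-score_diff)    # equals -log sigmoid(s_i - s_j)
    weighted = preferred * delta * pair_loss
    return weighted.sum() / preferred.sum().clamp(min=1.0)

# Toy usage: a list of 6 responses with graded preference labels
loss = lambda_weighted_pairwise_loss(
    torch.randn(6, requires_grad=True),
    torch.tensor([3.0, 2.0, 2.0, 1.0, 0.0, 0.0]),
)
```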
7. Strengths, Limitations, and Future Directions
CLPO offers a clean separation between decision and rationale supervision, leading to improved credit assignment and stable convergence even with large or diverse candidate pools. Listwise supervision reduces noise compared to traditional sequence-level RL, and empirical results indicate superiority across model sizes, tasks, and collaboration strategies.
However, CLPO targets only the convergence (“synthesis”) phase in Maestro; the exploration (“generation”) agents remain fixed, potentially leaving joint optimization gains unrealized. Computational cost grows with slate size (quadratically in $m$ for naive implementations of the listwise term in the worst case), constraining applicability to extensive candidate sets. The reward function $r^i_k$, which guides both selection and ranking, can itself be a source of bias or inadequacy if not carefully constructed.
Directions for future research include jointly optimizing both execution and synthesis agents with CLPO-style objectives, dynamically adjusting slate sizes, optimizing end-to-end across all roles, and broadening to open-ended or unstructured domains such as dialogue or planning. Improved reward elicitation and better handling of uncertainty in the ranking are also noted as promising enhancements.
CLPO thus constitutes a foundational method for robust, reliable collaborative reasoning and policy alignment in large-scale multi-agent and single-model LLM systems, offering theoretical clarity and demonstrated empirical utility.