
Conditional Listwise Policy Optimization

Updated 15 November 2025
  • Conditional Listwise Policy Optimization (CLPO) is a reinforcement learning objective that separates decision selection from rationale generation to enable clear credit assignment.
  • It combines policy gradients for discrete choice with a listwise ranking loss for justifications, ensuring robust performance in complex reward settings.
  • Empirical results show that CLPO improves performance on tasks like GSM8K and AMC in multi-agent LLM systems, demonstrating enhanced alignment and convergence.

Conditional Listwise Policy Optimization (CLPO) is a reinforcement learning objective designed to align the policy of LLMs and multi-agent LLM systems with complex reward structures, particularly where model decisions and justifications must be separated for robust credit assignment. CLPO is notably introduced as the synthesis phase optimization in the Maestro multi-agent collaboration framework and has also been instantiated for LLM alignment in the LiPO framework. Central to CLPO is its decoupling of the strategic selection (“which output to choose”) from the tactical rationale generation (“why this output is preferable”), accomplished via a combination of discrete decision-focused policy gradients and listwise learning-to-rank losses for rationales.

1. Formal Definition and Mathematical Structure

CLPO for multi-agent LLM evaluation, as in Maestro (Yang et al., 8 Nov 2025), is formulated over slates of candidate solutions. Let $s = (q, C)$, where $q$ is a prompt and $C = \{c_1, \ldots, c_m\}$ is a candidate slate. For each candidate $c_k$:

  • $a \in \{1, \ldots, m\}$: index of the chosen candidate.
  • $\pi_\theta(a \mid s)$: policy for selection.
  • $y_k = (y_{k,1}, \ldots, y_{k,L_k})$: generated tokens for the justification of $c_k$.
  • $r(c_k)$: sequence-level reward for $c_k$.
  • $\bar{A} = \frac{1}{m} \sum_{\ell=1}^m r(c_\ell)$: slate mean reward.
  • $A_k = r(c_k) - \bar{A}$: advantage for $c_k$.
  • $s_k = \frac{1}{L_k} \sum_{\tau=1}^{L_k} \log \pi_\theta(y_{k,\tau} \mid y_{k,1:\tau-1}, s)$: length-normalized log-probability of $y_k$.

The CLPO objective combines two principal losses plus regularizers:

  1. Strategic Decision Loss ($L_\text{choice}$) — focuses the policy gradient on the discrete selection:

$$L_\text{choice} = - \mathbb{E}_{s,C} \left[ \sum_{k=1}^m A_k \cdot \log \pi_\theta(k \mid s) \right]$$

  2. Tactical Argumentation Loss ($L_\text{reason\_rank}$) — enforces a listwise ranking of rationales via the Plackett–Luce loss:

$$L_\text{reason\_rank} = - \sum_{j=1}^m \log \frac{\exp(s_{\sigma_j})}{\sum_{\ell=j}^m \exp(s_{\sigma_\ell})}$$

where $\sigma$ orders candidates by descending reward ($r(c_{\sigma_1}) \geq \ldots \geq r(c_{\sigma_m})$).

With KL regularization ($L_\text{KL}$) to a reference policy and entropy regularization ($L_\text{ent}$), the total objective is:

$$L_\text{CLPO} = L_\text{choice} + \lambda_\text{rank} \cdot L_\text{reason\_rank} + \lambda_\text{kl} \cdot L_\text{KL} - \lambda_\text{ent} \cdot L_\text{ent}$$
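
For concreteness, the composite objective can be sketched in PyTorch as below, assuming that selection logits, per-candidate rewards, and length-normalized rationale log-probabilities have already been computed. The function name, tensor shapes, and the choice to apply the KL and entropy terms to the selection distribution (rather than to token-level generation) are illustrative assumptions, not the published implementation.

import torch
import torch.nn.functional as F

def clpo_loss(choice_logits, rewards, rationale_logprobs,
              ref_choice_logits=None, lam_rank=0.8, lam_kl=0.1, lam_ent=0.01):
    """Minimal sketch of the composite CLPO loss for a batch of slates.

    choice_logits:      (B, m) selection logits over each candidate slate
    rewards:            (B, m) sequence-level rewards r(c_k)
    rationale_logprobs: (B, m) length-normalized log-probs s_k of each justification
    ref_choice_logits:  (B, m) logits of a frozen reference policy (for the KL term)
    """
    log_pi = F.log_softmax(choice_logits, dim=-1)               # log pi_theta(k | s)

    # Strategic decision loss: within-slate advantage times selection log-prob.
    advantages = rewards - rewards.mean(dim=-1, keepdim=True)   # A_k = r(c_k) - slate mean
    loss_choice = -(advantages * log_pi).sum(dim=-1).mean()

    # Tactical argumentation loss: Plackett-Luce (list-MLE) over rationale log-probs,
    # with candidates ordered by descending reward.
    order = rewards.argsort(dim=-1, descending=True)
    s_sorted = rationale_logprobs.gather(-1, order)             # s_{sigma_1}, ..., s_{sigma_m}
    # log of suffix sums: log sum_{l>=j} exp(s_{sigma_l}), computed right-to-left.
    suffix_logsumexp = torch.logcumsumexp(s_sorted.flip(dims=[-1]), dim=-1).flip(dims=[-1])
    loss_rank = -(s_sorted - suffix_logsumexp).sum(dim=-1).mean()

    # KL regularization toward the reference policy (here on the selection distribution).
    if ref_choice_logits is not None:
        ref_log_pi = F.log_softmax(ref_choice_logits, dim=-1)
        loss_kl = (log_pi.exp() * (log_pi - ref_log_pi)).sum(dim=-1).mean()
    else:
        loss_kl = torch.zeros((), device=choice_logits.device)

    # Entropy bonus: subtracted, so higher entropy lowers the loss.
    entropy = -(log_pi.exp() * log_pi).sum(dim=-1).mean()

    return loss_choice + lam_rank * loss_rank + lam_kl * loss_kl - lam_ent * entropy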

2. Derivation and Theoretical Motivation

CLPO emerges from the decomposition of a canonical RL objective that entangles both the correctness of answer selection ($a$) and the comparative quality of the rationale ($y$):

$$\max_\theta \; \mathbb{E}_{s,C,\,a,y \sim \pi_\theta}\left[ r(a, y; s) \right]$$

The credit assignment issue is addressed by splitting into separate updates:

  • The choice update uses a within-slate normalized advantage $A_k$, focusing learning on relative, not absolute, correctness.
  • The listwise ranking loss on justification log-probs ensures that rationales attached to better candidates are more likely under the model.

The Plackett–Luce loss factorizes the likelihood of the full reward-induced ranking into softmax terms over ranked suffixes, making the supervision signal more robust for sequence-level outputs than pointwise or pairwise variants.

KL and entropy regularization prevent policy drift and mode collapse in rationale generation. Variance reduction is achieved via within-slate baselining.
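
As a toy Python illustration of within-slate baselining (the reward values are made up for illustration):

# Toy illustration of within-slate baselining; rewards are made-up numbers.
rewards = [1.0, 0.0, 0.5]                      # r(c_1), r(c_2), r(c_3) for one slate
baseline = sum(rewards) / len(rewards)         # slate mean reward = 0.5
advantages = [r - baseline for r in rewards]   # A_k = [0.5, -0.5, 0.0]
# Candidate 1 is pushed up, candidate 2 down, and candidate 3 is left neutral,
# independent of the absolute reward scale, which is what reduces gradient variance.
print(advantages)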

A comparable formalism appears in LM alignment via LiPO (Liu et al., 2 Feb 2024), with the conditional listwise structure enforced through listwise rank losses (e.g., list-MLE, ListNet, or $\lambda$-weighted pairwise logistic) over model log-probs conditioned on a prompt.

3. Training Algorithm and Pseudocode

CLPO training involves generating candidate slates, computing rewards, and updating model parameters via the composite loss. The pseudocode below summarizes the Maestro CLPO update cycle:

for each q^i in batch:
    # Step 1: Candidate generation
    C^i = {c^i_1, ..., c^i_m}  # Execution Agents sample N·K candidates

    # Step 2: Candidate rewards and advantage
    for k in 1..m:
        r^i_k = reward(c^i_k)  # correctness + rationale heuristic
    mean_r = sum_k r^i_k / m
    A^i_k = r^i_k - mean_r

    # Step 3: Decision loss
    L_choice^i = -sum_k A^i_k * log pi_theta(k | s^i)

    # Step 4: Justification generation and ranking
    s^i_k = mean_tau log pi_theta(y_{k, tau} | y_{k,1:tau-1}, s^i)
    order sigma^i by descending r^i_k
    L_reason_rank^i = -sum_j log (exp(s^i_{sigma_j}) / sum_{l=j}^m exp(s^i_{sigma_l}))

L_total = mean_i L_choice^i + lambda_rank * mean_i L_reason_rank^i
          + lambda_kl * L_KL - lambda_ent * L_ent   # regularizers as in the total objective
update theta with Adam (or a related optimizer)
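
The length-normalized log-probability s^i_k used in Step 4 can be obtained from token-level logits as in the sketch below. This is a minimal PyTorch illustration under assumed tensor shapes; the function name and the padding-mask convention are hypothetical rather than taken from the published code.

import torch
import torch.nn.functional as F

def normalized_sequence_logprob(logits, target_ids, mask):
    """Length-normalized log-prob of one generated justification.

    logits:     (L, V) next-token logits for the justification, given the slate context
    target_ids: (L,)   generated token ids y_{k,1..L}
    mask:       (L,)   1.0 for real tokens, 0.0 for padding
    Returns a scalar: (1/L_k) * sum_tau log pi_theta(y_{k,tau} | y_{k,<tau}, s).
    """
    log_probs = F.log_softmax(logits, dim=-1)                                     # (L, V)
    token_logprobs = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)   # (L,)
    token_logprobs = token_logprobs * mask                                        # drop padding
    return token_logprobs.sum() / mask.sum().clamp(min=1.0)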

Hyperparameters such as batch size, optimizer, KL/entropy weights, LoRA rank, and backbone model selection follow the published guidelines (Section 4).

4. Implementation Parameters and Workflow

CLPO has been primarily deployed atop LLaMA and Qwen LLM backbones using parameter-efficient fine-tuning (LoRA: rank 16, alpha 32, dropout 0.05), executed at a batch size of 32 across 4×A100 (80GB) GPUs with mixed precision (bfloat16). Training uses the Adam optimizer ($\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$), an initial learning rate of $5\times 10^{-5}$ (cosine schedule), and gradient norm clipping at 1.0. Loss component weights generally use $\lambda_\text{rank}=0.8$, $\lambda_\text{kl}=0.1$, and $\lambda_\text{ent}=0.01$.

Candidate slate sizes typically result from $N$ execution agents each generating $K$ samples ($m = N \cdot K$), with $N=3$, $K=3$ in default configurations and up to $R=3$ consensus rounds. Decoding for decision tokens is deterministic (temperature 0.0), while rationale generation uses nucleus sampling ($T=0.7$, $p=0.95$).
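
These settings can be collected into a configuration sketch, shown below under the assumption of a Hugging Face PEFT/Transformers stack; the choice of target_modules and the schedule setup are assumptions not specified in the source.

from peft import LoraConfig
from transformers import GenerationConfig

# Parameter-efficient fine-tuning setup (LoRA rank/alpha/dropout as reported).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: typical attention projections
    task_type="CAUSAL_LM",
)

# Optimizer settings and loss weights as reported; the learning-rate schedule is cosine
# decay (e.g., transformers.get_cosine_schedule_with_warmup).
optimizer_kwargs = dict(lr=5e-5, betas=(0.9, 0.999), eps=1e-8)
loss_weights = dict(lam_rank=0.8, lam_kl=0.1, lam_ent=0.01)
max_grad_norm = 1.0  # gradient norm clipping

# Decoding: deterministic for the discrete decision, nucleus sampling for rationales.
decision_generation = GenerationConfig(do_sample=False)
rationale_generation = GenerationConfig(do_sample=True, temperature=0.7, top_p=0.95)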

5. Empirical Results and Comparative Analysis

CLPO-equipped Maestro achieves notable improvements over prior multi-agent and single-agent LLM collaboration schemes. On LLaMA-8B, Maestro with CLPO attains 89.33% on GSM8K (vs. 88.67% with GRPO, 87.69% with SFT) and 28.52% on AMC (vs. 26.30% with GRPO). Across a suite of reasoning and factual tasks—including MATH, AIME, MMLU, and HumanEval—CLPO delivers average absolute gains of approximately 6% and up to 10% in some backbone/benchmark settings. Substituting GRPO or SFT for CLPO yields consistent degradation (by 1–2% absolute).

Ablation studies show that both $L_\text{choice}$ and $L_\text{reason\_rank}$ are essential. Omitting either sharply degrades final performance; e.g., rationale-only ranking damages decision accuracy, while decision-only learning yields weak explanations. The optimal ranking weight $\lambda_\text{rank}$ lies in [0.5, 0.8]; lower or higher values reduce accuracy and rationale discrimination. Zero-shot prompting on GPT-4o-mini also benefits, with GSM8K improving from 94.89% (AgentPrune) to 95.60%.

Table: Benchmark Results — Maestro + CLPO, Multi-Agent Baselines, and Ablation (Yang et al., 8 Nov 2025)

System              GSM8K (%)    AMC (%)    HumanEval Pass@1
Maestro + CLPO      89.33        28.52      see text
GRPO                88.67        26.30      —
SFT                 87.69        —          —
Best prior (avg)    ≈6% lower on average across benchmarks (see text)

6. Relationship to Listwise LM Alignment Approaches

CLPO's listwise structure closely parallels the LiPO framework for LLM alignment (Liu et al., 2 Feb 2024). In LiPO, listwise ranks $\psi_1, \ldots, \psi_K$ (normalized preference labels) are attached per prompt-response to induce a variety of ranking losses, including pairwise logistic (as in DPO_BT), pairwise hinge (SLiC_norm), list-MLE (Plackett–Luce), and a $\lambda$-weighted listwise loss that exploits both label magnitude and model-predicted rank. The $l_\text{list-MLE}$ loss in LiPO:

$$l_\text{list-MLE}(\psi, s) = - \sum_{k=1}^K \log \frac{\exp(s_{\tau(k)})}{\sum_{j=k}^K \exp(s_{\tau(j)})}$$

where $\tau$ is the permutation induced by sorting the labels, is mathematically identical to the $L_\text{reason\_rank}$ component of Maestro CLPO.

LiPO-$\lambda$ ($l_\lambda$), which weights pairwise losses by gain and rank discount, yields consistent empirical improvements as list size increases and in comparison to DPO_PL or SLiC_norm. Notably, only the $\lambda$-weighted loss (as opposed to constant or partial weights) improves monotonically up to $K=8$ and outperforms prior pairwise and pointwise alignment methods in both reward-model and human evaluations. Results for Reddit TL;DR and AnthropicHH confirm that list-MLE/listwise losses enhance learning from comparative feedback.
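
A minimal sketch of such a λ-weighted pairwise logistic loss is given below; the 2^ψ - 1 gain and logarithmic rank discount are standard LambdaRank-style choices assumed here, and the function name is hypothetical rather than the LiPO reference implementation.

import torch
import torch.nn.functional as F

def lambda_weighted_pairwise_logistic(scores, labels):
    """Sketch of a lambda-weighted pairwise logistic loss in the spirit of LiPO-lambda.

    scores: (K,) model log-prob scores s_k for the K responses to one prompt
    labels: (K,) preference labels psi_k (higher = better)

    Gains come from label magnitude, discounts from the model-predicted rank; the
    specific gain/discount forms are assumptions, not taken verbatim from the paper.
    """
    gains = 2.0 ** labels - 1.0                               # G_i from label magnitude
    ranks = scores.argsort(descending=True).argsort() + 1     # model-predicted rank tau(i)
    inv_discount = 1.0 / torch.log2(ranks.float() + 1.0)      # 1 / D(tau(i))

    s_diff = scores.unsqueeze(1) - scores.unsqueeze(0)        # s_i - s_j for all pairs
    delta = (gains.unsqueeze(1) - gains.unsqueeze(0)).abs() * \
            (inv_discount.unsqueeze(1) - inv_discount.unsqueeze(0)).abs()
    # Only pairs where response i is labeled strictly better than response j contribute.
    better = (labels.unsqueeze(1) > labels.unsqueeze(0)).float()

    pairwise = -F.logsigmoid(s_diff)                          # logistic loss on s_i - s_j
    return (better * delta * pairwise).sum() / better.sum().clamp(min=1.0)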

7. Strengths, Limitations, and Future Directions

CLPO offers a clean separation between decision and rationale supervision, leading to improved credit assignment and stable convergence even with large or diverse candidate pools. Listwise supervision reduces noise compared to traditional sequence-level RL, and empirical results indicate superiority across model sizes, tasks, and collaboration strategies.

However, CLPO targets only the convergence (“synthesis”) phase in Maestro; the exploration (“generation”) agents remain fixed, potentially missing additional joint optimization gains. Computational complexity scales as $O(m^2)$ for large slates in the worst case, constraining applicability to extensive candidate sets. The reward function $r(c)$, which guides both selection and ranking, can itself be a source of bias or inadequacy if not carefully constructed.

Directions for future research include extending CLPO-style objectives to jointly optimize both execution and synthesis agents, dynamically adjusting slate sizes, end-to-end optimization across all roles, and broadening to open-ended or unstructured domains such as dialogue or planning. Improvements in reward elicitation and handling uncertainty in ranking are also noted as promising enhancements.

CLPO thus constitutes a foundational method for robust, reliable collaborative reasoning and policy alignment in large-scale multi-agent and single-model LLM systems, offering theoretical clarity and demonstrated empirical utility.
