
Multi-Turn Preference Optimization

Updated 4 March 2026
  • MTPO is a framework that aligns language agents by optimizing multi-turn decision trajectories using cumulative reward signals in finite-horizon MDPs.
  • It employs occupancy measure regularization and a length-normalized Bradley–Terry model to offset biases from varying trajectory lengths and ensure stable optimization.
  • Empirical results show that MTPO outperforms single-turn methods, enhancing agent performance in dialogue control, sequential planning, and tool usage tasks.

Multi-Turn Preference Optimization (MTPO) is an advanced framework for aligning language agents and LLMs with human or task-specific preferences over extended interactions. Unlike single-turn preference optimization—which focuses on pointwise or isolated decision steps—MTPO operates over multi-step trajectories within Markov Decision Processes (MDPs), directly optimizing for agent behaviors that maximize cumulative reward or preference signals at the trajectory or segment level. This paradigm has catalyzed substantial advances in agent alignment, dialogue control, tool usage, and sequential planning.

1. Formalization and Motivation

MTPO is formulated in the context of finite-horizon MDPs, where a policy $\pi_\theta$ generates action sequences $(a_0, \ldots, a_{T-1})$ in response to evolving textual states $(s_0, \ldots, s_{T-1})$. A trajectory $\tau = (s_0, a_0, \ldots, s_{T-1}, a_{T-1})$ accumulates a discounted reward $\mathbb{E}[\sum_{t=0}^{T-1}\gamma^t r(s_t,a_t)]$ under the transition dynamics. The essential challenge is to directly optimize policies using preference data, typically labeled pairs of full multi-turn trajectories $(\tau^w, \tau^l)$ marked as "win" (preferred) and "lose" (dispreferred), as a substitute for explicit reward models or hand-crafted feedback.
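
For concreteness, the following is a minimal Python sketch of the trajectory abstraction this formalization implies; the field names and the `discounted_return` helper are illustrative, not constructs from the paper.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    states: list[str]     # textual observations s_0, ..., s_{T-1}
    actions: list[str]    # agent actions a_0, ..., a_{T-1}
    rewards: list[float]  # per-turn rewards r(s_t, a_t)

def discounted_return(traj: Trajectory, gamma: float = 0.99) -> float:
    """Cumulative discounted reward: sum_t gamma^t * r(s_t, a_t)."""
    return sum(gamma**t * r for t, r in enumerate(traj.rewards))
```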

Classic Direct Preference Optimization (DPO) is provably effective in single-turn or myopic settings, leveraging a Bradley–Terry probabilistic model for pairwise preferences. However, applying DPO naively to multi-turn tasks fails due to two core obstacles:

  1. The partition function $Z(s_t)$ in the MaxEnt-optimal policy is state- and step-dependent, and thus cannot be cancelled between differing trajectories.
  2. Preference trajectory pairs $(\tau^w, \tau^l)$ commonly differ in length ($T_w \ne T_l$), introducing additive biases proportional to the length disparity that classical DPO cannot normalize away (a toy illustration follows at the end of this section).

This constrains the expressivity and robustness of single-turn preference optimization in sequential environments, motivating new approaches that properly aggregate and normalize preference signals over whole-agent behaviors (Shi et al., 2024).
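
To make the second obstacle concrete, here is a toy calculation (the constant per-turn margin is purely hypothetical, chosen for illustration): under a naive unnormalized trajectory score, the discounted sum grows with trajectory length, so a comparison between a short "win" and a long "lose" trajectory can be dominated by length alone.

```python
gamma = 0.99
margin = 0.5  # hypothetical constant per-turn log-ratio contribution

def naive_score(T: int) -> float:
    # unnormalized discounted sum: grows with T even at fixed per-turn quality
    return sum(gamma**t * margin for t in range(T))

print(round(naive_score(3), 3), round(naive_score(10), 3))  # 1.485 vs 4.781
```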

2. Core Methodological Advances and Loss Construction

2.1 Occupancy Measure Regularization

MTPO generalizes DPO by lifting optimization from raw policy distributions to discounted state–action occupancy measures (SAOMs). Instead of per-step policy KL penalties, the regularized RL objective is expressed as:

$$\max_{\pi_\theta}\; \mathbb{E}_{(s,a)\sim d^{\pi_\theta}}\big[r(s,a)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[d^{\pi_\theta} \,\big\|\, d^{\pi_{\mathrm{ref}}}\right]$$

where $d^{\pi}(s,a)$ is the normalized, discounted occupancy measure, and $\pi_{\mathrm{ref}}$ is a fixed reference policy (e.g., a supervised fine-tuned baseline). Crucially, in the MaxEnt RL framework, the ratio $d^{\pi^*}/d^{\pi_{\mathrm{ref}}}$ eliminates any residual partition function dependence, removing state- and trajectory-length biases (Shi et al., 2024).
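
As a rough intuition for what $d^{\pi}(s,a)$ measures, here is a minimal Monte Carlo sketch; it assumes rollouts arrive as lists of (state, action) pairs, and real textual states would need canonical hashing or featurization before counting.

```python
from collections import defaultdict

def estimate_occupancy(rollouts, gamma=0.99):
    """Empirical discounted state-action occupancy measure d^pi(s, a):
    each visit to (s, a) at step t contributes weight gamma^t."""
    weights, total = defaultdict(float), 0.0
    for rollout in rollouts:
        for t, (s, a) in enumerate(rollout):
            weights[(s, a)] += gamma**t
            total += gamma**t
    # normalize so the estimated measure sums to 1 over visited pairs
    return {sa: w / total for sa, w in weights.items()}
```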

2.2 Length-normalized Multi-turn Bradley–Terry Model

To convert preference supervision over whole trajectories into a practical loss, MTPO introduces a length-normalized Bradley–Terry likelihood:

$$p(\tau^w \succ \tau^l) = \sigma\!\left[\frac{1-\gamma}{1-\gamma^{T_w}} \sum_{t=0}^{T_w-1} \gamma^t\, r(s_t^w, a_t^w) \;-\; \frac{1-\gamma}{1-\gamma^{T_l}} \sum_{t=0}^{T_l-1} \gamma^t\, r(s_t^l, a_t^l)\right]$$

This normalization re-weights each turn's contribution and ensures equitable comparison across different-length trajectories. The practical implementation then substitutes $r(s,a)$ with its occupancy-measure-based closed form, neutralizing partition function artifacts (Shi et al., 2024).
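
The effect of the normalization can be checked directly. In the sketch below (plain Python, names illustrative), a constant per-turn reward of 1 yields a normalized return of exactly 1 for any horizon $T$, whereas the raw discounted sum $(1-\gamma^T)/(1-\gamma)$ keeps growing with length.

```python
import math

def normalized_return(rewards, gamma=0.99):
    """Length-normalized discounted return from the Bradley-Terry model."""
    T = len(rewards)
    scale = (1 - gamma) / (1 - gamma**T)
    return scale * sum(gamma**t * r for t, r in enumerate(rewards))

def preference_prob(rewards_w, rewards_l, gamma=0.99):
    """sigma(normalized return difference): P(tau_w preferred over tau_l)."""
    diff = normalized_return(rewards_w, gamma) - normalized_return(rewards_l, gamma)
    return 1.0 / (1.0 + math.exp(-diff))

print(normalized_return([1.0] * 3), normalized_return([1.0] * 10))  # both ~1.0
```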

2.3 Closed-Form MTPO Loss and Implementation

The complete MTPO (DMPO) loss is:

$$\mathcal{L}_{\mathrm{DMPO}} = -\,\mathbb{E}_{(\tau^w,\, \tau^l)} \log \sigma\!\left[\beta \sum_{t=0}^{T_w-1} \phi(t,T_w)\, \log\frac{\pi_\theta(a_t^w \mid s_t^w)}{\pi_{\mathrm{ref}}(a_t^w \mid s_t^w)} \;-\; \beta \sum_{t=0}^{T_l-1} \phi(t,T_l)\, \log\frac{\pi_\theta(a_t^l \mid s_t^l)}{\pi_{\mathrm{ref}}(a_t^l \mid s_t^l)}\right]$$

with $\phi(t, T) = \gamma^t\,(1-\gamma^{T-t})/(1-\gamma^T)$ prioritizing early decisions. The optimization can be implemented efficiently using gradient steps over minibatches of trajectory pairs, requiring only log-likelihood ratios from the current and reference models; no supplementary reward model is needed during training (Shi et al., 2024). Pseudocode for a gradient update:

loss = 0
for (τ^w, τ^l) in minibatch D:
    V_w = β * sum(φ(t, T_w) * log(π_θ(a_t^w|s_t^w) / π_ref(a_t^w|s_t^w)) for t in range(T_w))
    V_l = β * sum(φ(t, T_l) * log(π_θ(a_t^l|s_t^l) / π_ref(a_t^l|s_t^l)) for t in range(T_l))
    loss += -log σ(V_w - V_l)
update θ via ∇_θ loss
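
A minimal PyTorch sketch of this update, assuming per-turn action log-probabilities have already been gathered from the current and reference models (function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def phi_weights(T: int, gamma: float) -> torch.Tensor:
    """phi(t, T) = gamma^t * (1 - gamma^(T - t)) / (1 - gamma^T)."""
    t = torch.arange(T, dtype=torch.float32)
    return gamma**t * (1 - gamma**(T - t)) / (1 - gamma**T)

def dmpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, gamma=0.99, beta=0.1):
    """DMPO loss for one preference pair; each argument is a 1-D tensor
    of per-turn action log-probabilities (lengths T_w and T_l)."""
    v_w = beta * (phi_weights(len(logp_w), gamma) * (logp_w - ref_logp_w)).sum()
    v_l = beta * (phi_weights(len(logp_l), gamma) * (logp_l - ref_logp_l)).sum()
    return -F.logsigmoid(v_w - v_l)
```

In practice, each per-turn log-probability is obtained by summing token log-probabilities within the turn, and the loss is averaged over a minibatch of pairs before backpropagation.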

3. Theoretical Properties and Insights

The MTPO (DMPO) formulation yields several rigorous properties:

  • Compounding error mitigation: The time-dependent reweighting $\phi(t,T)$ decreases with $t$, so earlier state–action decisions are emphasized. This directly addresses the compounding error problem that plagues naive per-turn objectives.
  • Single-turn reduction: In the limit $\gamma \to 0$, $\phi(0,T)=1$ while all later weights vanish, so DMPO collapses exactly to the single-turn DPO objective (see the numeric check after this list).
  • Adaptive weighting: The sigmoidal BT likelihood automatically amplifies loss on trajectory pairs where the model is uncertain.
  • Stability: DMPO is a principled saddle-point approximation to the regularized maximum-entropy RL objective and retains empirical optimization stability. No formal global convergence proof is provided due to nonconvexity, but loss curves are well-behaved in extensive experiments (Shi et al., 2024).
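
A quick numeric check of the first two properties (plain Python, values rounded):

```python
def phi(t, T, gamma):
    """phi(t, T) = gamma^t * (1 - gamma^(T - t)) / (1 - gamma^T)."""
    return gamma**t * (1 - gamma**(T - t)) / (1 - gamma**T)

T = 5
print([round(phi(t, T, 0.9), 3) for t in range(T)])
# [1.0, 0.756, 0.536, 0.338, 0.16] -> weights decrease with t
print([round(phi(t, T, 1e-4), 3) for t in range(T)])
# [1.0, 0.0, 0.0, 0.0, 0.0]        -> gamma -> 0 recovers single-turn DPO
```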

4. Empirical Evidence and Benchmarks

DMPO’s effectiveness has been extensively validated on MDP-style agent tasks:

  • Datasets: Evaluated on WebShop (shopping/task completion, [0,1] reward), ScienceWorld (science-experiment execution, [0,1] subgoal reward), and ALFWorld (embodied household tasks, binary completion reward). All environments utilize ReAct-inspired text observation/action schemas.
  • Baselines: Compared to no tuning, supervised fine-tuning (SFT), best-of-$N$ sampling, rejection-sampling fine-tuning (RFT), PPO, ETO (trial-and-error exploration + DPO), and single-turn DPO.
  • Main findings:
    • In noisy settings (base Llama-2-7B-Chat, Mistral-7B), DMPO outperforms single-turn DPO by 2–5 points in mean reward.
    • In clean settings (Llama-2-7B, WebShop/ScienceWorld), DMPO outperforms ETO, PPO, SFT, and best-of-$N$ (0.701/0.724 vs. 0.698/0.685 for ETO), and surpasses GPT-3.5.
    • Robustness under noisy or length-imbalanced preference data is demonstrated; single-turn DPO deteriorates when “lose” trajectories lengthen, but DMPO remains robust due to explicit normalization.
  • Ablations: Key hyperparameters ($\gamma$, normalization functions) are probed. Smaller $\gamma$ is favored under noisy data, higher $\gamma$ in cleaner settings. Length normalization is critical for performance stability (Shi et al., 2024).

5. Specializations and Comparative Methods

Numerous extensions and domain-specific variants of MTPO have emerged:

  • Dialog and Tool-Augmented Settings: DiaTool-DPO adapts multi-turn DPO with length and margin normalization in tool-augmented LLMs, demonstrating strong slot-filling and rejection capabilities over baselines (Jung et al., 2 Apr 2025).
  • Segment-Level Optimization: SDPO restricts loss computation to minimal “error” segments in dialogue (as determined by LLMs), dramatically reducing label noise and demonstrating superior performance on social agent benchmarks (Kong et al., 3 Jan 2025).
  • Iterative Policy Optimization: Iterative PPO recasts the multi-turn RLHF problem into an alternation between fitting a multi-turn Q-function and performing single-turn PPO updates, simplifying implementation while ensuring policy improvement (Jiang et al., 26 Nov 2025).
  • Preference-Guided Molecular Optimization: In molecular lead optimization, PGPO combines trajectory-level reinforcement learning with intra-trajectory pairwise preference learning, harvesting direct supervision at both granularities and achieving state-of-the-art sample efficiency (Wang et al., 26 Sep 2025).
  • Multimodal and Math Agents: Other works integrate multi-turn preference optimization with multimodal (vision–language) settings (Chen et al., 29 May 2025) and chain-of-thought math agents (Xiong et al., 2024), demonstrating consistent improvements over fine-tuning and single-turn preference methods.

A summary table of representative techniques:

| Method | Domain/Setting | Key Features | Benchmark Gains |
|--------|----------------|--------------|-----------------|
| DMPO (Shi et al., 2024) | Language agents | SAOM normalization, trajectory-level BT | 2–5 points avg. reward |
| DiaTool-DPO (Jung et al., 2 Apr 2025) | Tool-augmented LLMs | State-mapped DPO, margin, turn normalization | 44% slot accuracy ↑ |
| SDPO (Kong et al., 3 Jan 2025) | Social dialogue | Segment-level loss, expert segmenting | SOTOPIA SOTA |
| Iterative PPO (Jiang et al., 26 Nov 2025) | Conversational RL | Q-function reduction, single-turn PPO | 10–15% conv. rate ↑ |
| PGPO (Wang et al., 26 Sep 2025) | Chemistry | Dual-level (trajectory + intra-trajectory) preferences | 2.3× best baseline |

All values as reported in respective references.

6. Practical Considerations, Limitations, and Future Work

Key practical insights for MTPO deployment include:

  • Early decision weighting via $\phi(t,T)$ mitigates downstream error accumulation, promoting expert-like behavior traces.
  • Robustness to noisy, imbalanced, and varying-length preference data is a direct benefit of length normalization.
  • Minimal infrastructure is required: only instantaneous likelihood ratios are needed; there is no need to train auxiliary reward models.
  • Hyperparameters to tune include $\beta$ (policy divergence strength), $\gamma$ (time discounting), and, for some variants, margin or segment-length settings; a configuration sketch follows this list.
  • Principal limitations: Demonstrated predominantly on 7B-parameter models, with reward signals at turn or trajectory granularity. Extending to token-level objectives and scaling to larger models remain open topics.
  • Future directions involve richer behavioral granularity (token-level, multi-tool actions), more sophisticated segment selection (via learned or human-in-the-loop processes), multi-agent or adversarial settings, and expansion to further domains such as tool-use, planning, and multi-modal reasoning (Shi et al., 2024).
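
As a starting point for tuning, a hedged configuration sketch; the names and defaults below are assumptions for exposition, not values prescribed by the cited papers.

```python
from dataclasses import dataclass

@dataclass
class MTPOConfig:
    beta: float = 0.1            # strength of divergence penalty vs. the reference policy
    gamma: float = 0.99          # time discount; smaller values favored under noisy data
    margin: float = 0.0          # optional preference margin (DiaTool-DPO-style variants)
    segment_level: bool = False  # restrict loss to error segments (SDPO-style variants)
```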

7. Significance in Sequential Learning and Agent Alignment

MTPO stands as a theoretically principled and empirically validated framework for aligning agentic LLMs with desired long-horizon, sequential behaviors. By addressing the structural limitations of per-turn objectives, introducing length and trajectory normalization, and generalizing to occupancy measures and segment-level preferences, MTPO achieves both robust theoretical justifications and substantial empirical gains across reinforcement learning tasks, tool-use, and complex agentic domains (Shi et al., 2024, Jung et al., 2 Apr 2025, Kong et al., 3 Jan 2025, Jiang et al., 26 Nov 2025). Its architectural and methodological flexibility enables its deployment in diverse agentic systems, providing a scalable foundation for dynamic, user-aligned autonomous agents.
