Multi-turn Preference Optimization
- Multi-turn Preference Optimization (MTPO) is a framework that aligns language agents with human preferences by optimizing entire interaction trajectories rather than isolated responses.
- It formulates decision-making as a Markov Decision Process, using trajectory scoring functions, discounting schemes, and pairwise comparisons to calibrate policy decisions.
- MTPO improves error correction, tool invocation, and strategic planning across applications like dialogue, coding, and recommendation, yielding measurable performance gains.
Multi-Turn Preference Optimization (MTPO) is a framework for aligning language agents—especially LLMs—with human preferences in multi-turn, temporally extended decision-making tasks. Rather than optimizing for individual responses, MTPO operates over entire interaction trajectories or targeted trajectory segments. This orientation enables robust preference alignment, tool invocation control, error correction, and strategic planning in domains ranging from dialogue agents to tool-augmented reasoning, embodied tasks, coding, and recommendation.
1. Formal Principles of MTPO
MTPO generalizes standard single-turn preference optimization (such as Direct Preference Optimization, DPO) to settings where the agent’s decision-making unfolds over multiple dialogue turns, action steps, or session segments. The technical essence of MTPO is to optimize policies via trajectory- or segment-wise preference comparison, under constraints appropriate for Markov Decision Processes (MDPs) or more general sequential environments.
Markov Decision Process Formulation
The decision process is represented as an MDP $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ (a code-level sketch follows this list):
- $\mathcal{S}$: set of states (e.g., dialogue context, tool selection stage, environment).
- $\mathcal{A}$: allowed actions per state (slot-filling queries, tool calls, completions, etc.).
- $P$: state transition function, often deterministic in tool-augmented settings.
- $r$: reward (may be implicit, arising from log-probability ratios or human preferences instead of explicit scalars).
- $\gamma$: discount factor, controlling the weighting of early vs. late trajectory segments.
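The following is a minimal Python sketch of how a multi-turn trajectory over this MDP might be represented for a tool-augmented dialogue agent; the class and field names are illustrative assumptions rather than part of any cited framework.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Turn:
    """One step of the MDP: the state the agent observed and the action it took."""
    state: str   # serialized dialogue history + tool schemas + slot-filling status
    action: str  # agent output: a tool call, a clarification question, or a reply

@dataclass
class Trajectory:
    """A full multi-turn interaction -- the unit that MTPO compares pairwise."""
    turns: list[Turn] = field(default_factory=list)
    # Optional scalar outcome (e.g., task success from an oracle). Many MTPO
    # variants never use an explicit reward and rely on pairwise preferences only.
    outcome: Optional[float] = None

    def __len__(self) -> int:
        return len(self.turns)
```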
2. Objective Functions and Losses
The core goal of MTPO is to ensure that a learned policy assigns higher probability to preferred (human-aligned) trajectories than to dispreferred ones. This is formalized via a pairwise trajectory comparison—typically cast as a Bradley–Terry–Luce (BT) model over utilities or log-likelihood ratios.
Generalized Objective
$$\mathcal{L}_{\mathrm{MTPO}}(\theta) = -\,\mathbb{E}_{(\tau^{w},\,\tau^{l})}\left[\log \sigma\!\big(S_\theta(\tau^{w}) - S_\theta(\tau^{l})\big)\right],$$

where $S_\theta(\tau)$ is a trajectory scoring function, often a sum of per-turn log-probability margins between the learned policy $\pi_\theta$ and a reference policy $\pi_{\mathrm{ref}}$:

$$S_\theta(\tau) = \beta \sum_{t=1}^{|\tau|} w_t \,\log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)},$$

with $\sigma$ the logistic function and $w_t$ as discounting or normalization weights (see below).
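The PyTorch sketch below illustrates this objective under simplifying assumptions: per-turn log-probabilities are already summed over the tokens of each assistant turn, the weights are taken as $w_t = \gamma^t$, and $\beta$ is a fixed hyperparameter. Function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def trajectory_score(policy_logps: torch.Tensor,
                     ref_logps: torch.Tensor,
                     gamma: float = 0.95,
                     beta: float = 0.1) -> torch.Tensor:
    """Discounted sum of per-turn log-probability margins, S_theta(tau).

    policy_logps, ref_logps: shape (T,) -- log pi(a_t | s_t) for each of T turns.
    """
    T = policy_logps.shape[0]
    weights = gamma ** torch.arange(T, dtype=policy_logps.dtype)
    return beta * torch.sum(weights * (policy_logps - ref_logps))

def mtpo_pairwise_loss(win_policy_logps, win_ref_logps,
                       lose_policy_logps, lose_ref_logps) -> torch.Tensor:
    """Bradley-Terry pairwise loss over two trajectories: -log sigma(S_w - S_l)."""
    s_w = trajectory_score(win_policy_logps, win_ref_logps)
    s_l = trajectory_score(lose_policy_logps, lose_ref_logps)
    return -F.logsigmoid(s_w - s_l)
```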
Extensions and Refinements
- Turn/Segment Weighting: Discounting (e.g., $w_t = \gamma^t$), length normalization, and reward-margin thresholds are frequently introduced to neutralize biases from variable trajectory lengths and amplify learning on critical trajectory sections (Jung et al., 2 Apr 2025, Shi et al., 21 Jun 2024).
- Segment-level Loss: SDPO focuses loss computation on a short segment (identified via error localization), driving targeted optimization while minimizing training noise (Kong et al., 3 Jan 2025); see the masking sketch after this list.
- Trajectory, Turn-Level and Hybrid Objectives: Methods such as PGPO combine trajectory-level RL (e.g., PPO-based policy updates) with dense intra-trajectory pairwise preference loss, facilitating fine-grained credit assignment and improved sample efficiency (Wang et al., 26 Sep 2025).
- Entropy Regularization: To preserve response diversity, e.g., in coding or tool-use contexts with test-time scaling, entropy-augmented objectives are incorporated (Yu et al., 15 Sep 2025).
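As an illustration of the segment-level idea, the sketch below masks the per-turn log-ratio margins so that only turns inside an identified error segment contribute to the loss. The segment indices are assumed to come from an external error-localization step, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def segment_pairwise_loss(win_logratios: torch.Tensor,
                          lose_logratios: torch.Tensor,
                          seg_start: int, seg_end: int,
                          beta: float = 0.1) -> torch.Tensor:
    """Pairwise loss restricted to turns [seg_start, seg_end) of matched-length segments.

    win_logratios, lose_logratios: shape (T,) per-turn values of
    log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t) for the preferred / dispreferred trajectory.
    """
    mask = torch.zeros_like(win_logratios)
    mask[seg_start:seg_end] = 1.0      # zero out loss contributions outside the segment
    s_w = beta * torch.sum(mask * win_logratios)
    s_l = beta * torch.sum(mask * lose_logratios)
    return -F.logsigmoid(s_w - s_l)
```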
3. Dataset Construction and Trajectory Pairing Strategies
Dataset construction under MTPO must yield reliable positive (preferred) vs. negative (dispreferred) trajectory (or segment) pairs corresponding to the same user query or context. Strategies include:
- Automatic Augmentation: For tool-augmented LLMs, begin from existing function-calling datasets, generate incomplete queries, and produce synthetic slot-filling or rejection flows using strong LLMs (e.g., GPT-4o) (Jung et al., 2 Apr 2025).
- Trajectory Segment Mining: Identify minimal error-causing segments in low-scoring dialogues, extract matched-length "win"/"lose" pairs for focused optimization (Kong et al., 3 Jan 2025).
- Simulation-Based Feedback: Employ user simulators (e.g., AILO) or deterministic code interpreters to automate preference feedback and satisfaction diagnostics (Feng et al., 17 Jun 2025, Xiong et al., 4 Sep 2024).
- Oracle and Tool Feedback: In mathematical or lead-optimization settings, use deterministic toolchains (code interpreters, molecular property oracles) for evaluating final solutions as well as collecting dense pairwise preferences by intra-trajectory ranking (Wang et al., 26 Sep 2025, Xiong et al., 4 Sep 2024); a pair-construction sketch follows this list.
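As a sketch of the dense intra-trajectory pairing idea, the snippet below turns a single list of oracle-scored candidates into all quadratically many preference pairs; the data layout is an assumption for illustration.

```python
from itertools import combinations

def scored_candidates_to_pairs(outputs: list[str], scores: list[float]):
    """Build (preferred, dispreferred) pairs from oracle-scored candidates.

    One expensive oracle call per candidate yields O(n^2) training pairs,
    which is the source of the sample-efficiency gains discussed in Section 6.
    """
    pairs = []
    for (o_i, s_i), (o_j, s_j) in combinations(zip(outputs, scores), 2):
        if s_i == s_j:
            continue  # skip ties: no preference signal
        win, lose = (o_i, o_j) if s_i > s_j else (o_j, o_i)
        pairs.append((win, lose))
    return pairs
```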
4. Algorithmic and Architectural Considerations
Optimization Procedures
- Batching and Update Rules: Training typically proceeds by batching multiple pairs, summing or averaging per-turn log-ratios (possibly with discounting), and applying AdamW or similar optimizers (Jung et al., 2 Apr 2025, Kong et al., 3 Jan 2025); a minimal training-loop sketch follows this list.
- Mirror Descent and Occupancy Constraints: Some approaches optimize over trajectories' state-action occupancy measures to enable length-normalized, partition function–canceling training, thus avoiding instability endemic to direct per-step DPO extensions (Shi et al., 21 Jun 2024).
- Optimistic Online Updates and Markov Games: OMPO and similar algorithms model policy competition as a Markov game, using optimistic mirror descent for fast convergence to Nash equilibria under general, possibly non-transitive, human preferences (Wu et al., 18 Feb 2025).
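The training-loop sketch below illustrates the batching and update rule described above. It assumes a pairwise loss like the one in Section 2, plus caller-supplied functions that compute per-turn log-probabilities under the current policy (with gradients) and under the frozen reference policy; all of these are illustrative assumptions.

```python
import torch

def train_mtpo(policy, policy_logp_fn, ref_logp_fn, pair_loader,
               pairwise_loss_fn, epochs: int = 1, lr: float = 1e-6):
    """Batch preference pairs, average the pairwise losses, and update with AdamW.

    policy_logp_fn(policy, traj) -> per-turn log-probs with gradients;
    ref_logp_fn(traj)            -> per-turn log-probs under the frozen reference policy.
    """
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in pair_loader:            # batch: list of (win_traj, lose_traj) pairs
            losses = []
            for win, lose in batch:
                losses.append(pairwise_loss_fn(
                    policy_logp_fn(policy, win), ref_logp_fn(win),
                    policy_logp_fn(policy, lose), ref_logp_fn(lose)))
            loss = torch.stack(losses).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```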
Model Architecture and Inputs
- Tool-Augmented Inputs: The state encodes available tools (as JSON schemas), slot-filling status, and full dialogue history for every action; a serialization sketch follows this list.
- Multi-modality: In multimodal contexts, image-text pairs, editable visual representations, and interleaved context windows are serialized and embedded for training both preference models and agent policies (Chen et al., 29 May 2025).
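The sketch below shows one way such a tool-augmented state might be serialized into a single model input; the section headers and field names are illustrative assumptions rather than a prescribed format.

```python
import json

def serialize_state(tool_schemas: list[dict],
                    slot_status: dict,
                    dialogue_history: list[dict]) -> str:
    """Flatten tool schemas, slot-filling status, and dialogue history into one prompt."""
    parts = ["## Available tools"]
    parts += [json.dumps(schema, indent=2) for schema in tool_schemas]
    parts.append("## Slot-filling status")
    parts.append(json.dumps(slot_status, indent=2))
    parts.append("## Dialogue history")
    parts += [f"{msg['role']}: {msg['content']}" for msg in dialogue_history]
    return "\n".join(parts)
```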
5. Empirical Evaluation and Impact
MTPO-driven models are systematically evaluated on domain-specific testbeds and benchmarks, with criteria ranging from granular per-turn metrics to high-level, holistic alignment judgments.
| Domain / Task | MTPO Variant | Baseline | MTPO Result | Comparison / Gain |
|---|---|---|---|---|
| Tool-augmented Dialog | DiaTool-DPO | SFT | Slot: 0.639 → 0.917<br> Relevance: 0.826 → 0.913 | 94.8% of GPT-4o slot perf.; +44 pp over SFT (Jung et al., 2 Apr 2025) |
| Social Dialogue | SDPO (Segment) | Session/Turn DPO | Goal: 8.56/3.69 (self-chat) | Outperforms all prior methods; best on "hard" data (Kong et al., 3 Jan 2025) |
| Molecular Lead Opt. | PGPO | PPO, DPO, etc. | 84% SR (single) / 50% (multi) | ≥2.3× over best baseline; high sample efficiency (Wang et al., 26 Sep 2025) |
| Coding Agents | entropy-MTPO | SFT, DPO, KTO | Up to 59.4% pass@1 (30B) | 1st open-weight model on SWE-bench Lite (Yu et al., 15 Sep 2025) |
| Education Dialogue | mirror-descent MTPO | RLHF, PPO | Wins >65% in pref. tests | Provable convergence, near reward-RL with only preferences (Shani et al., 23 May 2024) |
| Conversational Recommendation | ECPO | SFT, Tree-MTPO | Win rate: 0.57 vs 0.48; 55% less overhead | SOTA turn-level alignment, minimal LLM calls (Feng et al., 17 Jun 2025) |
| Tool-integrated Math | Multi-Turn DPO/KTO | SFT | GSM8K: 84.1 → 86.3%<br>MATH: 51.0 → 54.5% | +5–6 pts over SFT; beats larger SFT-only baselines (Xiong et al., 4 Sep 2024) |
| Planning+Preference | COMPASS MTPO | - | Acceptable/Optimal Gap: >20% | Real-world multi-tool planning under constraints and preferences (Qin et al., 8 Oct 2025) |
Strong empirical performance is consistently obtained by incorporating preference signals across multi-turn choices, managing trajectory-length effects, and encoding architectural priors that reflect domain structure.
6. Key Theoretical and Practical Insights
MTPO unifies several theoretical advances for sequential decision-making with human- or oracle-provided preferences:
- Partition Function Handling: Transitioning from per-step to occupancy-based or segment-level normalization enables efficient multi-turn training (Shi et al., 21 Jun 2024, Kong et al., 3 Jan 2025).
- Segment vs. Whole-Trajectory Learning: Segment-level methods reduce training noise and focus optimization signal on error-prone regions, accelerating convergence and improving generalization (Kong et al., 3 Jan 2025).
- Entropy Preservation: In open-ended or search-heavy domains, explicit entropy bonuses prevent mode collapse, allowing effective test-time scaling and parallel trajectory evaluation (Yu et al., 15 Sep 2025).
- Sample Efficiency: By extracting dense pairwise feedback from within-trajectory rankings (quadratically many pairs per trajectory), MTPO variants dramatically increase the learning signal obtained per expensive oracle call (Wang et al., 26 Sep 2025).
- Provable Guarantees: Game-theoretic and mirror-descent approaches yield finite-sample convergence bounds to Nash equilibria under minimal occupancy and regularity assumptions, even with general (non-transitive) preferences (Shani et al., 23 May 2024, Wu et al., 18 Feb 2025).
7. Extensions, Limitations, and Future Directions
MTPO has demonstrated robust capability in agent alignment, but notable challenges and avenues remain:
- Non-stationary Human Preferences: Stability under shifting human standards, distributional mismatch between simulators and real users, and generalization to under-specified or ambiguous goals require further research (Feng et al., 17 Jun 2025, Qin et al., 8 Oct 2025).
- Scaling to High-Dimensional, Real-World Tasks: Most current MTPO works address moderately sized models and synthetic environments. Scaling to truly web-scale planning, tool use in-the-wild, and massive codebases is only beginning (Yu et al., 15 Sep 2025, Qin et al., 8 Oct 2025).
- Unified Preference Collection: Efficient, low-overhead dataset construction—potentially via real-time interleaving of human feedback and model generation—is crucial for practical deployment (Feng et al., 17 Jun 2025, Chen et al., 29 May 2025).
- Integration with Broader RLHF Stacks: Combining MTPO with reward modeling methodologies, token-level feedback, and model editing further broadens its applicability (Shi et al., 21 Jun 2024, Kong et al., 3 Jan 2025).
- Multi-modal and Interleaved Contexts: Rapid expansion to audio–visual, embodied, and interleaved text–image–action domains is underway, driven by dataset and framework innovations (Chen et al., 29 May 2025).
In summary, MTPO provides the mathematical, algorithmic, and empirical foundations for robust, long-horizon preference alignment in language- and tool-augmented agents, anchoring the next generation of LLM applications across dialogue, planning, recommendation, embodied reasoning, coding, and beyond.