DMPO: Direct Multi-Turn Preference Optimization

Updated 1 November 2025
  • DMPO is a framework that extends single-turn DPO to multi-turn settings by capturing trajectory-level, multi-aspect preferences using state-action occupancy measures.
  • It introduces length normalization and discounting to mitigate bias in comparing trajectories of different lengths, ensuring fair optimization.
  • DMPO enables robust multi-turn reasoning in applications like dialogue and tool use, demonstrating strong empirical gains over traditional methods.

Direct Multi-Turn Preference Optimization (DMPO) is a class of algorithms and mathematical frameworks that extend the Direct Preference Optimization (DPO) paradigm from single-turn to multi-turn, trajectory-level, and multi-aspect alignment for LLMs and language agents in sequential or interactive tasks. DMPO is designed to address challenges unique to multi-step reasoning, dialogue, tool use, and real-world agent settings, where preference feedback and reward structures are complex, interdependent, and often span entire trajectories.

1. Theoretical Foundations and Mathematical Formulation

Direct Multi-Turn Preference Optimization seeks to generalize DPO—which typically optimizes an LLM's policy using pairwise preference signals over single outputs—to settings involving multi-turn interactions represented as trajectories within a Markov Decision Process (MDP) framework. The central challenge is extending the Bradley-Terry (BT) likelihood model and KL-regularized objectives from reward modeling in RLHF to robust, unbiased optimization of full interactive sequences.

For a trajectory $\tau = (x, a_1, s_1, a_2, \ldots, a_T)$ consisting of prompt $x$, states $s_t$, and agent actions $a_t$, DMPO replaces direct policy-based KL divergence with a state-action occupancy measure (SAOM) constraint. The occupancy measure $d^\pi(s, a)$ is given by

$$d^\pi(s, a) = \frac{1-\gamma}{1-\gamma^T} \sum_{t=0}^{T-1}\gamma^t\, \mathbb{P}(s_t = s,\, a_t = a \mid \pi),$$

where $\gamma$ is a discount factor and $T$ is the trajectory length.
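For intuition, the occupancy measure can be estimated empirically from sampled trajectories. The following is a minimal Python sketch, assuming finite (hashable) states and actions; the function and variable names are illustrative, not drawn from any cited implementation.

```python
from collections import defaultdict

def empirical_occupancy(trajectories, gamma=0.95):
    """Estimate d^pi(s, a) from sampled trajectories of (state, action) pairs.

    Each trajectory contributes weight (1-gamma)/(1-gamma^T) * gamma^t at step t,
    matching the normalized discounted occupancy measure defined above.
    """
    occupancy = defaultdict(float)
    for traj in trajectories:
        T = len(traj)
        norm = (1.0 - gamma) / (1.0 - gamma ** T)
        for t, (s, a) in enumerate(traj):
            occupancy[(s, a)] += norm * gamma ** t
    n = len(trajectories)
    return {sa: w / n for sa, w in occupancy.items()}  # averages to a distribution over (s, a)

# Toy usage: two short trajectories over symbolic states/actions.
trajs = [[("s0", "ask"), ("s1", "answer")],
         [("s0", "ask"), ("s2", "call_tool"), ("s1", "answer")]]
print(empirical_occupancy(trajs, gamma=0.9))
```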

The DMPO loss, incorporating SAOM and discount-based length normalization, for a preference pair (preferred trajectory $\tau^w$ of length $T_w$ and dispreferred trajectory $\tau^l$ of length $T_l$) is:

$$L_{\mathrm{DMPO}} = -\mathbb{E}_{(s_0, \tau^w, \tau^l) \sim D}\, \log \sigma\!\left[ \sum_{t=0}^{T_w - 1} \phi(t, T_w) \log\frac{\pi_\theta(a_t^w \mid s_t^w)}{\pi_{ref}(a_t^w \mid s_t^w)} - \sum_{t=0}^{T_l - 1} \phi(t, T_l) \log\frac{\pi_\theta(a_t^l \mid s_t^l)}{\pi_{ref}(a_t^l \mid s_t^l)} \right],$$

where $\phi(t, T) = \frac{1 - \gamma^{T-t}}{1 - \gamma^T}$ is the length-discounting term (Shi et al., 21 Jun 2024). The critical theoretical advance over single-turn DPO is the elimination of the partition function's dependence on the state and consistent normalization across trajectories of unequal lengths, which mitigates both length bias and compounding distributional error.
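A minimal PyTorch-style sketch of this trajectory-level loss is given below, assuming per-step log-probabilities under the policy and the reference model have already been gathered for each trajectory; tensor names and the helper `phi` are illustrative rather than taken from the cited paper's code.

```python
import torch
import torch.nn.functional as F

def phi(T: int, gamma: float) -> torch.Tensor:
    """Length-discounting weights phi(t, T) = (1 - gamma^(T-t)) / (1 - gamma^T), t = 0..T-1."""
    t = torch.arange(T, dtype=torch.float32)
    return (1.0 - gamma ** (T - t)) / (1.0 - gamma ** T)

def dmpo_pair_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, gamma=0.95):
    """DMPO loss for one preference pair of trajectories.

    logp_w / ref_logp_w: per-step log pi(a_t | s_t) for the preferred trajectory, shape (T_w,).
    logp_l / ref_logp_l: the same for the dispreferred trajectory, shape (T_l,).
    """
    score_w = (phi(logp_w.numel(), gamma) * (logp_w - ref_logp_w)).sum()
    score_l = (phi(logp_l.numel(), gamma) * (logp_l - ref_logp_l)).sum()
    # Bradley-Terry preference on the weighted log-ratio margin.
    return -F.logsigmoid(score_w - score_l)

# Toy usage with fake per-step log-probabilities (preferred trajectory is longer).
loss = dmpo_pair_loss(-torch.rand(6), -torch.rand(6), -torch.rand(4), -torch.rand(4))
print(loss.item())
```

Note that the φ weights are computed separately for each trajectory's own length, which is what keeps pairs of unequal length comparable.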

In LLM-based recommendation and personalized ranking tasks, DMPO can further generalize to multi-negative sampling, with the objective:

$$\mathcal{L}_\text{DMPO}(\pi_{\theta}; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \left[\log \sigma \left(\beta\log \frac{\pi_{\theta}(y_w\mid x)}{\pi_{ref}(y_w\mid x)} - \frac{1}{k}\sum_{i=1}^{k}\beta \log \frac{\pi_{\theta}(y_{l,i} \mid x)}{\pi_{ref}(y_{l,i}\mid x)}\right)\right]$$

(Bai et al., 25 May 2024).
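The multi-negative objective reduces to an average over the k rejected responses inside the sigmoid; a hedged sketch follows, with sequence-level log-probabilities assumed precomputed and all names illustrative.

```python
import torch
import torch.nn.functional as F

def dmpo_multi_negative_loss(logp_w, ref_logp_w, logp_neg, ref_logp_neg, beta=0.1):
    """Multi-negative DMPO loss.

    logp_w / ref_logp_w:     sequence log-probs of the positive response, shape (B,).
    logp_neg / ref_logp_neg: sequence log-probs of the k negatives, shape (B, k).
    """
    pos_term = beta * (logp_w - ref_logp_w)          # (B,)
    neg_term = beta * (logp_neg - ref_logp_neg)      # (B, k)
    margin = pos_term - neg_term.mean(dim=-1)        # average the k negative log-ratios
    return -F.logsigmoid(margin).mean()

# Toy usage: batch of 2 prompts with 3 sampled negatives each.
loss = dmpo_multi_negative_loss(-torch.rand(2), -torch.rand(2),
                                -torch.rand(2, 3), -torch.rand(2, 3))
print(loss.item())
```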

2. Core Innovations over Single-Turn DPO

A. State-Action Occupancy Measure (SAOM) Constraints

DMPO replaces simple policy KL with occupancy measure constraints, capturing the visitation distribution over states and actions, and better aligning agent behavior with expert trajectories, especially for off-distribution states.

B. Length Normalization and Discounting

Multi-turn preference optimization introduces length and discount normalization into the Bradley-Terry model, addressing the problem where winning and losing trajectories often differ in length. This prevents longer trajectories (with more opportunities to accumulate log-probabilities) from dominating optimization and enables fairer comparison.

C. Hybrid and Multi-Preference Extensions

DMPO subsumes cases like Direct Multi-Preference Optimization, where multiple sub-preference aspects (helpfulness, honesty, etc.) each contribute their own pairwise labels, and a Preference Divergence (PD) term is introduced to quantify and control inter-aspect conflicts. The DMPO loss in this context is:

$$L_\mathrm{DMPO}(\theta) = - \mathbb{E}_{z \sim D}\left[ \log \sigma\big(M_\theta(z) + \Delta\phi_k(z)\big) \right],$$

where $\Delta\phi_k(z)$ reflects agreement or disagreement between sub-preferences (Zhang et al., 11 Aug 2025).
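In code, this reduces to a sigmoid loss over the usual implicit-reward margin shifted by the PD term; the sketch below assumes both quantities are computed upstream (per the cited paper) and only illustrates the final loss shape.

```python
import torch
import torch.nn.functional as F

def multipref_dmpo_loss(margin, pd_offset):
    """Multi-preference DMPO loss.

    margin:    M_theta(z), the DPO-style implicit-reward margin per example, shape (B,).
    pd_offset: Delta phi_k(z), the per-example Preference Divergence term, shape (B,)
               (its sign encodes agreement or conflict between sub-preference aspects).
    """
    return -F.logsigmoid(margin + pd_offset).mean()

# Toy usage: one example whose aspects agree (positive offset), one that conflicts (negative offset).
print(multipref_dmpo_loss(torch.tensor([0.8, 0.2]), torch.tensor([0.3, -0.4])).item())
```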

D. Trajectory-Level (MDP) Adaptations and Multi-Turn Environment Feedback

In applications such as tool-augmented mathematical reasoning or code agent construction, DMPO explicitly supports integration with deterministic environment feedback (e.g., code interpreter outputs), and the preference objective is computed only over agent-controlled tokens, not environmental messages (Xiong et al., 4 Sep 2024, Yu et al., 15 Sep 2025, Jung et al., 2 Apr 2025).

3. Empirical Results and Benchmark Performance

Experimental validation across both general agent tasks and specialized domains supports the empirical superiority of DMPO:

| Dataset/Task | Baseline | DMPO or Variant | Improvement |
|---|---|---|---|
| WebShop (avg. reward, generalization) | DPO, PPO, etc. | DMPO | Robust OOD gains |
| ScienceWorld, ALFWorld (multi-step reasoning) | DPO, PPO, RFT | DMPO | Length-bias robustness |
| MovieLens, Amazon recsys (few-shot, cross-domain) | SFT, classic baselines | DMPO | +8–13 AUC |
| GSM8K, MATH (math agents, pass@1) | SFT, DPO | Multi-Turn DPO | +5–7% pass rate |
| SWE-bench (code agents, solution coverage) | SFT, DPO, KTO | Entropic DMPO | SOTA open-weight |
| SOTOPIA (social intelligence, dialogue segments) | DPO, session-level DPO | SDPO | SOTA, surpasses GPT-4o |

DMPO consistently outperforms single-turn preference optimizers and RL (PPO, RFT) on multi-turn, out-of-distribution, and data-noise-prone tasks. It also provides strong generalization in few-shot or cross-domain scenarios and demonstrably mitigates compounding error due to early action divergence.

4. Comparative Table: Multi-Turn and Preference Extensions

| Method | Distinctive Mechanism | Multi-turn Capability | Key Advantage |
|---|---|---|---|
| DMPO | SAOM, length normalization | Yes (trajectory) | Robustness, OOD stability |
| SDPO | Segment-level selection | Yes (key segments in dialogue) | Noise reduction, finer credit |
| M-DPO | Masked trajectory-level DPO | Yes (multi-turn, tool) | Tool/environment incorporation |
| DiaTool-DPO | Dialogue-state trajectory DPO | Yes (tool-augmented dialogue) | Rejection/slot-filling control |
| Entropy-DMPO | Policy entropy regularization | Yes (code, multi-step tasks) | Diversity for test-time scaling |
| AgentQ, ETO | Exploration, process-level supervision | Yes | Adaptivity, discovery |
| MultiPref-DMPO | PD term, multi-aspect selection | Yes (aspect-based) | Robust multi-objective alignment |

5. Methodological Variants, Toolchains, and Practical Implementations

Trajectory Pairing and Data Construction

Preference data for DMPO requires matched pairs of entire trajectories. In dialogue and tool-use agents, negative trajectories are typically constructed by simulating suboptimal behaviors (e.g., skipping slot-filling), reordering actions, or applying synthetic perturbations. Recent works automate this with LLMs or tool interpreters, reducing reliance on human annotators (Jung et al., 2 Apr 2025).
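One simple way to organize such data is a per-pair record holding both trajectories as (observation, agent action) steps, plus a tag recording how the negative was produced. The dataclass below is purely illustrative and not a format prescribed by the cited works.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrajectoryPreferencePair:
    prompt: str
    chosen: List[Tuple[str, str]]    # (observation/state, agent action) per turn
    rejected: List[Tuple[str, str]]  # the dispreferred trajectory
    rejection_source: str            # e.g. "skipped_slot_filling", "reordered_actions", "llm_perturbation"

pair = TrajectoryPreferencePair(
    prompt="Book a table for two tonight.",
    chosen=[("user request", "ask for preferred time"), ("user says 19:00", "call booking tool")],
    rejected=[("user request", "call booking tool with no time")],  # skips slot filling
    rejection_source="skipped_slot_filling",
)
```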

Masking of Non-learnable Tokens

In multi-turn reasoning with external tools or system feedback (e.g., code interpreters), loss computation masks (i.e., excludes from optimization) user- or environment-generated tokens. This ensures that DMPO only optimizes over the agent's own action space, preventing contamination from non-controllable transitions (Xiong et al., 4 Sep 2024).
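A minimal sketch of this masking is shown below, assuming per-token role labels are available from the chat template; the role names and tensor layout are assumptions, not a specific library's API.

```python
import torch

def agent_token_mask(roles):
    """Build a 0/1 mask keeping only agent-generated tokens in the loss.

    roles: per-token role labels, e.g. "assistant", "user", or "tool".
    """
    return torch.tensor([1.0 if r == "assistant" else 0.0 for r in roles])

def masked_sequence_logprob(token_logps, roles):
    """Sum per-token log-probs over agent-controlled tokens only."""
    return (token_logps * agent_token_mask(roles)).sum()

# Toy usage: 6 tokens, the middle two coming from a code-interpreter message.
roles = ["assistant", "assistant", "tool", "tool", "assistant", "assistant"]
print(masked_sequence_logprob(-torch.rand(6), roles).item())
```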

Discount and Margins

The discount factor $\gamma$ and margin hyperparameters allow control over credit assignment: small $\gamma$ emphasizes early actions (reducing compounding errors if late steps are noisy), large $\gamma$ leverages whole-trajectory alignment if high-quality preference data is available.

Policy Entropy Augmentation and Test-Time Scaling

Entropy-regularized DMPO objectives are introduced to counteract "diversity collapse" in agentic tasks, especially for ensembles and test-time scaling (TTS) (Yu et al., 15 Sep 2025). Here, the objective is adapted to:

$$\mathbb{E}_{x,\tau}\big[\,u(x, y) + \lambda\, H(\pi(\cdot \mid x)) - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_{ref})\,\big]$$

This yields policies with greater output diversity for downstream trajectory selection and hybrid inference heuristics.
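A minimal sketch of the entropy term is below, computed from per-step action logits; the trajectory utility u(x, y) and the KL term are assumed to be handled by the surrounding preference objective, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def mean_policy_entropy(logits):
    """Average per-step policy entropy H(pi(.|x)) from action logits of shape (T, vocab)."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

def entropy_bonus(logits, lam=0.01):
    """Entropy bonus lambda * H(pi(.|x)) to add to the preference/utility objective."""
    return lam * mean_policy_entropy(logits)

# Toy usage: 8 decoding steps over a 32-token vocabulary.
print(entropy_bonus(torch.randn(8, 32)).item())
```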

6. Extensions: Multi-Preference, Mixture, and Multi-Reference DMPO

The DMPO framework is extensible in several orthogonal directions:

  • Multi-Preference/Aspect DMPO: Incorporates fine-grained, aspect-specific preference signals via PD terms and theoretically grounded data selection principles for DPO training (Zhang et al., 11 Aug 2025).
  • Mixture-of-Experts and Multi-Reference DMPO: Adopts mixture policy frameworks (Mix/MoE-DPO, MRPO) using a virtual or gating-weighted reference model to enable modular, user- or context-adaptive alignment—crucial in multi-task and heterogeneous annotation settings (Bohne et al., 9 Oct 2025, Le et al., 26 May 2024).
  • Segment-Level DMPO: Optimizes not on whole trajectories but on dynamically selected critical segments (SDPO), balancing between overly local turn-level or noisy session-level learning (Kong et al., 3 Jan 2025).

7. Outstanding Challenges and Future Research Directions

Several open problems remain:

  • Data Efficiency and Preference Collection: High-quality multi-turn preference annotation is resource-intensive. Solutions include active learning, online agents, and synthetic or self-critique-based trajectory generation.
  • Safety and Robustness: Multi-turn agents are susceptible to model drift and unsafe behaviors over extended interactions. Dedicated safety constraints and segment-level correction are under exploration.
  • Interpretability: Decomposable reward/provenance tracking and aspect-wise preference signals are needed for multi-dimensional, interactive settings.
  • Multi-Modal and Real-World Integration: Extending DMPO to multimodal and embodied agents is nascent, with state-action occupancy and modular objectives as promising scaffolds (Liu et al., 12 Mar 2025).

DMPO, in its multiple instantiations and generalizations, forms the backbone for robust multi-turn preference alignment in contemporary LLM-based agent systems. Its emphasis on MDP-theoretic loss construction, length and entropy normalization, and segmental/occupancy constraints underpins both its theoretical soundness and its strong empirical performance across diverse real-world deployment scenarios.
