Direct Multi-Turn Preference Optimization for Language Agents (2406.14868v4)

Published 21 Jun 2024 in cs.CL and cs.LG

Abstract: Adapting LLMs for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss. The code is available at https://github.com/swt-user/DMPO.

Introduction

Direct Multi-Turn Preference Optimization (DMPO) is a novel loss function designed to adapt LLMs for complex, multi-step agent tasks (Shi et al., 21 Jun 2024). Developing language agents capable of coherent long-term planning and interaction is challenging. While Direct Preference Optimization (DPO) offers advantages over traditional Reinforcement Learning (RL) methods by directly optimizing based on preference data, its application to multi-turn scenarios is hindered by theoretical and practical obstacles. DMPO specifically addresses these challenges, enabling more effective training of language agents for tasks requiring sequential decision-making.

1. Background and Motivation

LLMs excel at various language tasks, making them promising candidates for controlling agents in interactive environments. However, adapting LLMs for agent tasks using standard RL techniques like PPO often faces issues with stability, sample complexity, reward engineering, and compounding errors, especially in long multi-turn interactions.

DPO emerged as a simpler and more stable alternative, directly optimizing policies using pairwise preference data (e.g., preferred vs. dis-preferred trajectories) without explicit reward modeling. This bypasses complex reward shaping and can reduce compounding errors. However, the standard DPO formulation relies on assumptions that break down in multi-turn settings.

2. Challenges in Multi-Turn Preference Optimization

Applying DPO directly to multi-turn agent tasks encounters two primary challenges:

  • State-Dependent Partition Function: In the standard DPO derivation for single-turn tasks, the partition function (a normalization term in the policy distribution) cancels out, simplifying the optimization. In multi-turn scenarios, the state evolves, making the partition function state-dependent. This dependence prevents the straightforward cancellation, complicating the direct optimization objective.
  • Trajectory Length Disparities: Preferred and dis-preferred trajectories in multi-turn tasks often have different lengths. Standard preference models like the Bradley-Terry model, used in DPO, can be biased by these length differences. For instance, a model might favor shorter or longer trajectories irrespective of quality, simply because of how rewards accumulate or how probabilities are calculated over sequences of different lengths.

The standard DPO loss is defined as:

$$L_{DPO}(\theta) = -E_{(x, y_w, y_l) \sim D}\Big[\log \sigma\Big(\beta \big(\log \pi_\theta(y_w|x) - \log \pi_\theta(y_l|x) - \log \pi_{ref}(y_w|x) + \log \pi_{ref}(y_l|x)\big)\Big)\Big]$$

where $y_w$ is the preferred (winning) and $y_l$ the dis-preferred (losing) trajectory, $\pi_\theta$ is the policy, $\pi_{ref}$ is the reference policy, and $\beta$ scales the implicit reward. The inability to cancel the partition function (implicit in the $\log \pi$ terms) and the varying lengths of $y_w$ and $y_l$ necessitate modifications for multi-turn settings.
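
For concreteness, the following is a minimal PyTorch-style sketch of this loss; the function name and tensor arguments are illustrative assumptions, not the interface of the paper's released code. It also makes the length bias visible: the inputs are summed token log-probabilities, so longer trajectories systematically contribute larger-magnitude terms.

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard single-turn DPO loss for a batch of preference pairs.

    Each argument is a 1-D tensor of *summed* token log-probabilities:
      logp_w, logp_l         -- log pi_theta(y_w|x), log pi_theta(y_l|x)
      ref_logp_w, ref_logp_l -- the same quantities under the frozen reference policy
    Because the sums grow with sequence length, trajectories of different lengths
    are not compared on equal footing, which is the length bias DMPO targets.
    """
    # Implicit reward margin: beta * [(log-ratio of y_w) - (log-ratio of y_l)]
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-likelihood of the Bradley-Terry preference probability sigma(margin)
    return -F.logsigmoid(margin).mean()
```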

3. DMPO Methodology

DMPO introduces two key modifications to address the challenges of multi-turn preference optimization:

  • State-Action Occupancy Measure (SAOM) Constraint: Instead of the standard policy constraint used in deriving DPO, DMPO utilizes a constraint based on the state-action occupancy measure. This measure represents the distribution of state-action pairs visited by a policy. Constraining the SAOM encourages the learned policy to stay closer to the distribution of states and actions seen in the preference data (often implicitly representing expert behavior), which helps mitigate compounding errors, especially when encountering novel states during long interactions.
  • Length Normalization in Bradley-Terry Model: To handle trajectory length disparities, DMPO incorporates length normalization directly into the preference probability calculation. The Bradley-Terry model probability is adjusted:

    $$p_\theta(y_w > y_l | x) = \frac{\exp\big(\beta \frac{r_\theta(x, y_w)}{L_w}\big)}{\exp\big(\beta \frac{r_\theta(x, y_w)}{L_w}\big) + \exp\big(\beta \frac{r_\theta(x, y_l)}{L_l}\big)}$$

    where $r_\theta(x, y)$ is the implicit reward associated with trajectory $y$ given context $x$, and $L_w$ and $L_l$ are the lengths of the preferred and dis-preferred trajectories, respectively. This normalization ensures a fairer comparison by considering the reward per step, making the preference less sensitive to absolute trajectory length (a minimal code sketch of this computation follows this list).
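
As referenced above, here is a minimal sketch of the length-normalized Bradley-Terry probability, assuming the implicit trajectory-level rewards $r_\theta(x, y)$ are already available as tensors; the function and argument names are illustrative.

```python
import torch

def length_normalized_preference(r_w, r_l, len_w, len_l, beta=0.1):
    """Length-normalized Bradley-Terry probability p(y_w > y_l | x).

    r_w, r_l     -- implicit trajectory-level rewards r_theta(x, y_w), r_theta(x, y_l)
    len_w, len_l -- turn counts L_w and L_l of the two trajectories
    Dividing each reward by its trajectory length compares reward per step,
    so the preference is not dominated by absolute trajectory length.
    """
    logits = torch.stack([beta * r_w / len_w, beta * r_l / len_l], dim=-1)
    # Softmax over the two candidates; index 0 is the probability that y_w wins.
    return torch.softmax(logits, dim=-1)[..., 0]
```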

Combining these elements, the DMPO loss function aims to maximize the log-likelihood of the length-normalized preference probabilities under the SAOM constraint:

$$L_{DMPO}(\theta) = -E_{(x, y_w, y_l) \sim D}\left[\log \frac{\exp\big(\beta \frac{r_\theta(x, y_w)}{L_w}\big)}{\exp\big(\beta \frac{r_\theta(x, y_w)}{L_w}\big) + \exp\big(\beta \frac{r_\theta(x, y_l)}{L_l}\big)}\right]$$

The paper (Shi et al., 21 Jun 2024) shows theoretically that this length normalization allows the partition function to become independent of the current state under certain assumptions, resolving the primary obstacle in applying DPO to multi-turn tasks. Furthermore, DMPO reweights state-action pairs using a discount function $\phi(t, T)$ that prioritizes earlier steps in trajectories, potentially further stabilizing learning.
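
Putting the pieces together, the sketch below shows one plausible way to compute a DMPO-style loss for a single preference pair from per-turn log-probability ratios. The choice $\phi(t, T) = \gamma^t$ and the division by the raw turn count are assumptions made for illustration; consult the released code at https://github.com/swt-user/DMPO for the exact weighting.

```python
import torch
import torch.nn.functional as F

def dmpo_loss(step_logratios_w, step_logratios_l, beta=0.1, gamma=0.95):
    """Sketch of a DMPO-style loss for one preference pair.

    step_logratios_w / step_logratios_l hold, for each turn t of the preferred /
    dis-preferred trajectory, log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t).
    The discount phi(t, T) = gamma**t (an assumed form) prioritizes earlier turns,
    and dividing by the turn count gives the length-normalized implicit reward
    that enters the Bradley-Terry comparison.
    """
    def normalized_reward(step_logratios):
        T = step_logratios.shape[0]
        phi = gamma ** torch.arange(T, dtype=step_logratios.dtype)  # phi(t, T) weights
        return (phi * step_logratios).sum() / T                     # reward per step

    margin = beta * (normalized_reward(step_logratios_w) - normalized_reward(step_logratios_l))
    # log sigma(a - b) equals the log of the two-way softmax probability of y_w
    return -F.logsigmoid(margin)
```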

4. Implementation and Experimental Setup

DMPO's effectiveness was evaluated empirically across several multi-turn agent task datasets.

  • Datasets: Experiments were conducted on three multi-turn agent datasets: WebShop (simulated e-commerce tasks), ScienceWorld (text-based science-environment interaction), and ALFWorld (simulated household tasks driven by language instructions). These datasets represent diverse challenges in multi-turn interaction, planning, and language understanding.
  • Baselines: DMPO was compared against a range of baseline methods, most notably standard DPO.
  • Implementation Details: The experiments used 7B-parameter LLMs (e.g., Llama-family models) as the base policy. Training involved standard optimization techniques (e.g., the Adam optimizer) with typical hyperparameters (learning rate, batch size) tuned for each dataset. Preference data consisted of pairs of winning (preferred) and losing (dis-preferred) trajectories; a hypothetical data-container sketch follows this list.
  • Evaluation Metrics: Performance was measured using task-specific metrics such as success rate, task completion rate, score, or reward obtained by the agent.
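
As a concrete illustration of the preference-data format mentioned above, a minimal container for one training example might look like the following; the class and field names are hypothetical, not the schema used in the released repository.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrajectoryPreferencePair:
    """One multi-turn preference example: a shared context plus two trajectories."""
    prompt: str                # task instruction / initial observation
    winning_turns: List[str]   # actions of the preferred trajectory, one entry per turn
    losing_turns: List[str]    # actions of the dis-preferred trajectory

    @property
    def lengths(self):
        # Turn counts L_w and L_l used for length normalization in the DMPO loss
        return len(self.winning_turns), len(self.losing_turns)
```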

5. Results and Analysis

The empirical evaluation yielded several key findings:

  • Performance: DMPO consistently outperformed baseline methods, including standard DPO, across the evaluated datasets. It achieved state-of-the-art results in clean data settings and demonstrated particular robustness in noisy preference data settings.
  • Robustness: DMPO showed greater robustness to noise, especially regarding the length of the losing trajectories. The length normalization component makes DMPO less sensitive to length disparities compared to standard DPO.
  • Ablation Studies: Analyses confirmed the importance of both the SAOM constraint and length normalization. The studies also investigated the impact of the discount factor ($\gamma$) used in reweighting steps, finding that smaller values (prioritizing earlier steps) were beneficial in noisy settings, while larger values were better in clean settings.
  • Theoretical Justification: The paper provides theoretical arguments supporting the DMPO formulation. It demonstrates how length normalization addresses the partition function issue and how the SAOM constraint helps mitigate compounding errors more effectively than traditional policy constraints. It also shows that DMPO converges to the standard single-turn DPO loss under specific conditions (e.g., discount factor approaching zero).

6. Limitations and Future Work

Despite promising results, the paper acknowledges some limitations and suggests future research directions:

  • Task Formulation: The experiments primarily used a turn-wise task formulation, which can lead to sparse rewards or feedback signals. Exploring DMPO with denser feedback mechanisms could be beneficial.
  • Model Scale and Datasets: Experiments were conducted mainly on 7B parameter models and simulated datasets. Scaling DMPO to larger models and evaluating it on more complex, real-world tasks remains an important next step.
  • Computational Cost: The computational requirements for training with DMPO compared to other methods were not explicitly detailed but could be a factor in practical deployments.
  • Alternative Normalization: Exploring alternative methods for length normalization or handling trajectory length disparities could further improve performance.
  • Integration with Other Techniques: Combining DMPO with techniques like memory augmentation or hierarchical RL might lead to more capable language agents for even longer and more complex tasks.

7. Conclusion

DMPO offers a principled and effective approach for adapting LLMs to multi-turn agent tasks using preference data (Shi et al., 21 Jun 2024). By introducing state-action occupancy measure constraints and length normalization, DMPO overcomes key limitations of standard DPO in sequential settings. The strong empirical results and theoretical grounding suggest that DMPO is a valuable tool for developing more robust and capable language agents for complex, interactive scenarios that require long-term planning.

Authors (5)
  1. Wentao Shi (22 papers)
  2. Mengqi Yuan (2 papers)
  3. Junkang Wu (19 papers)
  4. Qifan Wang (129 papers)
  5. Fuli Feng (143 papers)