Papers
Topics
Authors
Recent
Search
2000 character limit reached

Direct Multi-Turn Preference Optimization for Language Agents

Published 21 Jun 2024 in cs.CL and cs.LG | (2406.14868v5)

Abstract: Adapting LLMs for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss. The code is available at https://github.com/swt-user/DMPO.

Summary

  • The paper presents DMPO that directly optimizes reinforcement learning objectives by replacing policy constraints with state-action occupancy measures in multi-turn tasks.
  • It incorporates length normalization into the Bradley-Terry model, balancing trajectory lengths and mitigating compounding errors in noisy and clean settings.
  • Experimental results show DMPO outperforms traditional methods like DPO and PPO, achieving higher rewards and more robust agent performance.

Direct Multi-Turn Preference Optimization for Language Agents

The paper "Direct Multi-Turn Preference Optimization for Language Agents" presents a novel loss function, DMPO, which directly optimizes reinforcement learning objectives in multi-turn scenarios. This study specifically addresses the challenges in applying Direct Preference Optimization (DPO) to multi-turn tasks, focusing on overcoming the inability to cancel the partition function and addressing length disparities between preferred and dis-preferred trajectories.

Introduction and Motivations

Developing language agents capable of solving complex tasks has emerged as a key objective within the AI community. While Behavioral Cloning (BC) techniques offer quick adaptation of LLMs to agent tasks, they are prone to compounding errors due to the accumulation of minor deviations in non-deterministic environments. In contrast, Direct Preference Optimization (DPO) excels in single-turn preference alignment but exhibits suboptimal performance in multi-turn settings due to its reliance on state-dependent partition functions.

The study introduces DMPO as a direct optimization method for reinforcement learning objectives in multi-turn scenarios. The key innovation involves replacing the policy constraint with a state-action occupancy measure constraint and incorporating length normalization into the Bradley-Terry model. This approach effectively addresses the partition function dependency on the current state, thereby facilitating more robust adaptation to multi-turn agent tasks.

Methodology

State-Action Occupancy Measure (SAOM)

The discounted state-action occupancy measure dπ(s,a)d^\pi(s, a) characterizes the likelihood of state-action pairs encountered under a given policy π\pi. By utilizing the SAOM constraint instead of policy constraints, DMPO mitigates compounding errors more effectively. The SAOM constraint focuses on distributions of state-action pairs that approximate expert trajectories, thereby facilitating error correction when deviating from expert paths. Figure 1

Figure 1: Illustration of expert trajectories and trajectories learned under the constraints of policy and state-action occupancy measure.

Derivation of DMPO Loss

DMPO directly optimizes the RL objective by canceling the partition function through SAOM constraints. The new RL objective is formulated as: $\max_{\pi_\theta} \mathbb{E}_{(s,a)\sim d^{\pi_\theta}(s,a)}[r(s,a)] - \beta \mathbb{D}_{KL}[d^{\pi_\theta}(s,a)||d^{\pi_{ref}(s,a)]$ Utilizing the length normalization technique, DMPO extends DPO to multi-turn scenarios by adjusting weights on state-action pairs based on a discount function Ï•(t,T)\phi(t,T). This reweighting is critical for balancing trajectory lengths in preference learning. Figure 2

Figure 2: Illustration of DMPO loss, which directly optimizes the RL objective by maximizing the likelihood of the preferred trajectory over the dispreferred trajectory.

Experiments and Results

Robustness in Noisy Settings

In experiments conducted under noisy conditions, DMPO demonstrated superior robustness compared to DPO by effectively mitigating the effects of noise through early-stage preference prioritization.

Performance in Clean Settings

Experiments with high-quality preference data revealed that DMPO consistently achieved higher rewards than traditional methods like PPO and SFT, particularly due to its ability to reduce compounding errors and optimize multi-turn trajectories effectively.

Hyperparameter and Length Impact

Tuning the γ\gamma hyperparameter showed that DMPO's ability to adapt importance weights on state-action pairs significantly enhances its performance, especially in clean settings where longer trajectories are available. Figure 3

Figure 3: The effect of hyperparameter gamma on the relative performance of the model trained with DMPO loss on the WebShop dataset in both noisy and clean settings.

Figure 4

Figure 4: The effect of "loss" trajectories length on the performance of the model trained with DPO and DMPO loss in the noisy setting on ScienceWorld. The base model is Mistral-7B-Instruct-v0.2.

Conclusion

The DMPO loss function effectively addresses compounding errors and partition function dependencies in multi-turn agent tasks, offering a significant improvement over previous single-turn optimization methods. Its application in structured trajectory preference optimization suggests potential for broader AI advancements, particularly in dynamic sequence modeling and interactive agent systems. Future research could explore larger models and datasets to further validate its effectiveness.

This paper highlights the theoretical basis and empirical support for the DMPO's robust multi-turn optimization capability, marking a substantial advancement in preference-based reinforcement learning for language agents.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.