
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (2503.15478v1)

Published 19 Mar 2025 in cs.LG

Abstract: LLM agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT4-o in realistic collaborative content creation.

SWEET-RL: Enhancing Multi-Turn Cooperative Reasoning in LLM Agents

The paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" presents a novel approach to refining the capabilities of LLMs when functioning as agents in multi-turn collaborative reasoning tasks. This paper specifically addresses challenges inherent in credit assignment over extended interactions, leveraging the reasoning and generalization strengths of LLMs. Traditionally, reinforcement learning (RL) algorithms for multi-turn interactions face difficulties in effectively evaluating and optimizing LLMs across multiple turns, often resulting in suboptimal decision-making due to poor credit assignment and high variance in long-horizon tasks.

The Proposed Solution: SWEET-RL

The authors introduce a reinforcement learning algorithm named SWEET-RL (RL with Step-WisE Evaluation from Training-time information) that incorporates training-time information unavailable to the policy model during execution. In contrast to existing methods, it trains a critic model to assess step-level actions more accurately and to provide granular rewards to the policy model. This objective improves the actor model's decision-making, yielding substantial gains in success and win rates on the proposed ColBench benchmark, which simulates realistic collaborative tasks in backend programming and frontend design.
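
To make this asymmetric setup concrete, below is a minimal sketch of the information each model sees; the class and method names are illustrative placeholders, not the paper's code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PolicyObservation:
    """What the actor sees at every turn: only the dialogue so far."""
    task_prompt: str
    interaction_history: List[str] = field(default_factory=list)


@dataclass
class CriticObservation(PolicyObservation):
    """What the critic additionally sees, but only during training."""
    reference_solution: str = ""   # ground-truth artifact (e.g. reference code)
    final_outcome: float = 0.0     # end-of-trajectory success signal


def step_reward(critic, obs: CriticObservation, action: str) -> float:
    """Step-level reward for the policy, produced by the trained critic.

    The policy never sees `reference_solution` or `final_outcome`; the
    critic uses them to judge how much `action` advances the final result.
    `critic.advantage` is a hypothetical interface for this sketch.
    """
    return critic.advantage(obs, action)
```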

Benchmark and Experimental Results

ColBench serves as a distinctive evaluation tool, introducing diverse and complex tasks while minimizing engineering overhead, which is crucial for fast research prototyping. The benchmark captures realistic scenarios in which an LLM agent collaborates with a simulated human partner to produce complex outputs such as code and design artifacts. With over 10,000 training tasks, ColBench provides diverse exposure and challenges LLM agents in realistic contexts without overfitting. SWEET-RL demonstrated an absolute improvement of 6% in success and win rates on ColBench compared to other state-of-the-art algorithms, enabling the Llama-3.1-8B agent to rival or surpass the proprietary model GPT4-o.
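
As a rough illustration of how such an episode might be run and scored, the sketch below assumes an LLM-backed human simulator that can consult the hidden reference solution and a task object exposing hypothetical `is_final_solution` and `score` helpers; none of these names come from the benchmark's actual API.

```python
def run_episode(agent, human_simulator, task, max_turns: int = 10) -> float:
    """Roll out one agent-human collaboration and score the final artifact.

    `agent` and `human_simulator` are callables returning text; the
    simulator is assumed to consult the hidden reference solution when
    answering, mirroring a knowledgeable human collaborator.
    """
    history = [task.description]
    artifact = None
    for _ in range(max_turns):
        message = agent(history)            # clarifying question or solution attempt
        history.append(f"AGENT: {message}")
        if task.is_final_solution(message):  # agent decides it is done
            artifact = message
            break
        reply = human_simulator(history, task.reference_solution)
        history.append(f"HUMAN: {reply}")

    if artifact is None:
        return 0.0
    # Backend tasks: e.g. fraction of hidden unit tests passed.
    # Frontend tasks: e.g. similarity between the rendered page and the reference design.
    return task.score(artifact)
```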

Mechanisms and Theoretical Insight

SWEET-RL's success stems from several key innovations:

  • Critic Model Design: During training, the critic has access to the final outcome and the reference solution, and it is trained with a Bradley-Terry objective to perform robust credit assignment. This asymmetric information access enables precise evaluation of each action's relative utility at any stage of the interaction.
  • Advantage Function Learning: SWEET-RL directly trains the advantage function, avoiding the pitfalls of fitting a full value function that may generalize poorly to unseen tasks given limited fine-tuning samples.
  • Parameterization: The advantage is parameterized as the mean log probability of the action's tokens, which aligns with LLM architectures pre-trained on next-token prediction and leverages their generalization strengths (see the sketch after this list).
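
The following PyTorch sketch illustrates one plausible form of these two ingredients, under assumptions about tensor shapes and masking; whether a frozen reference model also enters the parameterization is omitted here, and the code is not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def mean_logprob_advantage(token_logprobs: torch.Tensor,
                           action_mask: torch.Tensor) -> torch.Tensor:
    """Per-turn advantage parameterized as the mean log-probability of the
    action's tokens under the critic LLM.

    token_logprobs: (batch, turns, tokens) log p(token) for every token.
    action_mask:    (batch, turns, tokens) 1 for action tokens, 0 for padding.
    Returns:        (batch, turns) one scalar advantage per turn.
    """
    summed = (token_logprobs * action_mask).sum(dim=-1)
    counts = action_mask.sum(dim=-1).clamp(min=1)
    return summed / counts


def bradley_terry_loss(adv_chosen: torch.Tensor,
                       adv_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry objective: the trajectory with the better final outcome
    should accumulate higher step-level advantages than the worse one.

    adv_chosen / adv_rejected: (batch, turns) advantages from
    `mean_logprob_advantage` for the preferred / dispreferred trajectory.
    """
    margin = adv_chosen.sum(dim=-1) - adv_rejected.sum(dim=-1)
    return -F.logsigmoid(margin).mean()
```

In this hedged reading, the trajectory-level preference (derived from training-time outcome information) supervises a sum of per-turn advantages, so the learned critic can afterwards hand out step-level rewards to the policy.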

Implications and Future Directions

The implications of SWEET-RL extend both practically and theoretically. On a practical level, this methodology significantly enhances the ability of LLMs to perform complex multi-turn interactions, potentially improving human productivity in collaborative tasks. From a theoretical perspective, SWEET-RL opens pathways to optimizing credit assignment in partially observable environments, providing a framework for integrating supervised learning with training-time information in reinforcement learning tasks.

Future research may explore refining credit assignment mechanisms and leveraging training-time data across different domains. Further exploration might include expanding SWEET-RL's framework for use in other complex collaborative scenarios, thereby enhancing its versatility and robustness. Additionally, understanding its alignment with various model architectures and the impact of different scales of training data could broaden its applicability and efficiency.

In conclusion, the paper offers a significant advancement in multi-turn reinforcement learning for LLM agents, showcasing a strategic blend of innovative algorithmic configurations and robust benchmarking to amplify the reasoning abilities of collaborative LLM agents.

Authors (7)
  1. Yifei Zhou (24 papers)
  2. Song Jiang (66 papers)
  3. Yuandong Tian (128 papers)
  4. Jason Weston (130 papers)
  5. Sergey Levine (531 papers)
  6. Sainbayar Sukhbaatar (53 papers)
  7. Xian Li (115 papers)