
MALT: Improving Reasoning with Multi-Agent LLM Training (2412.01928v1)

Published 2 Dec 2024 in cs.LG and cs.AI

Abstract: Enabling effective collaboration among LLMs is a crucial step toward developing autonomous systems capable of solving complex problems. While LLMs are typically used as single-model generators, where humans critique and refine their outputs, the potential for jointly-trained collaborative models remains largely unexplored. Despite promising results in multi-agent communication and debate settings, little progress has been made in training models to work together on tasks. In this paper, we present a first step toward "Multi-agent LLM training" (MALT) on reasoning problems. Our approach employs a sequential multi-agent setup with heterogeneous LLMs assigned specialized roles: a generator, verifier, and refinement model iteratively solving problems. We propose a trajectory-expansion-based synthetic data generation process and a credit assignment strategy driven by joint outcome based rewards. This enables our post-training setup to utilize both positive and negative trajectories to autonomously improve each model's specialized capabilities as part of a joint sequential system. We evaluate our approach across MATH, GSM8k, and CQA, where MALT on Llama 3.1 8B models achieves relative improvements of 14.14%, 7.12%, and 9.40% respectively over the same baseline model. This demonstrates an early advance in multi-agent cooperative capabilities for performance on mathematical and common sense reasoning questions. More generally, our work provides a concrete direction for research around multi-agent LLM training approaches.

Improving Reasoning with Multi-Agent LLM Training: A Comprehensive Overview

The paper presents Multi-Agent LLM Training (MALT), an innovative approach to enhance reasoning capabilities within collaborative systems of LLMs. This research is situated within the broader context of machine learning and artificial intelligence, where autonomous systems increasingly rely on inter-model cooperation to solve complex tasks effectively. Historically, single-model techniques have dominated the application of LLMs, where outputs are primarily scrutinized and refined by humans. However, the paper delineates a distinct pathway, emphasizing the collaborative training of heterogeneous LLMs in a multi-agent setup to achieve superior reasoning performance.

Methodology and Approach

MALT introduces a multi-agent framework comprising three LLMs with specialized roles: a generator, a verifier, and a refinement model. The paper adopts a trajectory-expansion methodology that generates synthetic training data through sequential multi-agent interactions: each role branches into multiple candidate outputs, expanding reasoning trajectories into large quantities of outcome-labeled data. This data feeds a post-training process combining Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), applied not only to the generator but also to the verifier and refinement models, allowing each to better identify errors and produce more accurate outputs.
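
To make the pipeline concrete, the sketch below shows one way the three-role interaction and trajectory expansion could be wired together. It is a minimal illustration under stated assumptions, not the paper's implementation: the role prompts, the prompt-to-completion `LLM` interface, the branching factor `n`, and the `Trajectory` container are all introduced here for exposition.

```python
# Minimal sketch of a MALT-style sequential pipeline with trajectory
# expansion. The role prompts, LLM interface, and branching factor are
# illustrative assumptions, not the paper's exact implementation.
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # prompt -> completion

@dataclass
class Trajectory:
    question: str
    generation: str
    verification: str
    refinement: str
    correct: bool

def expand_trajectories(
    question: str,
    answer: str,
    generator: LLM,
    verifier: LLM,
    refiner: LLM,
    is_correct: Callable[[str, str], bool],
    n: int = 3,
) -> List[Trajectory]:
    """Branch n ways at each of the three roles, yielding n**3 trajectories."""
    trajectories: List[Trajectory] = []
    for _ in range(n):
        g = generator(f"Solve step by step:\n{question}")
        for _ in range(n):
            v = verifier(f"Critique this attempt:\nQ: {question}\nA: {g}")
            for _ in range(n):
                r = refiner(
                    f"Given the critique, produce a final answer:\n"
                    f"Q: {question}\nAttempt: {g}\nCritique: {v}"
                )
                trajectories.append(
                    Trajectory(question, g, v, r, is_correct(r, answer))
                )
    return trajectories
```

With a branching factor of n at each role, a single question yields n³ outcome-labeled trajectories, which is what makes the synthetic dataset large enough to support both SFT and DPO.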

A critical component of MALT is its credit assignment strategy, anchored in a joint outcome-based reward system. Success and failure signals from the final answer are propagated backward through the sequential reasoning chain, shaping the training signal each agent receives in its respective role. By using both positive and negative trajectories, MALT enables automatic improvement without human intervention in trajectory selection or hand-designed value functions.
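
The sketch below illustrates one plausible reading of this outcome-driven credit assignment: each upstream output is scored by the fraction of final answers descended from it that are correct, and the resulting positive and negative labels then seed SFT targets and DPO preference pairs. The 0.5 threshold and the pairing rule are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of outcome-based credit assignment over expanded
# trajectories. Inputs are (upstream output, final-answer correct?) pairs
# for one question; the threshold and preference-pairing rule below are
# assumptions for illustration.
from collections import defaultdict
from typing import Dict, List, Tuple

def credit_assign(
    samples: List[Tuple[str, bool]],
    threshold: float = 0.5,
) -> Tuple[Dict[str, bool], List[Tuple[str, str]]]:
    # Group final outcomes under each upstream output (e.g. each generation).
    outcomes: Dict[str, List[bool]] = defaultdict(list)
    for output, correct in samples:
        outcomes[output].append(correct)

    # Label an output positive if enough of its descendants succeed.
    labels = {
        output: sum(results) / len(results) >= threshold
        for output, results in outcomes.items()
    }

    # Positive outputs can serve as SFT targets; (positive, negative)
    # pairs for the same question can serve as DPO preference pairs.
    positives = [o for o, ok in labels.items() if ok]
    negatives = [o for o, ok in labels.items() if not ok]
    dpo_pairs = [(p, n) for p in positives for n in negatives]
    return labels, dpo_pairs
```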

Experimental Validation

Extensive experimental validation is provided on three well-regarded benchmarks: MATH, GSM8K, and CSQA. On these datasets, MALT with Llama 3.1 8B models achieves relative improvements of 14.14%, 7.12%, and 9.40%, respectively, over the same baseline model. These results are compelling, especially when compared with the incremental gains typically seen from single-agent refinements. The improvements on both mathematical and commonsense reasoning tasks reflect MALT's ability to leverage multi-agent cooperation effectively within complex reasoning scenarios.

Implications and Future Directions

The implications of this paper are multifaceted. Practically, the integration of MALT offers a pathway to develop more nuanced and cooperative AI systems capable of autonomous problem-solving with minimal external supervision. Theoretically, it opens avenues for further exploration of collaborative multi-agent systems, particularly in refining credit assignment strategies and trajectory management. Moreover, the emphasis on trajectory-expansion points towards scalable methods of data generation in AI training workflows.

Potential future developments could include adaptive thresholding mechanisms in credit assignment to further improve sample efficiency. Iterating these techniques in dynamic environments, where distributed agents interact in real time, could also advance the robustness and adaptability of AI systems in real-world applications. Expanding MALT's applicability to larger distributed systems and exploring its integration with other reinforcement learning paradigms presents an equally intriguing research frontier.

In summary, MALT presents a structured, methodologically sound addition to the toolkit for LLM performance enhancement, offering robust empirical results supporting the efficacy of multi-agent training. As AI continues to evolve, frameworks like MALT will likely become foundational in developing systems that are more capable of mimicking sophisticated human-like collaborative problem-solving.

Authors (9)
  1. Sumeet Ramesh Motwani (4 papers)
  2. Chandler Smith (6 papers)
  3. Rocktim Jyoti Das (10 papers)
  4. Markian Rybchuk (1 paper)
  5. Philip H. S. Torr (219 papers)
  6. Ivan Laptev (99 papers)
  7. Fabio Pizzati (22 papers)
  8. Ronald Clark (42 papers)
  9. Christian Schroeder de Witt (49 papers)