Improving Factuality and Reasoning in Language Models through Multiagent Debate (2305.14325v1)

Published 23 May 2023 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: LLMs have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through the tools of prompting, ranging from verification, self-consistency, or intermediate scratchpads. In this paper, we present a complementary approach to improve language responses where multiple LLM instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer. Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks. We also demonstrate that our approach improves the factual validity of generated content, reducing fallacious answers and hallucinations that contemporary models are prone to. Our approach may be directly applied to existing black-box models and uses identical procedure and prompts for all tasks we investigate. Overall, our findings suggest that such "society of minds" approach has the potential to significantly advance the capabilities of LLMs and pave the way for further breakthroughs in language generation and understanding.

This paper introduces a novel approach called multiagent debate to improve the factuality and reasoning capabilities of LLMs. Instead of relying on a single instance of an LLM to generate an answer, this method involves multiple instances of the same or different LLMs debating with each other over several rounds to arrive at a final consensus answer. The core idea is inspired by the "society of mind" concept, where multiple perspectives and reasoning processes interact and refine each other.

The practical implementation of the multiagent debate process is straightforward and can be applied to existing black-box LLMs through their API interfaces. The procedure works as follows (a minimal code sketch appears after the list):

  1. Initial Generation: Given a query or problem, multiple independent instances of an LLM (agents) are prompted to generate their initial responses and reasoning processes.
  2. Debate Rounds: In each round of debate, each agent receives the responses and reasoning of all other agents from the previous round. A "consensus prompt" instructs the agent to critique the other responses and update its own answer based on this collective feedback.
  3. Convergence: This process is repeated for several rounds. Empirically, the paper finds that the agents tend to converge on a single, often more accurate, answer.
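
The sketch below illustrates this loop in Python. It assumes a hypothetical `generate(messages)` helper standing in for any chat-style, black-box LLM API call; the function names, prompt wording, and default parameters are illustrative rather than the paper's released code.

```python
# Minimal sketch of the multiagent debate loop, assuming a hypothetical
# `generate(messages)` helper that sends a chat history to a black-box LLM
# and returns the reply text; this is not the paper's released code.

def generate(messages):
    """Placeholder for a chat-style LLM API call (e.g., over HTTP)."""
    raise NotImplementedError


def multiagent_debate(question, num_agents=3, num_rounds=2):
    # 1. Initial generation: each agent answers the question independently.
    contexts = [[{"role": "user", "content": question}] for _ in range(num_agents)]
    answers = [generate(ctx) for ctx in contexts]

    # 2. Debate rounds: each agent sees the other agents' latest answers and is
    #    asked (via a consensus prompt) to critique them and update its own answer.
    for _ in range(num_rounds):
        new_answers = []
        for i, ctx in enumerate(contexts):
            others = "\n\n".join(
                f"Agent {j + 1}: {ans}" for j, ans in enumerate(answers) if j != i
            )
            consensus_prompt = (
                "These are solutions from other agents:\n\n"
                f"{others}\n\n"
                "Using these responses as additional information, critically "
                "examine them and provide an updated answer with your reasoning."
            )
            ctx.append({"role": "assistant", "content": answers[i]})
            ctx.append({"role": "user", "content": consensus_prompt})
            new_answers.append(generate(ctx))
        answers = new_answers

    # 3. Convergence: the final answers typically agree; a majority vote over
    #    extracted final answers can break any remaining ties.
    return answers
```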

The prompts used to drive the debate are crucial. The paper explores prompt variations that influence the "stubbornness" of agents: a "Short" prompt encourages agents to adapt quickly to the other agents' opinions, leading to faster convergence, while a "Long" prompt encourages agents to trust their own initial reasoning more, leading to longer debates that empirically produce better final solutions.
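
As a rough illustration, the two styles differ mainly in how strongly they ask an agent to defend its own prior reasoning. The strings below are paraphrases, not the paper's exact prompts:

```python
# Paraphrased examples of the two consensus-prompt styles (illustrative wording,
# not the paper's exact prompts). Both are formatted with the other agents'
# answers before being sent to an agent.

SHORT_CONSENSUS_PROMPT = (
    "These are the solutions from the other agents:\n{other_answers}\n"
    "Based on the other agents' opinions, can you give an updated answer?"
)

LONG_CONSENSUS_PROMPT = (
    "These are the solutions from the other agents:\n{other_answers}\n"
    "Using their solutions as additional information, carefully re-examine "
    "your own step-by-step reasoning before deciding whether to revise your answer."
)
```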

For example, given a math problem, different agents might initially propose different solutions or use different methods (such as recognizing a right triangle for side lengths 3, 4, 5 versus using the Law of Cosines). In the debate rounds, they examine each other's steps (e.g., the area computation 0.5 × 3 × 4 = 6) and identify potential errors or validate correct reasoning, iteratively refining their own responses.
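
In this hypothetical 3-4-5 example, either route leads to the same conclusion; a quick check of the arithmetic (not taken from the paper's transcripts):

```latex
% Both routes give a right angle C between the sides of length 3 and 4,
% and hence the same area.
\[
\text{Right-triangle check: } 3^2 + 4^2 = 25 = 5^2 \;\Rightarrow\; C = 90^\circ;
\qquad
\text{Law of Cosines: } \cos C = \frac{3^2 + 4^2 - 5^2}{2 \cdot 3 \cdot 4} = 0
\;\Rightarrow\; C = 90^\circ.
\]
\[
\text{Area} = \tfrac{1}{2} \cdot 3 \cdot 4 \cdot \sin C = 0.5 \times 3 \times 4 = 6.
\]
```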

The paper evaluates this multiagent debate approach on a variety of reasoning and factuality tasks:

  • Reasoning Tasks: Arithmetic, GSM8K (grade school math), and Chess Move Prediction.
  • Factuality Tasks: A novel Biographies dataset, MMLU (factual knowledge questions), and Chess Move Validity.

The results consistently show that multiagent debate significantly outperforms single-agent baselines, including a single agent using self-reflection. On reasoning tasks, the debate process not only amplifies initially correct answers but can also lead all agents to correct their initially incorrect answers and reach the correct solution. For factuality tasks, debate helps reduce hallucinations and improves accuracy by identifying and resolving inconsistencies between agents' initial responses: facts on which agents disagree tend to be corrected or dropped through debate.

Key implementation considerations and findings from the analysis include:

  • Number of Agents and Rounds: Performance generally improves with more agents and more debate rounds, although gains tend to plateau after a few rounds (e.g., four) and the benefit of additional agents becomes less significant beyond five agents.
  • Context Length: As the number of agents and rounds grows, the concatenated responses fed back to each agent can become very long, potentially exceeding the LLM's context window. A practical mitigation is to summarize the other agents' responses first and provide the summary as context to each agent (see the sketch after this list). This summarization step was found not only to keep the context manageable but also to improve performance.
  • Mixed Models: The approach works not only with multiple instances of the same model but also when debating between different types of models (e.g., ChatGPT and Bard), showing improvements for both models and their combined performance.
  • Initialization: While the default uses the same prompt for all agents, using different initialization prompts or personas for different agents can lead to further performance gains.
  • Cost: The primary limitation is computational expense, as it requires running multiple LLM instances and multiple inference steps per query. However, the improved output quality suggests potential for using this method to generate high-quality training data for self-improvement or for tasks requiring high accuracy.
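
A minimal sketch of the summarization mitigation mentioned above, reusing the hypothetical `generate` helper from the earlier sketch; the function name and prompt wording are illustrative, not the paper's:

```python
# Condense the other agents' responses before adding them to one agent's context,
# so the concatenated feedback stays within the model's context window.

def summarize_other_agents(answers, agent_index):
    """Summarize all responses except the given agent's own answer."""
    others = "\n\n".join(
        f"Agent {j + 1}: {ans}" for j, ans in enumerate(answers) if j != agent_index
    )
    summary_prompt = (
        "Summarize the following solutions from other agents, preserving the key "
        f"reasoning steps and final answers:\n\n{others}"
    )
    return generate([{"role": "user", "content": summary_prompt}])
```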

In summary, multiagent debate provides a practical, prompting-based method to enhance LLM performance by leveraging collective intelligence and iterative refinement, applicable to various tasks without requiring internal model access or finetuning. While more costly than single-agent inference, it offers significant improvements in factual accuracy and reasoning.

Authors (5)
  1. Yilun Du (113 papers)
  2. Shuang Li (203 papers)
  3. Antonio Torralba (178 papers)
  4. Joshua B. Tenenbaum (257 papers)
  5. Igor Mordatch (66 papers)
Citations (421)