
Training Language Models to Win Debates with Self-Play Improves Judge Accuracy (2409.16636v1)

Published 25 Sep 2024 in cs.CL and cs.AI

Abstract: We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play. In a long-context reading comprehension task, we find that LLM based evaluators answer questions more accurately when judging models optimized to win debates. By contrast, we find no such relationship for consultancy models trained to persuade a judge without an opposing debater present. In quantitative and qualitative comparisons between our debate models and novel consultancy baselines, we find evidence that debate training encourages stronger and more informative arguments, showing promise that it can help provide high-quality supervision for tasks that are difficult to directly evaluate.

Training LLMs to Win Debates With Self-Play Improves Judge Accuracy

The paper presents a detailed investigation into debate as a scalable oversight mechanism for LLMs. The authors, Samuel Arnesen et al., offer empirical evidence that training language models to debate leads to more accurate judgments by model-based evaluators. Specifically, the paper asks whether training models to win debates helps evaluators discern correct answers on a complex task: long-context reading comprehension.

Key Contributions

  1. Debate vs. Consultancy Models:
    • The paper compares the effectiveness of models trained to debate against those trained to act as consultants. Debaters are trained to argue against one another, presenting opposing sides of an argument, thus helping the judge to identify the correct answer through adversarial interaction.
    • Consultants, in contrast, are trained to persuade the judge without an opposing debater, emulating a traditional reinforcement learning from human feedback (RLHF) approach.
  2. Training and Evaluation Setup:
    • The models were tested on the QuALITY dataset of multiple-choice reading comprehension questions about short stories, which requires understanding and extracting relevant information from long texts.
    • Arguments followed a structured debate protocol in which debaters engage in two-turn, simultaneous debates (a sketch of this protocol follows the list). For comparison, the consultancy models were evaluated across three setups: single, ensembled, and double consultancy.
  3. Training Techniques:
    • The paper employs a variant of Direct Preference Optimization (DPO) that replaces hard binary preference labels with continuous preference scores derived from a calibrated GPT-4-Turbo judge, training the models directly on their ability to win debates (see the second sketch below).
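
The debate protocol is straightforward to sketch. Below is a minimal illustration of how a two-turn simultaneous debate might be orchestrated; the `generate` and `judge` callables are hypothetical stand-ins for the debater and judge models, and the message format is an assumption rather than the paper's exact prompting setup.

```python
# Minimal sketch of a two-turn simultaneous debate. `generate` and `judge`
# are hypothetical stand-ins for calls to the debater and judge models.

def run_debate(story, question, answer_a, answer_b, generate, judge):
    """Run a two-turn simultaneous debate and return the judge's verdict."""
    transcript = []
    for turn in range(2):
        # "Simultaneous" means neither debater sees the other's current-turn
        # speech, only the transcript of completed turns.
        speech_a = generate(side="A", story=story, question=question,
                            defending=answer_a, transcript=transcript)
        speech_b = generate(side="B", story=story, question=question,
                            defending=answer_b, transcript=transcript)
        transcript.append((speech_a, speech_b))
    # The judge sees the question, both candidate answers, and the
    # transcript, but not the full story -- the information asymmetry
    # that makes this a test of scalable oversight.
    return judge(question, answer_a, answer_b, transcript)
```

The training objective can be sketched similarly. Standard DPO assumes a hard binary preference between a chosen and a rejected response; one natural way to incorporate a judge's continuous confidence is to use it as a soft label, as below. This illustrates the general idea under that assumption, not the paper's exact objective.

```python
import torch.nn.functional as F

def soft_dpo_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, p_judge, beta=0.1):
    """DPO-style loss with a continuous preference score as a soft label.

    logp_a, logp_b         -- policy log-probs of the two candidate speeches
    ref_logp_a, ref_logp_b -- frozen reference-model log-probs of the same
    p_judge                -- judge's confidence in speech A, in [0, 1];
                              p_judge = 1.0 recovers standard DPO
    """
    # Implicit reward margin of speech A over speech B.
    margin = beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))
    # Binary cross-entropy against the soft judge label instead of a hard 0/1.
    return -(p_judge * F.logsigmoid(margin)
             + (1 - p_judge) * F.logsigmoid(-margin))
```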

Results and Findings

  1. Improvement in Judge Accuracy:
    • Training models to debate resulted in a 4% absolute increase in judge accuracy. This improvement is statistically significant, indicating that the adversarial nature of debate improves the quality of supervision the judge can provide (a sketch of one way to test such a gain follows this list).
    • Consultancy models did not show a comparable improvement, indicating the inherent advantage of debate training in this context.
  2. Analysis of Model Behavior:
    • The debate models developed more effective argumentative strategies, which were not observed in consultancy models. Debate-trained models used more textual evidence and delivered less repetitive and more persuasive arguments.
    • The effectiveness of the debate-trained models was further validated when they were evaluated by an untrained GPT-4 model, i.e., a judge not used during training, indicating that the debaters learned generally applicable argumentation skills rather than strategies specific to the training judge.
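
For intuition on how a gain like the reported 4% might be checked for significance, here is a hedged sketch of a paired bootstrap over questions; the function, variable names, and the 95% interval are illustrative assumptions rather than the paper's reported procedure.

```python
import numpy as np

def paired_bootstrap_gain(correct_before, correct_after, n_boot=10_000, seed=0):
    """Bootstrap a confidence interval for the change in judge accuracy.

    correct_before / correct_after -- boolean arrays with one entry per
    question, marking whether the judge answered correctly when evaluating
    the pre-training vs. post-training debaters (illustrative inputs).
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(correct_after, float) - np.asarray(correct_before, float)
    n = len(diffs)
    boots = [diffs[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)]
    low, high = np.percentile(boots, [2.5, 97.5])
    return diffs.mean(), (low, high)  # gain is significant if low > 0
```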

Implications and Future Work

  1. Practical Implications:
    • These findings suggest that, as LLMs tackle increasingly complex tasks, debate-based training strategies could be essential for scalable oversight: adversarial interaction surfaces subtle flaws that a single persuader would leave hidden.
    • For AI systems involved in high-stakes decision-making or domains requiring expert-level reliability, debate-based oversight may offer a more robust mechanism than traditional RLHF-style training.
  2. Theoretical Implications:
    • This work reinforces the theoretical argument that debate can simplify evaluation tasks for a judge by having the debaters discover and explain flaws in each other’s arguments. It also implies that debate might scale better with the sophistication of the models than consultancy-based approaches.
    • The lack of significant findings in consultancy models highlights potential limitations in current RLHF methods, pointing towards the need for innovative approaches like debate training to achieve robust AI alignment and oversight.

Future Directions

  • Extended Domains and Judge Models:
    • Future research should explore debate training across various domains beyond reading comprehension, such as reasoning tasks or creative problem-solving.
    • Enhancing judge models to better interact with debate structures or incorporating models with different perspectives could provide richer insights and better oversight capabilities.
  • Advanced Training Techniques:
    • Investigations into more sophisticated training techniques or hybrid models that combine debate dynamics with other methods could yield further improvements in both robustness and interpretability.

Conclusion

The paper by Arnesen et al. advances the field of AI alignment and oversight by demonstrating that debate-trained models elicit more accurate judgments than consultancy-trained models. This work underlines the potential of adversarial training techniques as a powerful tool for developing reliable oversight of AI systems. As the complexity of tasks tackled by AI continues to grow, methodologies like debate training will be crucial for scalable and dependable oversight mechanisms.

Authors
  1. Samuel Arnesen
  2. David Rein
  3. Julian Michael