Towards Reasoning in Large Language Models via Multi-Agent Peer Review Collaboration (2311.08152v2)

Published 14 Nov 2023 in cs.CL

Abstract: LLMs have shown remarkable capabilities in general natural language processing tasks but often fall short in complex reasoning tasks. Recent studies have explored human-like problem-solving strategies, such as self-correction, to push the boundary of single-model reasoning ability further. In this work, we let a single model "step outside the box" by engaging multiple models to correct each other. We introduce a multi-agent collaboration strategy that emulates the academic peer review process. Each agent independently constructs its own solution, provides reviews on the solutions of others, and assigns confidence levels to its reviews. Upon receiving peer reviews, agents revise their initial solutions. Extensive experiments on three different types of reasoning tasks show that our collaboration approach achieves higher accuracy than existing methods across all ten datasets. Further analysis underscores the effectiveness of integrating confidence into reviews, demonstrates the superiority of feedback exchange over mere solution sharing, and highlights the role of capability and diversity in fostering successful collaboration.
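The peer-review protocol summarized in the abstract proceeds in three stages per round: independent solution drafting, cross-review with a stated confidence level, and confidence-aware revision. The sketch below is a minimal illustration of that loop under assumptions, not the authors' implementation: `Agent`, `Review`, `peer_review_round`, and the `llm` callable (any prompt-to-text function) are hypothetical names, and the 1-5 confidence scale and prompt wording are assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Review:
    reviewer: str
    feedback: str
    confidence: int  # assumed 1 (low) .. 5 (high) scale


def _parse_confidence(text: str, default: int = 3) -> int:
    # Pull a 1-5 confidence rating out of free-form model output; fall back if absent.
    match = re.search(r"confidence\D*([1-5])", text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else default


@dataclass
class Agent:
    name: str
    llm: Callable[[str], str]          # any prompt -> completion function
    solution: str = ""
    reviews: List[Review] = field(default_factory=list)

    def solve(self, question: str) -> str:
        # Stage 1: draft an independent step-by-step solution.
        self.solution = self.llm(f"Solve step by step:\n{question}")
        return self.solution

    def review(self, question: str, peer: "Agent") -> Review:
        # Stage 2: critique a peer's solution and state a confidence level.
        reply = self.llm(
            f"Question: {question}\nPeer solution:\n{peer.solution}\n"
            "List any errors, then rate your confidence in this review from 1 to 5."
        )
        return Review(self.name, reply, _parse_confidence(reply))

    def revise(self, question: str) -> str:
        # Stage 3: revise the draft in light of peer reviews, surfacing each
        # reviewer's confidence so low-confidence feedback can be discounted.
        notes = "\n".join(
            f"- (confidence {r.confidence}/5) {r.feedback}" for r in self.reviews
        )
        self.solution = self.llm(
            f"Question: {question}\nYour draft:\n{self.solution}\n"
            f"Peer reviews:\n{notes}\nRevise your solution accordingly."
        )
        return self.solution


def peer_review_round(agents: List[Agent], question: str) -> List[str]:
    """Run one solve -> cross-review -> revise round over all agents."""
    for agent in agents:
        agent.solve(question)
    for reviewer in agents:
        for peer in agents:
            if peer is not reviewer:
                peer.reviews.append(reviewer.review(question, peer))
    return [agent.revise(question) for agent in agents]
```

In practice the round could be repeated and the revised answers aggregated (for example by majority vote), but both choices are left open in this sketch.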

Authors (7)
  1. Zhenran Xu (12 papers)
  2. Senbao Shi (3 papers)
  3. Baotian Hu (67 papers)
  4. Jindi Yu (3 papers)
  5. Dongfang Li (46 papers)
  6. Min Zhang (630 papers)
  7. Yuxiang Wu (27 papers)
Citations (9)