Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate (2305.11595v3)
Abstract: Large language models (LLMs) have shown impressive capabilities in various applications, but they still face various inconsistency issues. Existing work primarily focuses on inconsistency within a single LLM, while we complementarily explore the inter-consistency among multiple LLMs for collaboration. To examine whether LLMs can collaborate effectively to achieve a consensus for a shared goal, we focus on commonsense reasoning and introduce a formal debate framework (FORD) to conduct a three-stage debate among LLMs, aligned with real-world scenarios: fair debate, mismatched debate, and roundtable debate. Through extensive experiments on various datasets, we find that LLMs can effectively collaborate to reach a consensus despite noticeable inter-inconsistencies, but imbalances in their abilities can lead to domination by superior LLMs. Leveraging a more advanced LLM like GPT-4 as an authoritative judge can boost collaboration performance. Our work contributes to understanding the inter-consistency among LLMs and lays the foundation for developing future collaboration methods. Code and data are available at https://github.com/Waste-Wood/FORD