Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate (2305.11595v3)

Published 19 May 2023 in cs.CL and cs.AI

Abstract: LLMs have shown impressive capabilities in various applications, but they still face inconsistency issues. Existing works primarily focus on inconsistency within a single LLM, while we complementarily explore the inter-consistency among multiple LLMs in collaboration. To examine whether LLMs can collaborate effectively to reach a consensus on a shared goal, we focus on commonsense reasoning and introduce a formal debate framework (FORD) that conducts a three-stage debate among LLMs aligned with real-world scenarios: fair debate, mismatched debate, and roundtable debate. Extensive experiments on various datasets show that LLMs can collaborate effectively to reach a consensus despite noticeable inter-inconsistencies, but that imbalances in their abilities can lead to domination by superior LLMs. Leveraging a more advanced LLM such as GPT-4 as an authoritative judge can further boost collaboration performance. Our work contributes to understanding the inter-consistency among LLMs and lays a foundation for developing future collaboration methods. Code and data are available at https://github.com/Waste-Wood/FORD
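As a rough illustration of the debate setup described in the abstract, the sketch below runs a two-debater exchange over a multiple-choice commonsense question and then asks a judge model for a verdict. This is a minimal, hypothetical reconstruction: the `LLM` callable type, the `debate` helper, and the prompt wording are illustrative assumptions, not the authors' FORD implementation (see the linked repository for the actual code).

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a model response string.
LLM = Callable[[str], str]

def debate(question: str, options: List[str],
           debater_a: LLM, debater_b: LLM, judge: LLM,
           rounds: int = 3) -> str:
    """Simplified two-debater debate with a judge, loosely mirroring the
    FORD-style setup described in the abstract (not the authors' exact prompts)."""
    choices = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    transcript: List[str] = []

    # The debaters alternate: each states (or defends) an answer given the
    # question, the options, and the debate history so far.
    for turn in range(rounds * 2):
        speaker, name = (debater_a, "Debater A") if turn % 2 == 0 else (debater_b, "Debater B")
        history = "\n".join(transcript) or "(no arguments yet)"
        prompt = (
            f"Question: {question}\nOptions:\n{choices}\n"
            f"Debate so far:\n{history}\n"
            f"You are {name}. Give your answer letter and a brief argument."
        )
        transcript.append(f"{name}: {speaker(prompt)}")

    # A separate model acts as the judge and issues the final answer.
    verdict_prompt = (
        f"Question: {question}\nOptions:\n{choices}\n"
        "Debate transcript:\n" + "\n".join(transcript) +
        "\nAs the judge, state the single best answer letter."
    )
    return judge(verdict_prompt)
```

In the paper's terms, a fair debate would use two debaters of comparable strength, a mismatched debate pairs models of different ability, and the judge role corresponds to using a stronger model such as GPT-4 to arbitrate the roundtable outcome.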

Authors (5)
  1. Kai Xiong (33 papers)
  2. Xiao Ding (38 papers)
  3. Yixin Cao (138 papers)
  4. Ting Liu (329 papers)
  5. Bing Qin (186 papers)
Citations (38)
