Getting MoRE out of Mixture of Language Model Reasoning Experts (2305.14628v2)
Abstract: While recent LLMs improve on various question answering (QA) datasets, it remains difficult for a single model to generalize across question types that require distinct reasoning abilities. We provide empirical evidence that state-of-the-art LLMs suffer from poor generalizability on reasoning types beyond those seen in the prompt. To remedy this, we propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized LLMs. We specialize the backbone LLM with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning. Our key insight is to leverage agreement among the specialized experts to select the best answer for each question, or to abstain from answering. This gives MoRE higher accuracy than any single specialized model on a collection of 12 QA datasets spanning four reasoning types. Beyond generalizability, MoRE improves selective question answering over baselines that do not incorporate inter-expert agreement, and its interpretable design makes its outputs more useful to human consumers of QA systems. Our human study confirms that presenting expert predictions and the answer selection process helps annotators calibrate more accurately when to trust the system's output. We release all code and data to facilitate future work.
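To make the selection mechanism concrete, below is a minimal sketch of MoRE-style answer selection via inter-expert agreement: each reasoning expert is a specialized prompt over the same backbone LLM, and the system returns the majority answer or abstains when agreement is low. The prompt templates, the `call_llm` backend, the normalization, and the abstention threshold are illustrative assumptions, not the authors' released implementation (which additionally trains an answer selector over richer features).

```python
# Sketch of MoRE-style answer selection via inter-expert agreement.
# Prompt wording, run order, and the 0.5 threshold are assumptions for
# illustration only; they are not the paper's exact configuration.
from collections import Counter
from typing import Callable, Optional

# One specialized prompt per reasoning category (factual, multihop,
# mathematical, commonsense), all sharing the same backbone LLM.
EXPERT_PROMPTS = {
    "factual": "Answer the factual question directly.\nQ: {q}\nA:",
    "multihop": "Decompose the question into hops, then answer.\nQ: {q}\nA:",
    "math": "Solve step by step, then state the final answer.\nQ: {q}\nA:",
    "commonsense": "Use commonsense knowledge to answer.\nQ: {q}\nA:",
}

def more_answer(
    question: str,
    call_llm: Callable[[str], str],   # any prompt -> answer backend
    abstain_threshold: float = 0.5,
) -> Optional[str]:
    """Query every specialized expert, then pick by agreement or abstain."""
    answers = [
        call_llm(template.format(q=question)).strip().lower()
        for template in EXPERT_PROMPTS.values()
    ]
    best_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)      # fraction of experts that agree
    if agreement < abstain_threshold:
        return None                       # abstain: experts disagree too much
    return best_answer
```

With any `call_llm` wrapper around a chat or completion API, `more_answer(question, call_llm)` returns the majority answer, or `None` when fewer than half of the experts agree; exposing the per-expert answers alongside this decision is what the paper's human study uses to help annotators decide when to trust the system.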
Authors: Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettlemoyer, Jordan Boyd-Graber