ALERT: Adapting Language Models to Reasoning Tasks (2212.08286v2)
Abstract: Current LLMs can perform reasonably well on complex tasks that require step-by-step reasoning with few-shot learning. Are these models applying reasoning skills learned during pre-training, reasoning outside of their training context, or are they simply memorizing their training corpus at a finer granularity, having learned to better understand their context? To tease apart these possibilities, we introduce ALERT, a benchmark and suite of analyses for assessing LLMs' reasoning ability by comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. ALERT provides a test bed to assess any LLM on fine-grained reasoning skills; it spans over 20 datasets and covers 10 different reasoning skills. We leverage ALERT to further investigate the role of finetuning. Through extensive empirical analysis, we find that LLMs learn more reasoning skills, such as textual entailment, abductive reasoning, and analogical reasoning, during the finetuning stage than during pretraining. We also find that finetuned LLMs tend to overfit to the prompt template, which hurts model robustness and causes generalization problems.
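The finding that finetuned models overfit to the prompt template suggests a simple robustness probe: score the same examples under several paraphrased templates and compare the spread. Below is a minimal, hypothetical sketch of such a check; the `query_model` stub, the template strings, and the exact-match scoring are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Hypothetical prompt-template robustness probe, in the spirit of ALERT's
# analysis. Everything here (templates, stub model, scoring) is illustrative.

from statistics import mean, pstdev

# Several paraphrases of the same task, differing only in template wording.
TEMPLATES = [
    "Question: {q}\nAnswer:",
    "Q: {q}\nA:",
    "Please answer the following question.\n{q}\nAnswer:",
]

EXAMPLES = [
    {"q": "If Tom has 3 apples and eats 1, how many remain?", "gold": "2"},
]

def query_model(prompt: str) -> str:
    """Stub standing in for a real LLM call (e.g., a finetuned checkpoint)."""
    return "2"  # replace with an actual generation call

def accuracy_for_template(template: str) -> float:
    """Exact-match accuracy of the model under one prompt template."""
    hits = 0
    for ex in EXAMPLES:
        pred = query_model(template.format(q=ex["q"])).strip()
        hits += int(pred == ex["gold"])
    return hits / len(EXAMPLES)

if __name__ == "__main__":
    scores = [accuracy_for_template(t) for t in TEMPLATES]
    # A large spread across templates suggests the model has overfit to one
    # template's surface form rather than learning the underlying skill.
    print(f"per-template accuracy: {scores}")
    print(f"mean={mean(scores):.2f}  spread(std)={pstdev(scores):.2f}")
```

A robust model should score similarly under all three templates; a model that has memorized one finetuning template will show a large gap between its best and worst template.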
Authors: Ping Yu, Tianlu Wang, Olga Golovneva, Badr AlKhamissi, Siddharth Verma, Zhijing Jin, Gargi Ghosh, Mona Diab, Asli Celikyilmaz