Over-Reasoning and Redundant Calculation of Large Language Models (2401.11467v2)
Abstract: LLMs can solve problems step-by-step. While this chain-of-thought (CoT) reasoning boosts LLMs' performance, it is unclear whether LLMs *know* when to use CoT and whether the CoT steps are always necessary to answer the question. This paper shows that LLMs tend to generate redundant calculations and reasoning on a manually constructed math QA dataset, GSM8K-Zero. GSM8K-Zero is constructed such that the questions can be answered without any calculation, yet LLMs, including Llama-2 models and Claude-2, tend to generate lengthy and unnecessary calculations to answer them. We also conduct experiments to explain why LLMs generate redundant calculations and reasoning. GSM8K-Zero is publicly available at https://github.com/d223302/Over-Reasoning-of-LLMs and https://huggingface.co/datasets/dcml0714/GSM8K-Zero.
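Because every GSM8K-Zero question is answerable without arithmetic, any calculation appearing in a model's response is by construction redundant. Below is a minimal sketch of such a probe, assuming the Hugging Face dataset exposes a `test` split with a `question` field (the split and field names are assumptions; check the dataset card) and using a hypothetical `query_llm` placeholder in place of a real model API:

```python
# Minimal sketch: probe GSM8K-Zero for redundant calculation.
# Assumptions (not confirmed by the abstract): a "test" split with a
# "question" field; `query_llm` is a hypothetical stand-in for whatever
# chat model is being evaluated.
from datasets import load_dataset

dataset = load_dataset("dcml0714/GSM8K-Zero", split="test")

def query_llm(question: str) -> str:
    # Hypothetical placeholder: swap in a real API call (OpenAI, Anthropic,
    # a local Llama-2, ...). A canned reply keeps the sketch runnable.
    return "The answer is 5."

def contains_calculation(response: str) -> bool:
    # Rough heuristic: treat arithmetic symbols as evidence that the model
    # performed a calculation; every GSM8K-Zero question needs none.
    return any(op in response for op in ("+", "*", "/", "="))

redundant = sum(
    contains_calculation(query_llm(example["question"])) for example in dataset
)
print(f"Responses containing calculations: {redundant}/{len(dataset)}")
```

This operator-matching check is only a crude proxy for the paper's analysis, but it illustrates the core idea: on this dataset, any detected calculation counts against the model.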
- GPT4All: Training an assistant-style chatbot with large-scale data distillation from GPT-3.5-Turbo. https://github.com/nomic-ai/gpt4all.
- PaLM 2 technical report.
- Anthropic. 2023. Model card and evaluations for Claude models. Accessed on October 1, 2023.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- ROSCOE: A suite of metrics for scoring step-by-step reasoning. In The Eleventh International Conference on Learning Representations.
- How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597.
- Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.
- The Flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688.
- OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. Accessed on October 10, 2023.
- OpenAI. 2023. GPT-4 technical report.
- Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
- Llama 2: Open foundation and fine-tuned chat models.
- Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2609–2634, Toronto, Canada. Association for Computational Linguistics.
- Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
- The unreliability of explanations in few-shot prompting for textual reasoning. In Advances in Neural Information Processing Systems.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
- Cheng-Han Chiang
- Hung-yi Lee