
Over-Reasoning and Redundant Calculation of Large Language Models (2401.11467v2)

Published 21 Jan 2024 in cs.CL

Abstract: LLMs can solve problems step by step. While this chain-of-thought (CoT) reasoning boosts LLMs' performance, it is unclear whether LLMs *know* when to use CoT and whether the CoT steps are always necessary to answer the question. This paper shows that LLMs tend to generate redundant calculations and reasoning on a manually constructed math QA dataset, GSM8K-Zero. GSM8K-Zero is constructed such that its questions can be answered without any calculation, yet LLMs, including Llama-2 models and Claude-2, tend to produce lengthy and unnecessary calculations when answering them. We also conduct experiments to explain why LLMs generate redundant calculations and reasoning. GSM8K-Zero is publicly available at https://github.com/d223302/Over-Reasoning-of-LLMs and https://huggingface.co/datasets/dcml0714/GSM8K-Zero.
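As a rough illustration of the over-reasoning phenomenon (this is not the paper's evaluation protocol), one could flag explicit arithmetic in a model response with a simple pattern match. The question and the two responses below are hypothetical examples in the style of GSM8K-Zero, where the answer is already stated in the question:

```python
import re

def contains_calculation(response: str) -> bool:
    """Crude check for explicit arithmetic steps (e.g. '5 - 2 = 3') in a response."""
    return re.search(r"\d+\s*[+\-*/x×]\s*\d+\s*=\s*\d+", response) is not None

# GSM8K-Zero-style question: the answer (3 apples) is stated in the question itself.
question = ("Mary had 5 apples. After giving some away, she is left with 3 apples. "
            "How many apples does Mary have now?")

redundant = "Mary started with 5 apples. 5 - 2 = 3, so Mary has 3 apples."
direct = "Mary has 3 apples."

print(contains_calculation(redundant))  # True: arithmetic was unnecessary here
print(contains_calculation(direct))     # False: answered without calculation
```

A real analysis would need to handle calculations written in words and verify that the arithmetic is genuinely unnecessary, but the pattern above captures the basic measurement idea.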

Authors (2)
  1. Cheng-Han Chiang
  2. Hung-yi Lee