
The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models (2401.05618v3)

Published 11 Jan 2024 in cs.CL and cs.AI

Abstract: In this paper, we introduce Concise Chain-of-Thought (CCoT) prompting. We compared standard CoT and CCoT prompts to see how conciseness impacts response length and correct-answer accuracy. We evaluated this using GPT-3.5 and GPT-4 with a multiple-choice question-and-answer (MCQA) benchmark. CCoT reduced average response length by 48.70% for both GPT-3.5 and GPT-4 while having a negligible impact on problem-solving performance. However, on math problems, GPT-3.5 with CCoT incurs a performance penalty of 27.69%. Overall, CCoT leads to an average per-token cost reduction of 22.67%. All code, data, and supplemental materials are available on GitHub at https://github.com/matthewrenze/jhu-concise-cot


Summary

  • The paper demonstrates that CCoT reduces average response length by 48.70% without a significant loss in overall accuracy, enhancing efficiency.
  • It employs benchmark MCQA datasets with GPT-3.5 and GPT-4, using the Mann-Whitney U test to validate statistical significance of the findings.
  • The study reveals practical benefits such as 21-23% cost savings and reduced energy consumption, while noting a performance drop in math tasks for GPT-3.5.

The Benefits of a Concise Chain of Thought on Problem-Solving in LLMs

The paper "The Benefits of a Concise Chain of Thought on Problem-Solving in LLMs" by Renze and Guven examines the efficiency and effectiveness of Concise Chain-of-Thought (CCoT) prompting compared to standard Chain-of-Thought (CoT) prompting in LLMs. This paper investigates the trade-off between response length and problem-solving performance, utilizing two prominent LLMs: GPT-3.5 and GPT-4.

Background and Motivation

Recent advancements in LLMs have elevated their utility for a wide array of AI applications. However, realizing the full potential of these models often necessitates sophisticated prompt-engineering techniques. CoT prompting, which instructs LLMs to reason through problems step-by-step, has shown effectiveness in enhancing the problem-solving capabilities of LLMs. Despite its benefits, CoT tends to increase response verbosity, thereby inflating operational costs due to the per-token pricing model employed by most LLM APIs.

The authors introduce CCoT, which aims to balance the verbosity of responses with the need for comprehensive problem-solving reasoning. This balance is achieved by instructing the LLM to be concise while reasoning step-by-step, potentially reducing token usage without compromising the quality of the output.

Methodology

To evaluate CCoT, the paper uses a benchmark dataset of Multiple-Choice Question-and-Answer (MCQA) problems drawn from a variety of problem domains, including ARC, AGIEval, HellaSwag, and MedMCQA. Experimental comparisons were conducted using three prompt types: answer-only, standard CoT, and CCoT.
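
For illustration, the sketch below shows how the three prompt styles might be issued through OpenAI's chat API. The system messages are paraphrases rather than the paper's exact prompts (those are in the authors' GitHub repository), and the model name and helper function are assumptions made for this example.

```python
# Minimal sketch of the three prompt styles compared in the paper.
# The system messages are illustrative paraphrases, not the exact
# prompts from the authors' repository.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPTS = {
    "answer-only": "Answer with only the letter of the correct option.",
    "cot": "Think through the problem step-by-step, then give the final answer.",
    "ccot": "Be concise. Think step-by-step, then give the final answer.",
}

def ask(style: str, question: str, model: str = "gpt-4") -> str:
    """Send one MCQA question under the given prompting style."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPTS[style]},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

question = "Which planet is largest? (A) Earth (B) Jupiter (C) Mars (D) Venus"
for style in PROMPTS:
    print(style, "->", ask(style, question))
```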

Two key hypotheses were tested:

  1. Response-Length Hypothesis (RL-H): CCoT will produce shorter responses compared to standard CoT.
  2. Performance Hypothesis (P-H): CCoT will not degrade problem-solving performance compared to standard CoT.

Metrics for evaluation included response length (measured in tokens) and correct-answer accuracy. Statistical significance was assessed using the Mann-Whitney U test.
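
As a minimal sketch of how these two measurements can be made, the snippet below counts response tokens with OpenAI's tiktoken tokenizer and applies scipy.stats.mannwhitneyu to the length distributions. The sample responses are invented stand-ins; the study compares many real responses per condition.

```python
# Sketch: response length in tokens, and a Mann-Whitney U test on the
# length distributions of CoT vs. CCoT responses (toy data).
import tiktoken
from scipy.stats import mannwhitneyu

enc = tiktoken.encoding_for_model("gpt-4")

def token_length(text: str) -> int:
    """Response length measured in model tokens, as in the paper."""
    return len(enc.encode(text))

# Invented stand-ins; the study compares many real responses per condition.
cot_responses = [
    "First, compare the planets' diameters. Jupiter's is about 143,000 km, "
    "far larger than the others, so the answer is (B).",
    "Step 1: recall planet sizes. Step 2: Jupiter exceeds all others. (B).",
]
ccot_responses = ["Jupiter is largest. (B)", "Largest: Jupiter. (B)"]

cot_lengths = [token_length(r) for r in cot_responses]
ccot_lengths = [token_length(r) for r in ccot_responses]

# Two-sided test: do the two length distributions differ significantly?
stat, p = mannwhitneyu(cot_lengths, ccot_lengths, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")
```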

Results

The experimental results substantiate several crucial points:

  • Response Length Reduction: CCoT reduced average response length by 48.70% across GPT-3.5 and GPT-4 combined. Individually, response length fell by 47.62% for GPT-3.5 and 49.77% for GPT-4, with both reductions statistically significant.
  • Performance Impact: CCoT's impact on correct-answer accuracy was negligible overall. GPT-3.5 showed a non-significant 2.95% decrease in accuracy, while GPT-4's performance remained virtually unchanged. However, for math-related problems, GPT-3.5 experienced a notable 27.69% reduction in accuracy when using CCoT.

These findings imply that while GPT-4 maintains performance across all problem domains with CCoT, GPT-3.5's performance drops specifically in mathematical tasks.

Practical Implications

The reduction in response length translates directly into cost savings due to the token-based pricing of LLM APIs. The total cost savings were calculated to be 21.85% for GPT-3.5 and 23.49% for GPT-4, making CCoT an attractive technique for cost-conscious AI systems engineers. Additionally, reduced token usage can lead to lower energy consumption and shorter response times.
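
As a back-of-the-envelope check on why roughly halving response length yields about a quarter, not half, of total cost, the sketch below uses GPT-4's late-2023 list prices (input tokens priced at half the output rate) with invented token counts. The prompt is unchanged by CCoT, so only the output side of the bill shrinks.

```python
# Back-of-the-envelope: why halving response length does not halve cost.
# Illustrative GPT-4 list prices (late 2023): $0.03/1K input, $0.06/1K output.
PRICE_IN, PRICE_OUT = 0.03 / 1000, 0.06 / 1000  # dollars per token

def query_cost(prompt_tokens: int, response_tokens: int) -> float:
    return prompt_tokens * PRICE_IN + response_tokens * PRICE_OUT

# Invented token counts: prompt unchanged, response shrinks ~50%.
baseline = query_cost(prompt_tokens=600, response_tokens=200)
concise = query_cost(prompt_tokens=600, response_tokens=100)

savings = 1 - concise / baseline
print(f"total cost reduction: {savings:.1%}")  # 20.0% with these numbers
```

With these illustrative numbers the blended saving lands near 20%, in the same range as the 21.85% and 23.49% figures reported in the paper.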

Theoretical Implications and Future Work

The results indicate that only a subset of the tokens in a CoT-style response is necessary for effective problem-solving. This finding invites further investigation into which aspects of CoT reasoning are essential and which are superfluous. Future research could build on this work by:

  • Applying CCoT to other LLM architectures beyond the GPT family.
  • Experimenting with different variations of CCoT prompts.
  • Exploring additional task types and broader problem domains.
  • Conducting detailed error analysis to understand specific shortcomings of CCoT.

Conclusion

The investigation into CCoT provides valuable insights into optimizing prompt engineering for LLMs. By showing that LLMs can retain their problem-solving efficacy while producing less verbose responses, CCoT offers a practical approach to improving the efficiency and cost-effectiveness of AI systems. The paper thus serves as a foundation for both theoretical exploration and practical adoption of more concise prompting strategies in LLM applications.
