
How Likely Do LLMs with CoT Mimic Human Reasoning?

Published 25 Feb 2024 in cs.CL, cs.AI, and cs.LG | (2402.16048v3)

Abstract: Chain-of-thought emerges as a promising technique for eliciting reasoning capabilities from LLMs. However, it does not always improve task performance or accurately represent reasoning processes, leaving unresolved questions about its usage. In this paper, we diagnose the underlying mechanism by comparing the reasoning process of LLMs with humans, using causal analysis to understand the relationships between the problem instruction, reasoning, and the answer in LLMs. Our empirical study reveals that LLMs often deviate from the ideal causal chain, resulting in spurious correlations and potential consistency errors (inconsistent reasoning and answers). We also examine various factors influencing the causal structure, finding that in-context learning with examples strengthens it, while post-training techniques like supervised fine-tuning and reinforcement learning on human feedback weaken it. To our surprise, the causal structure cannot be strengthened by enlarging the model size only, urging research on new techniques. We hope that this preliminary study will shed light on understanding and improving the reasoning process in LLM.


Summary

  • The paper demonstrates that CoT inconsistently influences LLM performance, impairing basic tasks while aiding complex reasoning.
  • The study employs causal analysis and structural causal models to compare LLM reasoning with human logical processes.
  • Findings highlight that training methods like RLHF can partially align model reasoning with causal logic, suggesting paths for future improvement.

Introduction

The paper "How Likely Do LLMs with CoT Mimic Human Reasoning?" examines the efficacy of the Chain of Thought (CoT) approach in LLMs, particularly focusing on its reasoning capabilities. The authors investigate whether CoT reasoning accurately reflects human-like logical processes, emphasizing the effects of CoT on LLM performance across different tasks. The study aims to explore discrepancies in reasoning patterns between LLMs and humans potentially introduced by CoT methodologies.

Methodology

The authors leverage causal analysis to assess how CoT prompts influence reasoning in LLMs. They examine the Structural Causal Models (SCMs) implied by these models during reasoning tasks, comparing them against typical human reasoning patterns. The relationships among the problem instruction, the CoT, and the resulting answer were tested using interventions and causal inference methodology. The authors also analyze how common training techniques, such as in-context learning (ICL), supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), affect these causal structures.

Figure 1: CoT and Answer do not fully align.
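To make the intervention idea concrete, the following is a minimal sketch of how the strength of the CoT-to-answer link could be probed; the `generate` wrapper, the perturbed-CoT intervention, and the flip-rate metric are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch of an intervention-style probe of the CoT -> Answer link.
# `generate` is a hypothetical wrapper around an LLM call; the perturbation
# and metric below are illustrative, not the paper's exact procedure.

def generate(prompt: str) -> str:
    """Placeholder for a call to an LLM (API or local model)."""
    raise NotImplementedError

def answer_with_cot(question: str, cot: str) -> str:
    # Condition the final answer on a fixed chain of thought.
    prompt = f"Question: {question}\nReasoning: {cot}\nAnswer:"
    return generate(prompt).strip()

def cot_intervention_flips_answer(question: str, original_cot: str, perturbed_cot: str) -> bool:
    """Return True if intervening on the CoT changes the answer.

    If the answer rarely changes under CoT interventions, the CoT -> Answer
    edge in the implied structural causal model is effectively weak or absent.
    """
    return answer_with_cot(question, original_cot) != answer_with_cot(question, perturbed_cot)

def causal_strength(examples: list[tuple[str, str, str]]) -> float:
    """Estimate causal strength as the fraction of (question, cot, perturbed_cot)
    triples whose answer flips under the CoT intervention."""
    flips = [cot_intervention_flips_answer(q, cot, p_cot) for q, cot, p_cot in examples]
    return sum(flips) / max(len(flips), 1)
```

The same probe, applied to interventions on the instruction while the CoT is held fixed, would expose any direct instruction-to-answer path that bypasses the stated reasoning.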

Findings and Analysis

Unstable Effectiveness of CoT: CoT did not consistently improve performance across tasks. On basic arithmetic tasks it impaired performance, whereas on complex reasoning problems it was beneficial, suggesting that simple tasks gain little from the structured reasoning steps CoT imposes.

Incongruency Between CoT and Answers: Across the evaluations, incorrect CoT could still lead to correct answers, and correct CoT to incorrect answers, raising questions about the causal relation between the models' reasoning steps and their final outputs.
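As a rough illustration, such consistency errors can be tallied as the fraction of examples where the judged correctness of the CoT disagrees with the correctness of the answer; the field names below are hypothetical, and judging CoT correctness would in practice require human or automated evaluation.

```python
# Illustrative tally of consistency errors: cases where the chain of thought
# and the final answer disagree in correctness. Field names are hypothetical.

from dataclasses import dataclass

@dataclass
class Example:
    cot_is_correct: bool     # judged correctness of the reasoning chain
    answer_is_correct: bool  # judged correctness of the final answer

def consistency_error_rate(examples: list[Example]) -> float:
    """Fraction of examples with inconsistent CoT/answer correctness:
    wrong reasoning with a right answer, or right reasoning with a wrong answer."""
    if not examples:
        return 0.0
    errors = sum(ex.cot_is_correct != ex.answer_is_correct for ex in examples)
    return errors / len(examples)

# Toy usage with made-up judgments:
toy = [Example(True, True), Example(False, True), Example(True, False)]
print(consistency_error_rate(toy))  # two of the three are consistency errors
```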

Impact of Model Size and Training Techniques: Larger LLMs sometimes exhibited structures closer to effective causal reasoning, but this was not uniform across tasks, and enlarging the model size alone did not strengthen the causal structure. Moreover, SFT frequently led models to pick up extraneous features, misaligning the CoT with the final answer, while RLHF partially mitigated this issue by tying answers more closely to the correct causal pathway.

Figure 2: LLMs with CoT exhibit inconsistent effects, where CoT shows inferior performance to Direct in some tasks, superior in others.

Causal Structures and Errors

The study identifies several SCM types based on the strength of the relationships among the instruction, the CoT, and the answer. Many LLMs exhibited SCM types that allow spurious dependencies to influence the relation between the CoT and the answer. Crucially, these spurious correlations manifested as logical inconsistencies between reasoning and answers, an issue far less prevalent under the ideal causal chain, where the answer strictly follows from the reasoning.

Figure 3: Potential structural causal models (SCMs) for chain-of-thought (CoT) in question-answering.
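The sketch below shows one way such SCM types could be categorized from estimated edge strengths; the thresholds and type labels are illustrative assumptions rather than the paper's exact taxonomy.

```python
# Illustrative classification of an LLM's implied SCM over
# (Instruction, CoT, Answer) from estimated edge strengths.
# Thresholds and labels are assumptions, not the paper's exact taxonomy.

def classify_scm(instr_to_answer: float, cot_to_answer: float, threshold: float = 0.5) -> str:
    """Label the implied causal structure from two estimated edge strengths.

    instr_to_answer: how strongly the answer responds to interventions on the
        instruction while the CoT is held fixed (a direct path bypassing the CoT).
    cot_to_answer: how strongly the answer responds to interventions on the CoT.
    """
    cot_drives_answer = cot_to_answer >= threshold
    instr_bypasses_cot = instr_to_answer >= threshold

    if cot_drives_answer and not instr_bypasses_cot:
        # Instruction -> CoT -> Answer: the ideal causal chain.
        return "causal chain"
    if cot_drives_answer and instr_bypasses_cot:
        # The instruction is a common cause: it shapes both the CoT and the
        # answer directly, opening the door to spurious correlations.
        return "common cause / mixed structure"
    if not cot_drives_answer and instr_bypasses_cot:
        # The CoT is post-hoc rationalization: the answer ignores it.
        return "direct instruction -> answer (CoT bypassed)"
    return "weakly connected (neither edge is strong)"
```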

Practical and Theoretical Implications

The findings carry significant implications for developing models that reason faithfully, in a manner closer to human logical processes. Changes to training methods may help align CoT mechanisms more closely with human reasoning. Greater emphasis on causal consistency could improve model reliability, especially when reasoning through complex, context-rich problems. Further research could investigate more refined causal structures or integrate counterfactual reasoning to bridge existing gaps in LLM reasoning fidelity.

Conclusion

This research provides insights into the cognitive alignment challenges of LLMs when adopting CoT strategies. By uncovering inconsistencies between CoT reasoning and model decision-making processes, it suggests that current CoT implementations may not suffice for faithfully mirroring human logic. Future directions could explore more causally sound approaches for LLM training, or develop novel CoT paradigms that reinforce genuine causal dependencies akin to human reasoning mechanisms.
