
How Likely Do LLMs with CoT Mimic Human Reasoning? (2402.16048v3)

Published 25 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Chain-of-thought emerges as a promising technique for eliciting reasoning capabilities from LLMs. However, it does not always improve task performance or accurately represent reasoning processes, leaving unresolved questions about its usage. In this paper, we diagnose the underlying mechanism by comparing the reasoning process of LLMs with humans, using causal analysis to understand the relationships between the problem instruction, reasoning, and the answer in LLMs. Our empirical study reveals that LLMs often deviate from the ideal causal chain, resulting in spurious correlations and potential consistency errors (inconsistent reasoning and answers). We also examine various factors influencing the causal structure, finding that in-context learning with examples strengthens it, while post-training techniques like supervised fine-tuning and reinforcement learning from human feedback weaken it. To our surprise, the causal structure cannot be strengthened by enlarging the model size alone, urging research on new techniques. We hope that this preliminary study will shed light on understanding and improving the reasoning process in LLMs.

Exploring Non-Causal Reasoning in LLMs Through the Lens of Chain-of-Thought

Introduction

LLMs have attained prominence across a broad spectrum of complex problem-solving tasks. The adoption of the Chain-of-Thought (CoT) strategy has been a significant advance, purportedly enhancing the reasoning capabilities of these models. Contrary to the intuitive expectation that correct CoTs yield correct answers, empirical evidence reveals a surprising disjunction between the two. Our paper scrutinizes this phenomenon through causal analysis, aiming to uncover the Structural Causal Model (SCM) that LLMs implicitly adopt and to contrast it with human reasoning frameworks.

The Non-Causal Link Between CoT and Correct Answers

The examination of LLM performance across various tasks, employing both direct answering and CoT methodologies, unveils an inconsistent impact of CoT on accuracy. Notably, the presence of correct answers following incorrect CoTs, and vice versa, underscores a potential disconnect in the LLMs’ reasoning pathway. This discrepancy casts doubt on the presumption that LLMs engage in sequential causal reasoning analogous to human thought processes.
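
To make the notion of a consistency error concrete, the following minimal sketch (our illustration, not the authors' code; the record fields `cot_correct` and `answer_correct` are hypothetical) tallies how often a model's CoT and final answer disagree in correctness:

```python
from collections import Counter

def consistency_errors(records):
    """records: iterable of dicts with boolean 'cot_correct' and 'answer_correct' fields."""
    table = Counter((r["cot_correct"], r["answer_correct"]) for r in records)
    total = sum(table.values())
    return {
        "correct CoT, wrong answer": table[(True, False)] / total,
        "wrong CoT, correct answer": table[(False, True)] / total,
        "consistent": (table[(True, True)] + table[(False, False)]) / total,
    }

# Toy data: two of the four examples exhibit a consistency error.
demo = [
    {"cot_correct": True,  "answer_correct": True},
    {"cot_correct": True,  "answer_correct": False},  # consistency error
    {"cot_correct": False, "answer_correct": True},   # consistency error
    {"cot_correct": False, "answer_correct": False},
]
print(consistency_errors(demo))
```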

Demonstrating Causal Relationships in LLM Reasoning

Using cause-effect interventions, our analysis identifies the SCMs that different tasks and models implicitly adopt. These causal structures fall into four distinct types, with "full connection" and "common cause" being the most prevalent among the examined tasks. This variation highlights the nuance in LLM reasoning capabilities and the influence of factors such as in-context learning and fine-tuning methodologies on the established causal links.
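
The intervention logic can be sketched as follows. Assume a hypothetical `ask_model(instruction, reasoning)` interface that returns the model's final answer when the given reasoning is supplied as its CoT; by editing the CoT with the instruction held fixed (and vice versa) and checking whether the answer changes, one can read off which edges of the implied SCM are active. The single-example decision rule and the labels "causal chain" and "isolation" below are simplifications for illustration; the paper's actual procedure relies on aggregate statistics over many examples.

```python
def implied_scm(ask_model, instruction, cot, edited_cot, edited_instruction):
    """Classify the implied SCM for one example via two interventions (illustrative)."""
    base_answer = ask_model(instruction, cot)

    # Does the answer track the reasoning? (tests the CoT -> answer edge)
    cot_effect = ask_model(instruction, edited_cot) != base_answer
    # Does the instruction still move the answer with the CoT pinned?
    # (tests a direct instruction -> answer edge)
    direct_effect = ask_model(edited_instruction, cot) != base_answer

    if cot_effect and direct_effect:
        return "full connection"   # instruction -> CoT -> answer, plus a direct edge
    if cot_effect:
        return "causal chain"      # answer depends on the instruction only through the CoT
    if direct_effect:
        return "common cause"      # instruction drives both; the stated CoT is ignored
    return "isolation"             # answer insensitive to both interventions
```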

Implications on LLM Training and Reasoning

Our investigation examines the impact of commonly used techniques in LLM training, including in-context learning, supervised fine-tuning, and reinforcement learning from human feedback. While these methods improve task performance, they do not consistently strengthen the causal relationship between the CoT and the resulting answer. This finding raises important considerations for future LLM training paradigms and underscores the need for mechanisms that genuinely enhance reasoning capabilities.
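
As a rough illustration of how one might probe whether a training regime strengthens the CoT-to-answer link, the sketch below (our assumption, not the paper's metric) compares, across model variants, the fraction of examples in which editing the CoT flips the answer; the variant names and outcomes are toy placeholders.

```python
def cot_effect_rate(flips):
    """Fraction of examples where an intervened CoT changed the final answer."""
    return sum(flips) / len(flips) if flips else 0.0

# Toy placeholder outcomes per variant (True = answer followed the edited CoT).
variants = {
    "few-shot (ICL)": [True, True, False, True],
    "SFT":            [True, False, False, False],
    "RLHF":           [False, False, True, False],
}
for name, flips in sorted(variants.items(), key=lambda kv: -cot_effect_rate(kv[1])):
    print(f"{name:>14}: CoT -> answer effect rate = {cot_effect_rate(flips):.2f}")
```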

Future Directions

Dissecting CoT through causal analysis is a nascent but pivotal approach to understanding and advancing LLM reasoning. Future work might examine reasoning processes at a finer granularity, incorporate other reasoning strategies, and consider a broader range of tasks. Moreover, adapting LLM training methodologies to align more closely with the ideal causal reasoning structure remains an open avenue for research.

Conclusion

In summary, our paper accentuates the intricate landscape of LLM reasoning, marked by non-causal reasoning patterns and influenced by prevalent training techniques. By pioneering the causal analysis of CoT in LLMs, we hope to pave the way for further research aimed at cultivating more reliable and transparent LLM reasoning processes.

Authors (5)
  1. Guangsheng Bao
  2. Hongbo Zhang
  3. Linyi Yang
  4. Cunxiang Wang
  5. Yue Zhang