Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement (2310.08559v4)

Published 12 Oct 2023 in cs.CL and cs.AI

Abstract: The ability to derive underlying principles from a handful of observations and then generalize to novel situations -- known as inductive reasoning -- is central to human intelligence. Prior work suggests that language models (LMs) often fall short on inductive reasoning, despite achieving impressive success on research benchmarks. In this work, we conduct a systematic study of the inductive reasoning capabilities of LMs through iterative hypothesis refinement, a technique that more closely mirrors the human inductive process than standard input-output prompting. Iterative hypothesis refinement employs a three-step process: proposing, selecting, and refining hypotheses in the form of textual rules. By examining the intermediate rules, we observe that LMs are phenomenal hypothesis proposers (i.e., generating candidate rules), and when coupled with a (task-specific) symbolic interpreter that is able to systematically filter the proposed set of rules, this hybrid approach achieves strong results across inductive reasoning benchmarks that require inducing causal relations, language-like instructions, and symbolic concepts. However, they also behave as puzzling inductive reasoners, showing notable performance gaps between rule induction (i.e., identifying plausible rules) and rule application (i.e., applying proposed rules to instances), suggesting that LMs are proposing hypotheses without being able to actually apply the rules. Through empirical and human analyses, we further reveal several discrepancies between the inductive reasoning processes of LMs and humans, shedding light on both the potentials and limitations of using LMs in inductive reasoning tasks.
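The propose-select-refine loop described in the abstract can be made concrete with a short sketch. The code below is a minimal illustration of iterative hypothesis refinement under stated assumptions, not the authors' released implementation: the `propose` callable stands in for the LM that generates candidate textual rules, the `interpreter` callable stands in for the task-specific symbolic interpreter that executes a rule on an input, and the toy task data are invented here for demonstration.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]   # (input, expected output)
Rule = str                  # a hypothesis expressed as a textual rule


def refine_hypothesis(
    propose: Callable[[List[Example], List[Example]], List[Rule]],  # LM stand-in
    interpreter: Callable[[Rule, str], str],                        # symbolic interpreter stand-in
    train: List[Example],
    num_iterations: int = 3,
) -> Rule:
    """Iterative hypothesis refinement: propose -> select -> refine."""
    feedback: List[Example] = []
    best_rule, best_score = "", -1.0
    for _ in range(num_iterations):
        # 1) Propose: sample candidate rules, conditioning on failures from the last round.
        candidates = propose(train, feedback)
        # 2) Select: score each candidate by executing it on the seen examples.
        for rule in candidates:
            hits = sum(interpreter(rule, x) == y for x, y in train)
            score = hits / len(train)
            if score > best_score:
                best_rule, best_score = rule, score
        if best_score == 1.0:
            break
        # 3) Refine: collect examples the current best rule still gets wrong
        #    and feed them back to the proposer in the next round.
        feedback = [(x, y) for x, y in train if interpreter(best_rule, x) != y]
    return best_rule


if __name__ == "__main__":
    # Toy demonstration with hand-written stand-ins for the LM and the interpreter.
    def toy_interpreter(rule: Rule, x: str) -> str:
        return x[::-1] if rule == "reverse the input" else x

    def toy_proposer(train: List[Example], feedback: List[Example]) -> List[Rule]:
        return ["copy the input", "reverse the input"]

    data = [("abc", "cba"), ("hello", "olleh")]
    print(refine_hypothesis(toy_proposer, toy_interpreter, data))  # -> "reverse the input"
```

In the hybrid setting the paper describes, the selection step is handled by a symbolic interpreter rather than by the LM itself; this separation is what exposes the gap between rule induction (proposing plausible rules) and rule application (executing those rules on instances).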

Authors (11)
  1. Linlu Qiu
  2. Liwei Jiang
  3. Ximing Lu
  4. Melanie Sclar
  5. Valentina Pyatkin
  6. Chandra Bhagavatula
  7. Bailin Wang
  8. Yoon Kim
  9. Yejin Choi
  10. Nouha Dziri
  11. Xiang Ren