
Assessing Logical Reasoning Capabilities of Encoder-Only Transformer Models (2312.11720v2)

Published 18 Dec 2023 in cs.CL and cs.AI

Abstract: Logical reasoning is central to complex human activities, such as thinking, debating, and planning; it is also a central component of many AI systems. In this paper, we investigate the extent to which encoder-only transformer language models (LMs) can reason according to logical rules. We ask whether those LMs can deduce theorems in propositional calculus and first-order logic; whether their relative success in these problems reflects general logical capabilities; and which layers contribute the most to the task. First, we show for several encoder-only LMs that they can be trained, to a reasonable degree, to determine logical validity on various datasets. Next, by cross-probing fine-tuned models on these datasets, we show that LMs have difficulty transferring their putative logical reasoning ability, which suggests that they may have learned dataset-specific features instead of a general capability. Finally, we conduct a layerwise probing experiment, which shows that the hypothesis classification task is mostly solved through higher layers.
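
The layerwise probing experiment described above can be pictured as a frozen encoder whose per-layer representations are each fed to a small classifier trained to predict logical validity; the layer at which probe accuracy rises indicates where the task is effectively solved. The sketch below is a minimal illustration under assumptions not taken from the paper: it uses roberta-base from Hugging Face Transformers, a scikit-learn logistic-regression probe on the first-token embedding of each layer, and two toy premise–conclusion strings as placeholder data rather than the paper's datasets.

```python
# Minimal layerwise-probing sketch (assumptions: roberta-base encoder,
# logistic-regression probes, toy placeholder data -- not the paper's setup).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)
encoder.eval()

def layer_embeddings(texts, batch_size=16):
    """Return per-layer first-token embeddings: one (n_examples, hidden_size) array per layer."""
    per_layer = None
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            # hidden_states is a tuple: embedding layer + one tensor per encoder layer
            hidden_states = encoder(**batch).hidden_states
            cls_states = [h[:, 0, :] for h in hidden_states]
            if per_layer is None:
                per_layer = [[c] for c in cls_states]
            else:
                for store, c in zip(per_layer, cls_states):
                    store.append(c)
    return [torch.cat(chunks).numpy() for chunks in per_layer]

# Hypothetical validity-labeled examples (1 = logically valid inference).
train_texts = [
    "If it rains, the grass is wet. It rains. Therefore, the grass is wet.",
    "If it rains, the grass is wet. The grass is wet. Therefore, it rains.",
]
train_labels = [1, 0]
test_texts, test_labels = train_texts, train_labels  # placeholder split

train_layers = layer_embeddings(train_texts)
test_layers = layer_embeddings(test_texts)

# Fit one linear probe per layer; the accuracy trend across layers shows
# where the information needed for the validity decision becomes available.
for layer_idx, (X_train, X_test) in enumerate(zip(train_layers, test_layers)):
    probe = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    acc = accuracy_score(test_labels, probe.predict(X_test))
    print(f"layer {layer_idx:2d}: probe accuracy = {acc:.2f}")
```

In the paper's setting, the probes would be trained on the actual validity-labeled datasets and on a proper train/test split; what matters is the relative accuracy across layers, not the absolute numbers from a toy example like this one.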

Authors (5)
  1. Paulo Pirozelli (7 papers)
  2. Marcos M. José (6 papers)
  3. Paulo de Tarso P. Filho (1 paper)
  4. Anarosa A. F. Brandão (4 papers)
  5. Fabio G. Cozman (13 papers)
Citations (1)