Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic (2402.14798v3)

Published 22 Feb 2024 in cs.CL and cs.AI

Abstract: Recent LLMs enable new opportunities for structured reasoning with text, such as the construction of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what valid compositional entailment is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment and evaluate its impact on LLM-based textual inference. We find that our new dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency (+9%) than prior decompositional entailment datasets. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in an entailment tree reasoning engine significantly improves both accuracy and proof quality, illustrating the practical benefit of this advance for textual inference.


Summary

  • The paper presents a novel framework using RAS criteria to systematically annotate entailment trees in decompositional NLI.
  • It introduces the RDTE dataset with over 1,000 expert annotations, achieving a 9% improvement in internal consistency.
  • It develops TreeWise, which significantly boosts the quality and accuracy of proof-like entailment trees for complex QA tasks.

Enhancing Decompositional Natural Language Inference with Informal Logic for Systematic Reasoning

Introduction to Decompositional Natural Language Inference (NLI)

Decompositional Natural Language Inference (NLI) focuses on justifying a textual conclusion by breaking it down into simpler supporting statements. This paper leverages informal logic to improve the performance and consistency of decompositional NLI. The methodology centers on the construction and evaluation of entailment trees: structured arguments a model produces to justify its conclusions.
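As a rough illustration of the data structure involved, an entailment tree can be sketched as a recursive node type in which each internal node's statement is claimed to follow from its children. The class, example statements, and `leaves` helper below are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One step in an entailment tree: the children (premises),
    taken together, are claimed to entail this node's statement."""
    statement: str
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Collect the leaf facts the root conclusion ultimately rests on."""
        if not self.children:
            return [self.statement]
        return [leaf for child in self.children for leaf in child.leaves()]

# A toy tree: the root conclusion is decomposed into two premises,
# one of which is itself justified by a further sub-step.
tree = Node(
    "Plants need sunlight to grow",
    [
        Node("Plants perform photosynthesis"),
        Node(
            "Photosynthesis requires sunlight",
            [Node("Photosynthesis converts light energy into chemical energy")],
        ),
    ],
)
print(tree.leaves())
```

Evaluating such a tree then amounts to judging, at every internal node, whether the decomposition step is a valid entailment.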

Background and Motivation

The advent of LLMs has opened new possibilities for NLI by enabling the generation of intuitive, proof-like textual entailments. Despite this progress, the lack of a clear protocol for what constitutes valid compositional entailment has hampered further advances. This paper identifies that gap and proposes a framework for refining the annotation and evaluation of decompositional entailment datasets. The resulting RDTE dataset underscores the value of a robust, consistent methodology, as evidenced by its superior internal consistency (+9%) over prior datasets.

RAS Criteria and Annotation

The core of the proposed method is grounded in the "Relevance, Acceptability, and Sufficiency" (RAS) criteria from informal logic. These criteria provide a principled basis for evaluating the validity of arguments within entailment trees. The paper details a meticulous process of annotating decompositions against RAS, introducing more precision and nuance than existing binary judgments: each criterion is rated on a 5-point ordinal scale, allowing a more granular assessment of an argument's relevance, acceptability, and sufficiency.
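The annotation scheme above can be sketched as a small scoring rule: a decomposition step carries three ordinal ratings, and counts as a valid entailment only when every criterion clears a threshold. The class name and the threshold of 4 are illustrative assumptions for this sketch, not the paper's actual cutoff.

```python
from dataclasses import dataclass

@dataclass
class RASJudgment:
    """RAS ratings for one decomposition step, each on a 1-5 ordinal scale."""
    relevance: int      # do the premises bear on the conclusion?
    acceptability: int  # are the premises themselves believable?
    sufficiency: int    # do the premises, jointly, establish the conclusion?

def is_valid_step(j: RASJudgment, threshold: int = 4) -> bool:
    """A step is valid only if all three criteria meet the threshold
    (threshold chosen here for illustration)."""
    return min(j.relevance, j.acceptability, j.sufficiency) >= threshold

print(is_valid_step(RASJudgment(relevance=5, acceptability=5, sufficiency=4)))
# Relevant and believable premises can still fail on sufficiency alone:
print(is_valid_step(RASJudgment(relevance=5, acceptability=5, sufficiency=2)))
```

The point of the conjunction over all three criteria is that a single weak dimension, most often sufficiency, invalidates the step, which is exactly the kind of distinction a flat binary entailment label cannot express.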

Data Collection and RDTE Dataset

Through this systematic annotation process, the paper builds the RDTE (Recognizing Decompositional Textual Entailment) dataset of over 1,000 expert annotations with high internal consistency. It provides a challenging benchmark: preliminary findings indicate that existing LLMs, including GPT-4, fall well short of human-level performance on it.

Experiments and Findings

The paper reports a series of experiments applying the RDTE protocol across models and approaches, including knowledge distillation from GPT-4 into smaller, more efficient entailment classifiers. The results show notable gains in both accuracy and the quality of proof-like entailment trees. TreeWise, an entailment tree engine that incorporates these RDTE-oriented models, outperforms existing methods at generating high-quality entailment trees for complex QA tasks.
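One common way to set up this kind of distillation, sketched here under stated assumptions rather than as the paper's exact recipe, is to convert the teacher's ordinal entailment ratings into soft probability targets for the student classifier. The function name and the linear score-to-probability mapping are illustrative choices.

```python
def ordinal_to_soft_label(score: int, max_score: int = 5) -> float:
    """Map a teacher rating on a 1..max_score ordinal scale to a soft
    entailment probability in [0, 1] via a simple linear rescaling
    (an assumed mapping, for illustration)."""
    return (score - 1) / (max_score - 1)

# Teacher (e.g. a large model prompted with the annotation protocol)
# rates four candidate decomposition steps; the student would then be
# trained against these soft targets instead of hard 0/1 labels.
teacher_ratings = [5, 4, 2, 1]
soft_targets = [ordinal_to_soft_label(s) for s in teacher_ratings]
print(soft_targets)  # [1.0, 0.75, 0.25, 0.0]
```

Training on soft targets like these lets the smaller student inherit the teacher's graded notion of entailment strength rather than a lossy binary decision.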

Implications and Future Directions

The findings of this paper have profound implications for the development of explainable and trustworthy AI systems capable of complex reasoning tasks. By establishing a clear and principled framework for assessing decompositions, this work lays the groundwork for future improvements in NLI and related fields. The RDTE dataset serves as a valuable resource for advancing research, while TreeWise exemplifies the practical application of these advancements, offering a blueprint for future developments.

Conclusion

In summary, this paper presents a comprehensive approach to enhancing decompositional NLI through the lens of informal logic, culminating in the creation of the RDTE dataset and the development of TreeWise. These contributions represent a significant step forward in the quest for improved systematic reasoning in AI, with the potential to inform and inspire continued innovation in the field.

Limitations and Considerations

While RDTE and TreeWise mark notable advancements, their application and the generalizability of the RDTE protocol across different domains warrant further exploration. The domain-specific nature of argument sufficiency and the inherent potential for automated reasoning systems to amplify existing biases underline the need for cautious and considerate application of these technologies.

The exploration of these methodologies, datasets, and systems provides a compelling foundation for the future development of AI reasoning capabilities, guiding the way towards more accurate, transparent, and justifiable AI decision-making processes.