
Generating Natural Language Proofs with Verifier-Guided Search (2205.12443v3)

Published 25 May 2022 in cs.CL, cs.LG, and cs.LO

Abstract: Reasoning over natural language is a challenging problem in NLP. In this work, we focus on proof generation: Given a hypothesis and a set of supporting facts, the model generates a proof tree indicating how to derive the hypothesis from supporting facts. Compared to generating the entire proof in one shot, stepwise generation can better exploit the compositionality and generalize to longer proofs but has achieved limited success on real-world data. Existing stepwise methods struggle to generate proof steps that are both logically valid and relevant to the hypothesis. Instead, they tend to hallucinate invalid steps given the hypothesis. In this paper, we present a novel stepwise method, NLProofS (Natural Language Proof Search), which learns to generate relevant steps conditioning on the hypothesis. At the core of our approach, we train an independent verifier to check the validity of the proof steps to prevent hallucination. Instead of generating steps greedily, we search for proofs maximizing a global proof score judged by the verifier. NLProofS achieves state-of-the-art performance on EntailmentBank and RuleTaker. Specifically, it improves the correctness of predicted proofs from 27.7% to 33.3% in the distractor setting of EntailmentBank, demonstrating the effectiveness of NLProofS in generating challenging human-authored proofs.

Authors (3)
  1. Kaiyu Yang (24 papers)
  2. Jia Deng (93 papers)
  3. Danqi Chen (84 papers)
Citations (58)

Summary

Generating Natural Language Proofs with Verifier-Guided Search

The paper "Generating Natural Language Proofs with Verifier-Guided Search" presents NLProofS, a novel approach for generating natural language proofs in NLP. Given the intrinsic challenges of reasoning in natural language, the authors focus on proof generation, where a model constructs a proof tree showing how a given hypothesis can be derived from a set of supporting facts. This task demands compositional generalization, which is often an obstacle for current LLMs due to issues such as hallucinating invalid proof steps. To address this, the paper proposes NLProofS, a stepwise method supported by an independent verifier.

Key Methodology and Contributions

NLProofS improves proof generation by conditioning each proof step on the hypothesis and utilizing an independent verifier to evaluate the logical validity of generated steps, thus addressing previous models' tendencies to hallucinate steps. Unlike greedy stepwise generation methods, NLProofS leverages search strategies to maximize a global proof score judged by the verifier, which aggregates individual step scores. This process involves a stepwise prover—fine-tuned from a T5 model—that generates candidate steps, and a verifier—based on RoBERTa—that scores their validity, allowing the method to focus on both valid and relevant proof steps.
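The prover–verifier interplay described above can be sketched roughly as follows. This is an illustrative outline, not the paper's implementation: the `prover`/`verifier` interfaces, the sentence representations, and the 0.5 acceptance threshold are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class ProofStep:
    premises: tuple   # identifiers of the input sentences
    conclusion: str   # intermediate conclusion generated by the prover
    score: float      # verifier's validity estimate in [0, 1]


def generate_and_verify(prover, verifier, hypothesis, context,
                        num_candidates=10, threshold=0.5):
    """Sample candidate proof steps conditioned on the hypothesis,
    then score each step with an independent verifier and keep only
    those the verifier considers plausibly valid."""
    candidates = prover.sample_steps(hypothesis, context, n=num_candidates)
    scored = [
        ProofStep(c.premises, c.conclusion,
                  verifier.validity(c.premises, c.conclusion))
        for c in candidates
    ]
    return [s for s in scored if s.score > threshold]
```

In the paper the prover is a fine-tuned T5 model and the verifier a RoBERTa-based scorer; here both are abstracted behind hypothetical `sample_steps` and `validity` methods.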

The paper introduces the proof graph, a directed acyclic graph (DAG) representing the search space of proofs, which allows NLProofS to explore multiple proof paths robustly. At inference time, the model considers several candidate derivations and selects the proof with the highest verifier-judged validity score.
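Scoring a proof over such a DAG can be sketched as a recursive search: each derived sentence may be produced by several candidate steps, and the global score of a proof is an aggregate of its step validities. In the sketch below, aggregating with `min` (so the weakest step bounds the whole proof) and scoring supporting facts as 1.0 are illustrative assumptions, not necessarily the paper's exact formulation.

```python
def best_proof_score(graph, sentence, memo=None):
    """Best achievable proof score for `sentence` in a proof graph.

    `graph` maps each derivable sentence to a list of (premises, step_score)
    pairs, one per candidate step that concludes it. Sentences absent from
    the map are treated as given supporting facts and score 1.0. A proof's
    score is the minimum validity along it; the search maximizes that score
    over alternative derivations, memoizing sub-results over the DAG.
    """
    if memo is None:
        memo = {}
    if sentence in memo:
        return memo[sentence]
    steps = graph.get(sentence)
    if not steps:                     # a given supporting fact
        memo[sentence] = 1.0
        return 1.0
    best = 0.0
    for premises, step_score in steps:
        premise_scores = [best_proof_score(graph, p, memo) for p in premises]
        best = max(best, min([step_score] + premise_scores))
    memo[sentence] = best
    return best
```

For example, a hypothesis derivable either through a strong intermediate step or directly through a weak one would receive the score of the stronger derivation.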

Empirical Results and Implications

NLProofS achieved state-of-the-art results on both the EntailmentBank and RuleTaker datasets. Specifically, in EntailmentBank's Task 2 (distractor) setting, NLProofS improved Overall-AllCorrect from 27.7% to 33.3%, demonstrating a markedly better ability to reconstruct challenging human-authored proofs than previous methods. On RuleTaker, which uses template-generated English sentences, NLProofS remained competitive, underscoring its versatility across benchmarks.

The results suggest that incorporating a verifier can significantly boost proof accuracy by steering the model away from invalid steps that merely resemble the hypothesis. The verifier acts as a corrective signal during proof search, reducing hallucination and promoting logical consistency in the generated proofs.

Challenges and Future Directions

Despite its efficacy, the paper acknowledges several areas for improvement. The prover's beam search can produce redundant candidates, and the generated proof steps often lack diversity; future work could explore techniques such as Diverse Beam Search to broaden the candidate pool. In addition, the approach encodes its input by concatenating text, which may not scale well to larger contexts or more complex sentences, suggesting a need for more scalable encoding techniques.

Another notable aspect for future exploration is how NLProofS can be applied to broader NLP tasks beyond proof generation, such as multi-hop QA or fact verification, where structured reasoning is pivotal.

Conclusion

NLProofS represents a significant advancement in generating structured natural language proofs, crucial for explainable AI systems. The verifier-guided search framework not only improves accuracy but also opens pathways for creating more trustworthy and logically consistent NLP models. As automated reasoning continues to evolve, integrating verifier mechanisms could enhance the robustness and reliability of reasoning systems across diverse applications.