
Deductive Verification of Chain-of-Thought Reasoning (2306.03872v3)

Published 6 Jun 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs significantly benefit from Chain-of-Thought (CoT) prompting in performing various reasoning tasks. While CoT allows models to produce more comprehensive reasoning processes, its emphasis on intermediate reasoning steps can inadvertently introduce hallucinations and accumulated errors, thereby limiting models' ability to solve complex reasoning tasks. Inspired by how humans engage in careful and meticulous deductive logical reasoning processes to solve tasks, we seek to enable LLMs to perform explicit and rigorous deductive reasoning, and also ensure the trustworthiness of their reasoning process through self-verification. However, directly verifying the validity of an entire deductive reasoning process is challenging, even with advanced models like ChatGPT. In light of this, we propose to decompose a reasoning verification process into a series of step-by-step subprocesses, each only receiving their necessary context and premises. To facilitate this procedure, we propose Natural Program, a natural language-based deductive reasoning format. Our approach enables models to generate precise reasoning steps where subsequent steps are more rigorously grounded on prior steps. It also empowers LLMs to carry out reasoning self-verification in a step-by-step manner. By integrating this verification process into each deductive reasoning stage, we significantly enhance the rigor and trustfulness of generated reasoning steps. Along this process, we also improve the answer correctness on complex reasoning tasks. Code will be released at https://github.com/lz1oceani/verify_cot.

Deductive Verification of Chain-of-Thought Reasoning

The paper "Deductive Verification of Chain-of-Thought Reasoning" explores enhancing LLMs through a rigorous verification approach that mitigates common issues associated with Chain-of-Thought (CoT) prompting. While CoT prompting aids in producing comprehensive reasoning, it is susceptible to hallucinations and errors, necessitating a reliable verification mechanism.

Overview

The authors address a significant limitation of LLMs: despite their capabilities, they often fail to reason cogently because errors accumulate across intermediate steps. Inspired by human deductive reasoning, the paper introduces a structured approach that breaks reasoning verification down into manageable subprocesses. This is achieved through the "Natural Program" format, designed to elicit precise and rigorously grounded reasoning steps.

Methodology

The Natural Program format is the cornerstone of the approach. It requires each reasoning step to be explicitly supported by the premises it relies on, curtailing the extraneous information that can derail logical deduction. By leveraging this structured format, models are prompted (rather than retrained) to verify their reasoning iteratively, identifying and addressing errors at each step before proceeding; an illustrative sketch of the format appears below.
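
To illustrate, here is a paraphrased sketch of what a Natural-Program-style solution might look like. The numbered-premise convention and the "(by #i, #j)" citations follow the general idea described in the paper; the exact wording of the released prompts may differ.

Question: Alice has 3 apples and buys 2 bags with 4 apples each. How many apples does she have now?

#1. Alice has 3 apples. (premise)
#2. Alice buys 2 bags with 4 apples each. (premise)
#3. How many apples does Alice have now? (question)
#4. (by #2) The bags contain 2 * 4 = 8 apples.
#5. (by #1, #4) Alice has 3 + 8 = 11 apples now.
#6. (by #5) The final answer is 11.

Because every derived step names the statements it depends on, a verifier can check each step against exactly the context it claims to use.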

Significant emphasis is placed on decomposing the verification process. Rather than attempting to validate an entire reasoning chain at once, the paper advocates verifying each step individually, given only the premises and prior steps that step explicitly relies on; keeping the verification context minimal makes errors easier to catch and reduces the likelihood of oversight. The Natural Program format thus lets LLMs self-verify their reasoning, enhancing both its rigor and trustworthiness; a minimal sketch of such a verification loop follows.
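
The following Python sketch shows one way such a decomposed verification loop could be implemented, assuming a Natural-Program-style solution has already been parsed into numbered statements. The helper ask_llm is a hypothetical placeholder for an LLM API call and the prompt wording is illustrative; neither is taken from the paper's released code.

import re

def ask_llm(prompt: str) -> str:
    # Hypothetical LLM call; replace with a real chat/completion client.
    raise NotImplementedError

def cited_context(step: str, statements: dict) -> str:
    # Collect only the statements this step explicitly cites, e.g. "(by #1, #4)".
    match = re.search(r"\(by ([^)]*)\)", step)
    if not match:
        return ""
    cited = [int(i) for i in re.findall(r"#(\d+)", match.group(1))]
    return "\n".join(statements[i] for i in cited if i in statements)

def verify_solution(statements: dict) -> bool:
    # statements maps index -> statement text; premises, the question, and
    # derived steps share one numbering, as in the Natural Program format.
    for idx in sorted(statements):
        step = statements[idx]
        if "(by" not in step:
            continue  # premises and the restated question are taken as given
        prompt = (
            "Premises:\n"
            f"{cited_context(step, statements)}\n\n"
            "Does the following step follow deductively from the premises? "
            f"Answer Yes or No.\n{step}"
        )
        verdict = ask_llm(prompt)
        if not verdict.strip().lower().startswith("yes"):
            return False  # one invalid step invalidates the whole chain
    return True

In the paper, each step is typically checked with several sampled verdicts that are aggregated by voting; the sketch above uses a single verdict per step for brevity.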

Experimental Results

Experiments conducted across arithmetic and commonsense reasoning datasets demonstrate the framework's efficacy. Applying deductive verification improved the correctness of solutions on complex reasoning tasks, as evidenced by evaluations on benchmarks such as GSM8K and MATH. Notably, the rigorous format produced coherent and traceable reasoning paths, improving overall performance.

Implications and Future Work

The implications of this research for AI are substantial. By instilling a rigorous verification method, LLMs could be adapted to domains that demand high accuracy and reliability, such as legal reasoning or scientific research. Additionally, reducing hallucinations, a persistent issue in LLM deployment, enhances user trust and model applicability.

Future developments may focus on further refining the verification process, for example by extending the Natural Program format to accommodate more complex reasoning structures or by integrating modules that allow context adaptation without retraining. Another avenue is exploring alternative ways of detecting and discarding irrelevant context during reasoning, pushing the boundaries of what LLMs can achieve in terms of precise and reliable outputs.

In conclusion, the paper's contribution is a significant advancement towards creating more reliable and trustworthy AI systems through meticulous deductive verification of CoT reasoning, setting a foundational paradigm for future enhancements in LLM reasoning capabilities.

Authors (7)
  1. Zhan Ling
  2. Yunhao Fang
  3. Xuanlin Li
  4. Zhiao Huang
  5. Mingu Lee
  6. Roland Memisevic
  7. Hao Su