Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization (2403.18120v1)

Published 26 Mar 2024 in cs.AI, cs.CL, and cs.LG

Abstract: Large language models (LLMs), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of LLMs contained sufficiently many examples of formal mathematics (e.g., in Isabelle, a formal theorem proving environment), they can be prompted to translate, i.e., autoformalize, informal mathematical statements into formal Isabelle code -- which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on the GSM8K, MATH, and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting -- the previously best method for identifying correct answers -- improving on it by more than 12% on GSM8K. In our experiments it improves results consistently across all datasets and LLM sizes. The code can be found at https://github.com/jinpz/dtv.
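The abstract describes selecting answers by sampling several candidate solutions, autoformalizing each into Isabelle, discarding candidates whose formalization fails to verify, and majority-voting over the survivors. Below is a minimal Python sketch of that selection loop, assuming hypothetical callables for LLM sampling (`sample_informal_solutions`), answer extraction (`extract_answer`), autoformalization (`autoformalize`), and Isabelle checking (`isabelle_accepts`); these names are illustrative placeholders, not the interface of the paper's released code in the linked repository.

```python
# Sketch of verification-filtered majority voting, following the abstract.
# All callables passed in are hypothetical stand-ins for an LLM sampler,
# an LLM autoformalizer, an answer extractor, and an Isabelle checker.
from collections import Counter
from typing import Callable, List, Optional


def filtered_majority_vote(
    problem: str,
    sample_informal_solutions: Callable[[str, int], List[str]],
    extract_answer: Callable[[str], str],
    autoformalize: Callable[[str, str], str],
    isabelle_accepts: Callable[[str], bool],
    n_samples: int = 64,
) -> Optional[str]:
    """Sample candidate solutions, keep only those whose autoformalized
    Isabelle version verifies, then majority-vote over surviving answers."""
    candidates = sample_informal_solutions(problem, n_samples)

    verified_answers: List[str] = []
    for solution in candidates:
        theory = autoformalize(problem, solution)  # informal solution -> Isabelle code
        if isabelle_accepts(theory):               # reject inconsistent formalizations
            verified_answers.append(extract_answer(solution))

    # Assumed fallback: if verification rejects every sample, fall back to
    # plain majority voting over all candidates rather than abstaining.
    pool = verified_answers or [extract_answer(s) for s in candidates]
    if not pool:
        return None
    return Counter(pool).most_common(1)[0][0]
```

The fallback to unfiltered majority voting is an assumption of this sketch, chosen so the formal filter only prunes candidates and never leaves the selector without an answer.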
