ATG: Benchmarking Automated Theorem Generation for Generative Language Models (2405.06677v1)

Published 5 May 2024 in cs.CL and cs.AI

Abstract: Humans can develop new theorems to explore broader and more complex mathematical results. While current generative language models (LMs) have achieved significant improvement in automatically proving theorems, their ability to generate new or reusable theorems is still under-explored. Without such new theorems, current LMs struggle to prove harder theorems that are distant from the given hypotheses, because the search space grows exponentially. This paper therefore proposes an Automated Theorem Generation (ATG) benchmark that evaluates whether an agent can automatically generate valuable (and possibly brand-new) theorems that are applicable to downstream theorem proving as reusable knowledge. Specifically, we construct the ATG benchmark by splitting the Metamath library into three sets: axioms, library, and problems, based on their proving depth. We conduct extensive experiments to investigate whether current LMs can generate theorems in the library set and thereby benefit proving of the problem-set theorems. The results demonstrate that high-quality ATG data improves model performance on downstream automated theorem proving (ATP). However, there is still room for current LMs to develop better ATG and generate more advanced, human-like theorems. We hope the new ATG challenge can shed light on advanced, complex theorem proving.
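The depth-based split is the core of the benchmark construction, so a small sketch may help make it concrete. The following is a minimal, hypothetical illustration, not the paper's actual pipeline: it assumes a dependency map (here `deps`) extracted from Metamath proofs, defines proving depth over that DAG with axioms at depth 0, and partitions theorems by an illustrative cutoff `LIBRARY_MAX_DEPTH`. The theorem labels and the threshold value are made up for the example; the paper's real cutoffs may differ.

```python
from functools import lru_cache

# Hypothetical dependency map: each theorem label maps to the labels of the
# theorems/axioms its proof invokes. In Metamath this would be extracted
# from the proof steps of each $p statement in set.mm.
deps = {
    "ax-mp": [],                 # axiom: no dependencies
    "ax-1": [],                  # axiom: no dependencies
    "a1i": ["ax-1", "ax-mp"],    # proved directly from axioms
    "mp2": ["ax-mp", "a1i"],     # proved using an earlier theorem
}

@lru_cache(maxsize=None)
def proving_depth(label: str) -> int:
    """Depth in the dependency DAG: axioms sit at depth 0; every other
    theorem is one level above its deepest premise."""
    premises = deps[label]
    if not premises:
        return 0
    return 1 + max(proving_depth(p) for p in premises)

# Illustrative threshold: shallow theorems form the reusable library,
# deeper ones become held-out problems.
LIBRARY_MAX_DEPTH = 1

axioms = [t for t in deps if proving_depth(t) == 0]
library = [t for t in deps if 0 < proving_depth(t) <= LIBRARY_MAX_DEPTH]
problems = [t for t in deps if proving_depth(t) > LIBRARY_MAX_DEPTH]

print(axioms, library, problems)
# ['ax-mp', 'ax-1'] ['a1i'] ['mp2']
```

Under a split like this, the shallow theorems act as the reusable library an ATG agent is asked to generate, while the deeper theorems form the problems used to measure downstream proving gains.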
