Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming
Abstract: Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem: producing a definition given a formal specification expressed as an F* type. We provide a program-fragment checker that queries F* to check the correctness of candidate solutions. We also report on an extended version of our dataset containing a total of 940K lines of programs and proofs, with a total of 54K top-level F* definitions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding is that fine-tuned smaller LLMs (such as Phi-2 or StarCoder) compare favorably with large LLMs (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques, and suggest directions for future improvements.
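The core synthesis task can be illustrated with a minimal sketch (the example below is hypothetical, not drawn from the dataset): the `val` declaration is the formal specification, expressed as an F* type, and the `let` definition is the program-and-proof a model must produce, which F*'s type-checker, with Z3 discharging the proof obligations, then accepts or rejects.

```fstar
(* Specification, given as an F* type: `max` must return one of its two
   arguments and be at least as large as both. *)
val max : x:nat -> y:nat -> r:nat{r >= x /\ r >= y /\ (r = x \/ r = y)}

(* Synthesis target: a candidate definition that F*'s type-checker,
   backed by the Z3 SMT solver, verifies against the type above. *)
let max x y = if x >= y then x else y
```

In the benchmark setting, a model receives the specification (plus retrieved context) and must emit the definition; the program-fragment checker then decides whether the candidate typechecks.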