Self-Supervised Learning to Prove Equivalence Between Straight-Line Programs via Rewrite Rules (2109.10476v4)
Abstract: We target the problem of automatically synthesizing proofs of semantic equivalence between two programs made of sequences of statements. We represent programs using abstract syntax trees (ASTs), where each rule in a given set of semantics-preserving rewrite rules can be applied to a specific AST pattern to generate a transformed, semantically equivalent program. In our system, two programs are equivalent if there exists a sequence of applications of these rewrite rules that rewrites one program into the other. We propose a neural network architecture based on a transformer model to generate proofs of equivalence between program pairs. The system outputs a sequence of rewrites, and the validity of the sequence is checked simply by verifying that it can be applied. If the neural network produces no valid sequence, the system reports the programs as non-equivalent, ensuring by design that no programs are incorrectly reported as equivalent. Our system is fully implemented for a single grammar that can represent straight-line programs with function calls and multiple types. To train the system efficiently to generate such sequences, we develop an original incremental training technique, named self-supervised sample selection. We extensively study the effectiveness of this novel training approach on proofs of increasing complexity and length. Our system, S4Eq, achieves 97% proof success on a curated dataset of 10,000 pairs of equivalent programs.
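The checking step described in the abstract (a proof is a sequence of rewrites, and it is valid only if every rewrite can actually be applied and the result matches the second program) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tuple encoding of ASTs, the two toy rules `Commute` and `AddZero`, and the helper names `apply_at` and `check_proof` are hypothetical and are not the grammar, rule set, or code of S4Eq. The neural model that proposes the sequence is omitted; only the verifier is shown.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple, Union

# Hypothetical AST encoding: a leaf is a string, an inner node is
# (operator, left, right). This is NOT the grammar used by the paper.
Expr = Union[str, Tuple[str, "Expr", "Expr"]]
Path = Sequence[int]  # child indices from the root: 1 = left, 2 = right
Rule = Callable[[Expr], Optional[Expr]]


def commute(e: Expr) -> Optional[Expr]:
    """a op b -> b op a for op in {+, *}; None if the pattern does not match."""
    if isinstance(e, tuple) and e[0] in ("+", "*"):
        return (e[0], e[2], e[1])
    return None


def add_zero(e: Expr) -> Optional[Expr]:
    """a + 0 -> a; None if the pattern does not match."""
    if isinstance(e, tuple) and e[0] == "+" and e[2] == "0":
        return e[1]
    return None


# Toy rule set (an assumption for this sketch).
RULES: Dict[str, Rule] = {"Commute": commute, "AddZero": add_zero}


def apply_at(e: Expr, path: Path, rule: Rule) -> Optional[Expr]:
    """Apply `rule` to the subtree of `e` at `path`; None if it cannot be applied."""
    if not path:
        return rule(e)
    if not isinstance(e, tuple) or path[0] not in (1, 2):
        return None  # the path leaves the tree
    child = apply_at(e[path[0]], path[1:], rule)
    if child is None:
        return None
    parts = list(e)
    parts[path[0]] = child
    return tuple(parts)


def check_proof(source: Expr, target: Expr,
                proof: List[Tuple[str, Path]]) -> bool:
    """Return True only if every step applies and the result equals `target`."""
    current: Optional[Expr] = source
    for name, path in proof:
        rule = RULES.get(name)
        if rule is None:
            return False  # unknown rule name: reject the proof
        current = apply_at(current, path, rule)
        if current is None:
            return False  # step not applicable: reject the proof
    return current == target


source = ("*", ("+", "0", "a"), "b")   # (0 + a) * b
target = ("*", "b", "a")               # b * a
good = [("Commute", [1]), ("AddZero", [1]), ("Commute", [])]
bad = [("AddZero", [1])]               # 0 + a does not match a + 0

print(check_proof(source, target, good))  # True
print(check_proof(source, target, bad))   # False
```

Because `check_proof` accepts only sequences that apply step by step and end at the target, a wrong sequence can produce a missed proof but never a false claim of equivalence, which mirrors the guarantee stated in the abstract.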