Guess & Sketch: Language Model Guided Transpilation (2309.14396v2)
Abstract: Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e., automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task.
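The guess-then-sketch idea described in the abstract can be illustrated with a toy example. This is a minimal sketch under strong assumptions, not the authors' implementation: the instruction tuples, the function names (`run_block`, `make_sketch`, `solve_sketch`), the confidence values, and the threshold are all hypothetical, and a brute-force search checked on a handful of test inputs stands in for the symbolic solver. It only shows the shape of the pipeline: an LM-style guess with per-instruction confidence, low-confidence parts turned into holes, and a search that fills the holes so the target basic block matches the source.

```python
from itertools import product

def run_block(block, regs):
    """Execute a straight-line (non-branching) block on a copy of the registers."""
    regs = dict(regs)
    for op, dst, src, imm in block:
        if op == "add":
            regs[dst] = regs[src] + imm
        elif op == "sub":
            regs[dst] = regs[src] - imm
        elif op == "shl":
            regs[dst] = regs[src] << imm
        else:
            raise ValueError(f"unknown op: {op}")
    return regs

def make_sketch(guess, confidences, threshold):
    """Turn low-confidence immediates of the guessed block into holes (None)."""
    return [
        (op, dst, src, imm if conf >= threshold else None)
        for (op, dst, src, imm), conf in zip(guess, confidences)
    ]

def equivalent(source, target, src_reg, tgt_reg, test_values):
    """Toy stand-in for a solver query: compare outputs on sample inputs only."""
    return all(
        run_block(source, {src_reg: v})[src_reg]
        == run_block(target, {tgt_reg: v})[tgt_reg]
        for v in test_values
    )

def solve_sketch(sketch, source, src_reg, tgt_reg, candidates, test_values):
    """Enumerate hole fillings until the filled block agrees with the source."""
    holes = [i for i, ins in enumerate(sketch) if ins[3] is None]
    for values in product(candidates, repeat=len(holes)):
        filled = list(sketch)
        for i, v in zip(holes, values):
            op, dst, src, _ = filled[i]
            filled[i] = (op, dst, src, v)
        if equivalent(source, filled, src_reg, tgt_reg, test_values):
            return filled
    return None  # no filling matched; a real system would fall back or report failure

# Source block computes w0 = (w0 + 3) << 2.
source = [("add", "w0", "w0", 3), ("shl", "w0", "w0", 2)]
# Hypothetical LM guess for the target: the shift amount is wrong and low-confidence.
guess = [("add", "a0", "a0", 3), ("shl", "a0", "a0", 4)]
confidences = [0.98, 0.41]

sketch = make_sketch(guess, confidences, threshold=0.9)
fixed = solve_sketch(sketch, source, "w0", "a0", range(8), range(-4, 5))
print(fixed)
```

Checking on sample inputs gives no correctness guarantee; the point of using a symbolic solver in the approach described above is to establish equivalence of the basic blocks rather than agreement on a few tests.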
- Celine Lee
- Abdulrahman Mahmoud
- Michal Kurek
- Simone Campanoni
- David Brooks
- Stephen Chong
- Gu-Yeon Wei
- Alexander M. Rush