Towards Translating Real-World Code with LLMs: A Study of Translating to Rust (2405.11514v2)

Published 19 May 2024 in cs.SE

Abstract: LLMs show promise in code translation - the task of translating code written in one programming language to another - due to their ability to write code in most programming languages. However, LLMs' effectiveness in translating real-world code remains largely unstudied. In this work, we perform the first substantial study of LLM-based translation to Rust by assessing the ability of five state-of-the-art LLMs: GPT-4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open-source projects. To enable our study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check whether a Rust translation is I/O equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLMs' ability to produce an initially successful translation and their capacity to fix a previously generated buggy one. If the original and translated programs are not I/O equivalent, we apply a set of automated feedback strategies, including feedback to the LLM with counterexamples. Our results show that the most successful LLM can translate 47% of our benchmarks; they also provide insights into next steps for improvements.

Assessing LLMs for Translating Real-World Code to Rust

The paper "Towards Translating Real-World Code with LLMs: A Study of Translating to Rust" addresses the challenges of translating code from various programming languages to Rust using LLMs. This paper is pivotal as it shifts focus from traditional competitive programming benchmarks to the more complex and variable field of real-world code.

Methodology and Tools

The authors introduce Flourine, an end-to-end tool designed to facilitate this translation process. Flourine's primary function is to use differential fuzzing to check that translated Rust code maintains input/output equivalence with the original source code, which removes the need for pre-existing test cases. The paper evaluates five state-of-the-art LLMs: GPT-4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral.
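To make the equivalence check concrete, here is a minimal sketch of differential fuzzing in Rust. The function names, the PRNG, and the integer-slice input shape are illustrative assumptions, not Flourine's actual implementation; in practice the original program would be reached via FFI or a subprocess.

// A minimal sketch of the differential-fuzzing equivalence check, assuming
// both programs are callable from a Rust harness. reference_impl stands in
// for the original source program and translated_impl for the LLM-generated
// Rust; both bodies are placeholders and happen to agree, so this sketch
// reports no divergence.

fn reference_impl(xs: &[i64]) -> i64 {
    xs.iter().sum() // placeholder for the original program's behavior
}

fn translated_impl(xs: &[i64]) -> i64 {
    xs.iter().fold(0, |acc, x| acc + x) // placeholder for the translation
}

// Tiny deterministic PRNG so the sketch needs no external crates.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

// Fuzz random inputs; return the first input on which the two programs
// disagree (a counterexample), or None if no divergence was observed.
fn differential_fuzz(iterations: usize) -> Option<Vec<i64>> {
    let mut rng = Lcg(0xdead_beef);
    for _ in 0..iterations {
        let len = (rng.next() % 16) as usize;
        let input: Vec<i64> = (0..len).map(|_| (rng.next() % 1000) as i64).collect();
        if reference_impl(&input) != translated_impl(&input) {
            return Some(input); // fed back to the LLM as a counterexample
        }
    }
    None
}

fn main() {
    match differential_fuzz(10_000) {
        Some(cx) => println!("not I/O equivalent; counterexample: {:?}", cx),
        None => println!("no divergence found on fuzzed inputs"),
    }
}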

Evaluation and Results

The research evaluates the ability of these LLMs both to produce out-of-the-box translations and to repair translations that initially exhibit bugs. The authors apply several automatic feedback strategies, including feedback with counterexamples, to improve the success rate; a sketch of this feedback loop follows. The analysis covers 8160 translation experiments across 408 code samples sourced from diverse real-world projects, predominantly written in C and Go.
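As a hypothetical illustration of counterexample feedback, the sketch below packages a diverging input found by fuzzing into a repair prompt. The function counterexample_prompt and its wording are assumptions, not Flourine's actual prompt template.

// Hypothetical sketch: turn a fuzzing counterexample into a repair prompt
// for the LLM. The wording is illustrative, not Flourine's actual template.
fn counterexample_prompt(rust_code: &str, input: &str, expected: &str, actual: &str) -> String {
    format!(
        "The following Rust translation is incorrect.\n\n{rust_code}\n\n\
         For input {input} it returns {actual}, but the original program \
         returns {expected}. Please fix the translation so the outputs match."
    )
}

fn main() {
    // Example: a buggy translation that subtracts instead of adding.
    let prompt = counterexample_prompt(
        "fn add(a: i64, b: i64) -> i64 { a - b }",
        "(2, 3)",
        "5",
        "-1",
    );
    println!("{prompt}");
}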

Among the key findings, Claude 3 and Claude 2.1 achieve the highest success rates, translating 47% of the benchmarks, while Mixtral performs worst at about 21%. Success rates also vary significantly with code complexity, in particular the number of lines and functions in a sample.

Addressing Challenges

The paper finds that translation accuracy decreases for larger code samples, which the authors attribute to the inherent stochastic nature of LLMs: a long run of correct token predictions becomes less likely as code length grows. They propose dividing larger programs into smaller segments as a potential strategy for improving translation success rates. Furthermore, by running Clippy, Rust's linting tool, they observe that while the translations are often syntactically correct, there is room for improvement in adhering to idiomatic Rust guidelines, as illustrated below.
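As a hedged illustration of this idiomaticity gap, the invented example below shows a pattern Clippy commonly flags in literally translated code (the clippy::needless_range_loop lint) alongside the rewrite it suggests.

// A pattern Clippy flags in literal translations (clippy::needless_range_loop):
// indexing a slice with a C-style counter. Both functions are invented examples.
fn sum_non_idiomatic(xs: &[i64]) -> i64 {
    let mut total = 0;
    for i in 0..xs.len() {
        total += xs[i];
    }
    total
}

// The idiomatic rewrite Clippy nudges toward: iterate over elements directly.
fn sum_idiomatic(xs: &[i64]) -> i64 {
    xs.iter().sum()
}

fn main() {
    let data = [1, 2, 3];
    assert_eq!(sum_non_idiomatic(&data), sum_idiomatic(&data));
}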

Contrast with Rule-Based Translations

The paper also contrasts LLM-based translations with traditional rule-based translation tools like C2Rust. While rule-based tools ensure syntactic correctness, they often produce verbose and non-idiomatic code. In contrast, LLMs tend to generate more concise and idiomatic Rust code.
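The difference is easiest to see side by side. The sketch below is illustrative only, not actual C2Rust or LLM output: the first function mimics the unsafe, pointer-level Rust a rule-based transpiler tends to produce for a C strlen, while the second shows the safe, idiomatic style an LLM is more likely to generate.

// Illustrative contrast only, not actual tool output. A rule-based
// transpiler preserves C pointer semantics, yielding unsafe Rust:
unsafe fn strlen_transpiled(mut s: *const u8) -> usize {
    let mut n = 0;
    while *s != 0 {
        n += 1;
        s = s.offset(1);
    }
    n
}

// An LLM translation tends toward safe, idiomatic Rust instead:
// find the first NUL byte, or treat the whole slice as the string.
fn strlen_idiomatic(bytes: &[u8]) -> usize {
    bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len())
}

fn main() {
    let c_string = b"hello\0";
    let via_pointers = unsafe { strlen_transpiled(c_string.as_ptr()) };
    assert_eq!(via_pointers, strlen_idiomatic(c_string));
}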

Implications and Future Directions

This research has significant practical implications, particularly for developers seeking to modernize legacy code bases by translating them to safer languages like Rust. Theoretically, it opens avenues for further research in improving LLM-based code translation accuracy, especially in addressing larger and more complex code structures.

Future research could explore enhanced feedback mechanisms for LLMs to learn from counterexamples more effectively, and investigate techniques for better segmentation of code to handle complexity. With the ongoing development of LLMs, further studies might also refine the models' capabilities in understanding and generating code that closely aligns with language-specific idioms and standards.

Overall, this work is a robust exploratory study of leveraging advanced AI models for practical software engineering tasks, highlighting challenges and offering insights into potential improvements that align with current and future directions in AI-assisted programming.

References (51)
  1. “C to go translator.” https://github.com/gotranspile/cxgo.
  2. “Sharpen - automated Java to C# conversion.” https://github.com/mono/sharpen.
  3. “C2rust transpiler.” https://c2rust.com/.
  4. Z. Tang, M. Agarwal, A. Shypula, B. Wang, D. Wijaya, J. Chen, and Y. Kim, “Explain-then-translate: an analysis on improving program translation with self-generated explanations,” in Findings of the Association for Computational Linguistics: EMNLP 2023 (H. Bouamor, J. Pino, and K. Bali, eds.), (Singapore), pp. 1741–1788, Association for Computational Linguistics, Dec. 2023.
  5. B. Rozière, M. Lachaux, L. Chanussot, and G. Lample, “Unsupervised translation of programming languages,” in NeurIPS, 2020.
  6. B. Rozière, J. Zhang, F. Charton, M. Harman, G. Synnaeve, and G. Lample, “Leveraging automated unit tests for unsupervised code translation,” in ICLR, OpenReview.net, 2022.
  7. M. Szafraniec, B. Roziere, H. Leather, F. Charton, P. Labatut, and G. Synnaeve, “Code translation with compiler representations,” ICLR, 2023.
  8. R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, and R. Jabbarvand, “Lost in translation: A study of bugs introduced by large language models while translating code,” 2024.
  9. P. Jana, P. Jha, H. Ju, G. Kishore, A. Mahajan, and V. Ganesh, “Attention, compilation, and solver-based symbolic analysis are all you need,” arXiv preprint arXiv:2306.06755, 2023.
  10. R. Puri, D. S. Kung, G. Janssen, W. Zhang, G. Domeniconi, V. Zolotov, J. Dolby, J. Chen, M. Choudhury, L. Decker, et al., “Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks,” arXiv preprint arXiv:2105.12655, 2021.
  11. W. U. Ahmad, M. G. R. Tushar, S. Chakraborty, and K.-W. Chang, “Avatar: A parallel corpus for java-python program translation,” arXiv preprint arXiv:2108.11590, 2021.
  12. J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  13. M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
  14. P. Deligiannis, A. Lal, N. Mehrotra, and A. Rastogi, “Fixing rust compilation errors using llms,” arXiv preprint arXiv:2308.05177, 2023.
  15. J. Zhang, P. Nie, J. J. Li, and M. Gligoric, “Multilingual code co-evolution using large language models,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 695–707, 2023.
  16. Q. Zhang, J. Wang, G. H. Xu, and M. Kim, “Heterogen: transpiling c to heterogeneous hls code with automated test generation and program repair,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’22, (New York, NY, USA), p. 1017–1029, Association for Computing Machinery, 2022.
  17. B. Mariano, Y. Chen, Y. Feng, G. Durrett, and I. Dillig, “Automated transpilation of imperative to functional code using neural-guided program synthesis,” Proceedings of the ACM on Programming Languages, vol. 6, no. OOPSLA1, pp. 1–27, 2022.
  18. H. F. Eniser, V. Wüstholz, and M. Christakis, “Automatically testing functional properties of code translation models,” arXiv preprint arXiv:2309.12813, 2023.
  19. M. Jiao, T. Yu, X. Li, G. Qiu, X. Gu, and B. Shen, “On the evaluation of neural code translation: Taxonomy and benchmark,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1529–1541, IEEE, 2023.
  20. H. Zhang, C. David, Y. Yu, and M. Wang, “Ownership guided C to Rust translation,” in Computer Aided Verification (CAV), vol. 13966 of LNCS, pp. 459–482, Springer, 2023.
  21. M. Emre, R. Schroeder, K. Dewey, and B. Hardekopf, “Translating C to safer Rust,” Proceedings of the ACM on Programming Languages, vol. 5, no. OOPSLA, pp. 1–29, 2021.
  22. Y. Noller, C. S. Păsăreanu, M. Böhme, Y. Sun, H. L. Nguyen, and L. Grunske, “Hydiff: Hybrid differential software analysis,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1273–1285, 2020.
  23. M. Böhme, B. C. d. S. Oliveira, and A. Roychoudhury, “Regression tests to expose change interaction errors,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pp. 334–344, 2013.
  24. H. Palikareva, T. Kuchta, and C. Cadar, “Shadow of a doubt: testing for divergences between software versions,” in Proceedings of the 38th International Conference on Software Engineering, pp. 1181–1192, 2016.
  25. S. Person, G. Yang, N. Rungta, and S. Khurshid, “Directed incremental symbolic execution,” Acm Sigplan Notices, vol. 46, no. 6, pp. 504–515, 2011.
  26. J. Guo, Y. Jiang, Y. Zhao, Q. Chen, and J. Sun, “Dlfuzz: Differential fuzzing testing of deep learning systems,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 739–743, 2018.
  27. W. Jin, A. Orso, and T. Xie, “Automated behavioral regression testing,” in 2010 Third international conference on software testing, verification and validation, pp. 137–146, IEEE, 2010.
  28. S. Nilizadeh, Y. Noller, and C. S. Pasareanu, “Diffuzz: differential fuzzing for side-channel analysis,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 176–187, IEEE, 2019.
  29. T. Petsios, A. Tang, S. Stolfo, A. D. Keromytis, and S. Jana, “Nezha: Efficient domain-independent differential testing,” in 2017 IEEE Symposium on security and privacy (SP), pp. 615–632, IEEE, 2017.
  30. W. Li, J. Ruan, G. Yi, L. Cheng, X. Luo, and H. Cai, “PolyFuzz: Holistic greybox fuzzing of Multi-Language systems,” in 32nd USENIX Security Symposium (USENIX Security 23), (Anaheim, CA), pp. 1379–1396, USENIX Association, Aug. 2023.
  31. J. J. Garzella, M. Baranowski, S. He, and Z. Rakamarić, “Leveraging compiler intermediate representation for multi- and cross-language verification,” in Verification, Model Checking, and Abstract Interpretation (D. Beyer and D. Zufferey, eds.), (Cham), pp. 90–111, Springer International Publishing, 2020.
  32. C. S. Xia, Y. Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” in ICSE, IEEE, 2023.
  33. J. Kong, M. Cheng, X. Xie, S. Liu, X. Du, and Q. Guo, “Contrastrepair: Enhancing conversation-based automated program repair via contrastive test case pairs,” arXiv preprint arXiv:2403.01971, 2024.
  34. H. W. Kuhn, “The hungarian method for the assignment problem,” in 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art (M. Jünger, T. M. Liebling, D. Naddef, G. L. Nemhauser, W. R. Pulleyblank, G. Reinelt, G. Rinaldi, and L. A. Wolsey, eds.), pp. 29–47, Springer, 2010.
  35. E. T. Bray, “The javascript object notation (json) data interchange format,” RFC 8259, RFC Editor, 12 2017.
  36. K. Serebryany, “Continuous fuzzing with libfuzzer and addresssanitizer,” in 2016 IEEE Cybersecurity Development (SecDev), pp. 157–157, 2016.
  37. C. S. Xia and L. Zhang, “Conversational automated program repair,” arXiv preprint arXiv:2301.13246, 2023.
  38. “Clippy: A bunch of lints to catch common mistakes and improve your rust code.” https://rust-lang.github.io/rust-clippy/.
  39. O. Tange, “Gnu parallel 20240122 (’frederik x’),” Jan. 2024. GNU Parallel is a general parallelizer to run multiple serial command line programs in parallel without changing them.
  40. J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  41. “Claude.” https://www.anthropic.com/index/introducing-claude.
  42. “Gemini.” https://blog.google/technology/ai/google-gemini-ai/.
  43. A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024.
  44. “Moov ach.” https://github.com/moov-io/ach.
  45. “S2 geometry library in go.” https://github.com/golang/geo.
  46. “Open source implementation of audio processing technology codec (aptx).” https://github.com/pali/libopenaptx.
  47. “Engine for making things with a ms-dos feel, but for modern platforms.” https://github.com/mattiasgustavsson/dos-like/blob/main/source/libs/opl.h.
  48. “go-gt.” https://github.com/ThePaw/go-gt.
  49. “String comparison and edit distance algorithms library.” https://github.com/hbollon/go-edlib.
  50. “2d triangulation library.” https://github.com/tchayen/triangolatte.
  51. S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “Llm is like a box of chocolates: the non-determinism of chatgpt in code generation,” 2023.
Authors (8)
  1. Hasan Ferit Eniser (8 papers)
  2. Hanliang Zhang (4 papers)
  3. Cristina David (20 papers)
  4. Meng Wang (1063 papers)
  5. Brandon Paulsen (9 papers)
  6. Joey Dodds (2 papers)
  7. Daniel Kroening (80 papers)
  8. Maria Christakis (20 papers)
Citations (7)