
Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code (2308.03109v3)

Published 6 Aug 2023 in cs.SE

Abstract: Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of LLMs in code synthesis, researchers are exploring their potential to automate code translation. The prerequisite for advancing the state of LLM-based code translation is to understand their promises and limitations over existing techniques. To that end, we present a large-scale empirical study to investigate the ability of general LLMs and code LLMs for code translation across pairs of different languages, including C, C++, Go, Java, and Python. Our study, which involves the translation of 1,700 code samples from three benchmarks and two real-world projects, reveals that LLMs are yet to be reliably used to automate code translation -- with correct translations ranging from 2.1% to 47.3% for the studied LLMs. Further manual investigation of unsuccessful translations identifies 15 categories of translation bugs. We also compare LLM-based code translation with traditional non-LLM-based approaches. Our analysis shows that these two classes of techniques have their own strengths and weaknesses. Finally, insights from our study suggest that providing more context to LLMs during translation can help them produce better results. To that end, we propose a prompt-crafting approach based on the symptoms of erroneous translations; this improves the performance of LLM-based code translation by 5.5% on average. Our study is the first of its kind, in terms of scale and breadth, that provides insights into the current limitations of LLMs in code translation and opportunities for improving them. Our dataset -- consisting of 1,700 code samples in five PLs with 10K+ tests, 43K+ translated code samples, 1,748 manually labeled bugs, and 1,365 bug-fix pairs -- can help drive research in this area.
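The evaluation loop the abstract describes (translate, run the target-language tests, then craft a follow-up prompt from the observed bug symptom) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the symptom categories, hint strings, and function names are hypothetical, and the LLM translation step is assumed to happen elsewhere.

```python
# Hypothetical sketch of test-based validation of a translated snippet,
# plus symptom-driven prompt crafting. Symptom names and hints are
# illustrative, not the paper's actual 15 bug categories.

SYMPTOM_HINTS = {
    "CompilationError": "The previous translation did not compile; fix the syntax.",
    "RuntimeError": "The previous translation crashed at runtime; check types and bounds.",
    "WrongOutput": "The previous translation produced incorrect output; re-check the logic.",
}

def run_tests(translated_code: str, tests: str):
    """Execute translated Python code together with its tests.

    Returns a coarse bug-symptom category, or None when all tests pass.
    """
    try:
        # Compile first to separate syntax errors from runtime failures.
        compile(translated_code, "<translated>", "exec")
    except SyntaxError:
        return "CompilationError"
    namespace = {}
    try:
        exec(translated_code, namespace)
        exec(tests, namespace)
    except AssertionError:
        return "WrongOutput"
    except Exception:
        return "RuntimeError"
    return None

def craft_repair_prompt(source_code: str, translated_code: str, symptom: str) -> str:
    """Build a follow-up prompt that adds symptom-specific context."""
    return (
        "Translate the following code to Python.\n"
        f"Source:\n{source_code}\n"
        f"Previous attempt:\n{translated_code}\n"
        f"Hint: {SYMPTOM_HINTS[symptom]}"
    )
```

For example, a translation that subtracts instead of adds would fail its test with the `WrongOutput` symptom, and `craft_repair_prompt` would fold that symptom's hint into the next query to the model.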

Authors (10)
  1. Rangeet Pan
  2. Ali Reza Ibrahimzada
  3. Rahul Krishna
  4. Divya Sankar
  5. Lambert Pouguem Wassi
  6. Michele Merler
  7. Boris Sobolev
  8. Raju Pavuluri
  9. Saurabh Sinha
  10. Reyhaneh Jabbarvand