Towards Translating Real-World Code with LLMs: A Study of Translating to Rust (2405.11514v2)

Published 19 May 2024 in cs.SE

Abstract: LLMs show promise in code translation - the task of translating code written in one programming language to another - due to their ability to write code in most programming languages. However, LLMs' effectiveness in translating real-world code remains largely unstudied. In this work, we perform the first substantial study of LLM-based translation to Rust by assessing the ability of five state-of-the-art LLMs: GPT-4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open-source projects. To enable our study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check whether a Rust translation is I/O equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLMs' ability to produce an initially successful translation and their capacity to fix a previously generated buggy one. If the original and translated programs are not I/O equivalent, we apply a set of automated feedback strategies, including feedback to the LLM with counterexamples. Our results show that the most successful LLM can translate 47% of our benchmarks; they also provide insights into next steps for improvements.

Assessing LLMs for Translating Real-World Code to Rust

The paper "Towards Translating Real-World Code with LLMs: A Study of Translating to Rust" addresses the challenges of translating code from various programming languages to Rust using LLMs. This paper is pivotal as it shifts focus from traditional competitive programming benchmarks to the more complex and variable field of real-world code.

Methodology and Tools

The authors introduce Flourine, an end-to-end tool designed to facilitate this translation process. Flourine's primary function is to use differential fuzzing to check that translated Rust code maintains input/output equivalence with the original source code, which removes the need for pre-existing test cases. The paper evaluates five state-of-the-art LLMs: GPT-4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral.
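To make the equivalence check concrete, here is a minimal sketch of differential fuzzing in Rust. The function names, the PRNG, and the integer-slice input shape are illustrative assumptions, not Flourine's actual implementation; in practice the original program would be reached via FFI or a subprocess.

// A minimal sketch of the differential-fuzzing equivalence check, assuming
// both programs are callable from a Rust harness. reference_impl stands in
// for the original source program and translated_impl for the LLM-generated
// Rust; both bodies are placeholders and happen to agree, so this sketch
// reports no divergence.

fn reference_impl(xs: &[i64]) -> i64 {
    xs.iter().sum() // placeholder for the original program's behavior
}

fn translated_impl(xs: &[i64]) -> i64 {
    xs.iter().fold(0, |acc, x| acc + x) // placeholder for the translation
}

// Tiny deterministic PRNG so the sketch needs no external crates.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

// Fuzz random inputs; return the first input on which the two programs
// disagree (a counterexample), or None if no divergence was observed.
fn differential_fuzz(iterations: usize) -> Option<Vec<i64>> {
    let mut rng = Lcg(0xdead_beef);
    for _ in 0..iterations {
        let len = (rng.next() % 16) as usize;
        let input: Vec<i64> = (0..len).map(|_| (rng.next() % 1000) as i64).collect();
        if reference_impl(&input) != translated_impl(&input) {
            return Some(input); // fed back to the LLM as a counterexample
        }
    }
    None
}

fn main() {
    match differential_fuzz(10_000) {
        Some(cx) => println!("not I/O equivalent; counterexample: {:?}", cx),
        None => println!("no divergence found on fuzzed inputs"),
    }
}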

Evaluation and Results

The research evaluates the ability of these LLMs both to produce out-of-the-box translations and to repair translations that initially exhibit bugs. The authors apply several automatic feedback strategies, including feedback with counterexamples, to improve the success rate; a sketch of this feedback loop follows. The analysis covers 8160 translation experiments across 408 code samples sourced from diverse real-world projects, predominantly written in C and Go.
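As a hypothetical illustration of counterexample feedback, the sketch below packages a diverging input found by fuzzing into a repair prompt. The function counterexample_prompt and its wording are assumptions, not Flourine's actual prompt template.

// Hypothetical sketch: turn a fuzzing counterexample into a repair prompt
// for the LLM. The wording is illustrative, not Flourine's actual template.
fn counterexample_prompt(rust_code: &str, input: &str, expected: &str, actual: &str) -> String {
    format!(
        "The following Rust translation is incorrect.\n\n{rust_code}\n\n\
         For input {input} it returns {actual}, but the original program \
         returns {expected}. Please fix the translation so the outputs match."
    )
}

fn main() {
    // Example: a buggy translation that subtracts instead of adding.
    let prompt = counterexample_prompt(
        "fn add(a: i64, b: i64) -> i64 { a - b }",
        "(2, 3)",
        "5",
        "-1",
    );
    println!("{prompt}");
}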

Among the key findings, Claude 3 and Claude 2.1 achieve the highest success rates, translating 47% of the benchmarks, while Mixtral performs worst at about 21%. Success rates also vary significantly with code complexity, in particular the number of lines and functions in a sample.

Addressing Challenges

The paper finds that translation accuracy decreases for larger code samples, which the authors attribute to the inherent stochastic nature of LLMs: a long run of correct token predictions becomes less likely as code length grows. They propose dividing larger programs into smaller segments as a potential strategy for improving translation success rates. Furthermore, by running Clippy, Rust's linting tool, they observe that while the translations are often syntactically correct, there is room for improvement in adhering to idiomatic Rust guidelines, as illustrated below.
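As a hedged illustration of this idiomaticity gap, the invented example below shows a pattern Clippy commonly flags in literally translated code (the clippy::needless_range_loop lint) alongside the rewrite it suggests.

// A pattern Clippy flags in literal translations (clippy::needless_range_loop):
// indexing a slice with a C-style counter. Both functions are invented examples.
fn sum_non_idiomatic(xs: &[i64]) -> i64 {
    let mut total = 0;
    for i in 0..xs.len() {
        total += xs[i];
    }
    total
}

// The idiomatic rewrite Clippy nudges toward: iterate over elements directly.
fn sum_idiomatic(xs: &[i64]) -> i64 {
    xs.iter().sum()
}

fn main() {
    let data = [1, 2, 3];
    assert_eq!(sum_non_idiomatic(&data), sum_idiomatic(&data));
}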

Contrast with Rule-Based Translations

The paper also contrasts LLM-based translations with traditional rule-based translation tools like C2Rust. While rule-based tools ensure syntactic correctness, they often produce verbose and non-idiomatic code. In contrast, LLMs tend to generate more concise and idiomatic Rust code.
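The difference is easiest to see side by side. The sketch below is illustrative only, not actual C2Rust or LLM output: the first function mimics the unsafe, pointer-level Rust a rule-based transpiler tends to produce for a C strlen, while the second shows the safe, idiomatic style an LLM is more likely to generate.

// Illustrative contrast only, not actual tool output. A rule-based
// transpiler preserves C pointer semantics, yielding unsafe Rust:
unsafe fn strlen_transpiled(mut s: *const u8) -> usize {
    let mut n = 0;
    while *s != 0 {
        n += 1;
        s = s.offset(1);
    }
    n
}

// An LLM translation tends toward safe, idiomatic Rust instead:
// find the first NUL byte, or treat the whole slice as the string.
fn strlen_idiomatic(bytes: &[u8]) -> usize {
    bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len())
}

fn main() {
    let c_string = b"hello\0";
    let via_pointers = unsafe { strlen_transpiled(c_string.as_ptr()) };
    assert_eq!(via_pointers, strlen_idiomatic(c_string));
}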

Implications and Future Directions

This research has significant practical implications, particularly for developers seeking to modernize legacy code bases by translating them to safer languages like Rust. Theoretically, it opens avenues for further research in improving LLM-based code translation accuracy, especially in addressing larger and more complex code structures.

Future research could explore enhanced feedback mechanisms for LLMs to learn from counterexamples more effectively, and investigate techniques for better segmentation of code to handle complexity. With the ongoing development of LLMs, further studies might also refine the models' capabilities in understanding and generating code that closely aligns with language-specific idioms and standards.

Overall, this work is a robust exploratory study of leveraging advanced AI models for practical software engineering tasks, highlighting challenges and offering insights into potential improvements that align with current and future directions in AI-assisted programming.

References (51)
  1. “C to go translator.” https://github.com/gotranspile/cxgo.
  2. “Sharpen - automated Java to C# conversion.” https://github.com/mono/sharpen.
  3. “C2rust transpiler.” https://c2rust.com/.
  4. Z. Tang, M. Agarwal, A. Shypula, B. Wang, D. Wijaya, J. Chen, and Y. Kim, “Explain-then-translate: an analysis on improving program translation with self-generated explanations,” in Findings of the Association for Computational Linguistics: EMNLP 2023 (H. Bouamor, J. Pino, and K. Bali, eds.), (Singapore), pp. 1741–1788, Association for Computational Linguistics, Dec. 2023.
  5. B. Rozière, M. Lachaux, L. Chanussot, and G. Lample, “Unsupervised translation of programming languages,” in NeurIPS, 2020.
  6. B. Rozière, J. Zhang, F. Charton, M. Harman, G. Synnaeve, and G. Lample, “Leveraging automated unit tests for unsupervised code translation,” in ICLR, OpenReview.net, 2022.
  7. M. Szafraniec, B. Roziere, H. Leather, F. Charton, P. Labatut, and G. Synnaeve, “Code translation with compiler representations,” ICLR, 2023.
  8. R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, and R. Jabbarvand, “Lost in translation: A study of bugs introduced by large language models while translating code,” 2024.
  9. P. Jana, P. Jha, H. Ju, G. Kishore, A. Mahajan, and V. Ganesh, “Attention, compilation, and solver-based symbolic analysis are all you need,” arXiv preprint arXiv:2306.06755, 2023.
  10. R. Puri, D. S. Kung, G. Janssen, W. Zhang, G. Domeniconi, V. Zolotov, J. Dolby, J. Chen, M. Choudhury, L. Decker, et al., “Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks,” arXiv preprint arXiv:2105.12655, 2021.
  11. W. U. Ahmad, M. G. R. Tushar, S. Chakraborty, and K.-W. Chang, “Avatar: A parallel corpus for java-python program translation,” arXiv preprint arXiv:2108.11590, 2021.
  12. J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  13. M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
  14. P. Deligiannis, A. Lal, N. Mehrotra, and A. Rastogi, “Fixing rust compilation errors using llms,” arXiv preprint arXiv:2308.05177, 2023.
  15. J. Zhang, P. Nie, J. J. Li, and M. Gligoric, “Multilingual code co-evolution using large language models,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 695–707, 2023.
  16. Q. Zhang, J. Wang, G. H. Xu, and M. Kim, “Heterogen: transpiling c to heterogeneous hls code with automated test generation and program repair,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’22, (New York, NY, USA), p. 1017–1029, Association for Computing Machinery, 2022.
  17. B. Mariano, Y. Chen, Y. Feng, G. Durrett, and I. Dillig, “Automated transpilation of imperative to functional code using neural-guided program synthesis,” Proceedings of the ACM on Programming Languages, vol. 6, no. OOPSLA1, pp. 1–27, 2022.
  18. H. F. Eniser, V. Wüstholz, and M. Christakis, “Automatically testing functional properties of code translation models,” arXiv preprint arXiv:2309.12813, 2023.
  19. M. Jiao, T. Yu, X. Li, G. Qiu, X. Gu, and B. Shen, “On the evaluation of neural code translation: Taxonomy and benchmark,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1529–1541, IEEE, 2023.
  20. H. Zhang, C. David, Y. Yu, and M. Wang, “Ownership guided C to Rust translation,” in Computer Aided Verification (CAV), vol. 13966 of LNCS, pp. 459–482, Springer, 2023.
  21. M. Emre, R. Schroeder, K. Dewey, and B. Hardekopf, “Translating C to safer Rust,” Proceedings of the ACM on Programming Languages, vol. 5, no. OOPSLA, pp. 1–29, 2021.
  22. Y. Noller, C. S. Păsăreanu, M. Böhme, Y. Sun, H. L. Nguyen, and L. Grunske, “Hydiff: Hybrid differential software analysis,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1273–1285, 2020.
  23. M. Böhme, B. C. d. S. Oliveira, and A. Roychoudhury, “Regression tests to expose change interaction errors,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pp. 334–344, 2013.
  24. H. Palikareva, T. Kuchta, and C. Cadar, “Shadow of a doubt: testing for divergences between software versions,” in Proceedings of the 38th International Conference on Software Engineering, pp. 1181–1192, 2016.
  25. S. Person, G. Yang, N. Rungta, and S. Khurshid, “Directed incremental symbolic execution,” Acm Sigplan Notices, vol. 46, no. 6, pp. 504–515, 2011.
  26. J. Guo, Y. Jiang, Y. Zhao, Q. Chen, and J. Sun, “Dlfuzz: Differential fuzzing testing of deep learning systems,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 739–743, 2018.
  27. W. Jin, A. Orso, and T. Xie, “Automated behavioral regression testing,” in 2010 Third international conference on software testing, verification and validation, pp. 137–146, IEEE, 2010.
  28. S. Nilizadeh, Y. Noller, and C. S. Pasareanu, “Diffuzz: differential fuzzing for side-channel analysis,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 176–187, IEEE, 2019.
  29. T. Petsios, A. Tang, S. Stolfo, A. D. Keromytis, and S. Jana, “Nezha: Efficient domain-independent differential testing,” in 2017 IEEE Symposium on security and privacy (SP), pp. 615–632, IEEE, 2017.
  30. W. Li, J. Ruan, G. Yi, L. Cheng, X. Luo, and H. Cai, “PolyFuzz: Holistic greybox fuzzing of Multi-Language systems,” in 32nd USENIX Security Symposium (USENIX Security 23), (Anaheim, CA), pp. 1379–1396, USENIX Association, Aug. 2023.
  31. J. J. Garzella, M. Baranowski, S. He, and Z. Rakamarić, “Leveraging compiler intermediate representation for multi- and cross-language verification,” in Verification, Model Checking, and Abstract Interpretation (D. Beyer and D. Zufferey, eds.), (Cham), pp. 90–111, Springer International Publishing, 2020.
  32. C. S. Xia, Y. Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” in ICSE, IEEE, 2023.
  33. J. Kong, M. Cheng, X. Xie, S. Liu, X. Du, and Q. Guo, “Contrastrepair: Enhancing conversation-based automated program repair via contrastive test case pairs,” arXiv preprint arXiv:2403.01971, 2024.
  34. H. W. Kuhn, “The hungarian method for the assignment problem,” in 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art (M. Jünger, T. M. Liebling, D. Naddef, G. L. Nemhauser, W. R. Pulleyblank, G. Reinelt, G. Rinaldi, and L. A. Wolsey, eds.), pp. 29–47, Springer, 2010.
  35. E. T. Bray, “The javascript object notation (json) data interchange format,” RFC 8259, RFC Editor, 12 2017.
  36. K. Serebryany, “Continuous fuzzing with libfuzzer and addresssanitizer,” in 2016 IEEE Cybersecurity Development (SecDev), pp. 157–157, 2016.
  37. C. S. Xia and L. Zhang, “Conversational automated program repair,” arXiv preprint arXiv:2301.13246, 2023.
  38. “Clippy: A bunch of lints to catch common mistakes and improve your rust code.” https://rust-lang.github.io/rust-clippy/.
  39. O. Tange, “Gnu parallel 20240122 (’frederik x’),” Jan. 2024. GNU Parallel is a general parallelizer to run multiple serial command line programs in parallel without changing them.
  40. J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  41. “Claude.” https://www.anthropic.com/index/introducing-claude.
  42. “Gemini.” https://blog.google/technology/ai/google-gemini-ai/.
  43. A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024.
  44. “Moov ach.” https://github.com/moov-io/ach.
  45. “S2 geometry library in go.” https://github.com/golang/geo.
  46. “Open source implementation of audio processing technology codec (aptx).” https://github.com/pali/libopenaptx.
  47. “Engine for making things with a ms-dos feel, but for modern platforms.” https://github.com/mattiasgustavsson/dos-like/blob/main/source/libs/opl.h.
  48. “go-gt.” https://github.com/ThePaw/go-gt.
  49. “String comparison and edit distance algorithms library.” https://github.com/hbollon/go-edlib.
  50. “2d triangulation library.” https://github.com/tchayen/triangolatte.
  51. S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “Llm is like a box of chocolates: the non-determinism of chatgpt in code generation,” 2023.
Authors (8)
  1. Hasan Ferit Eniser (8 papers)
  2. Hanliang Zhang (4 papers)
  3. Cristina David (20 papers)
  4. Meng Wang (1063 papers)
  5. Brandon Paulsen (9 papers)
  6. Joey Dodds (2 papers)
  7. Daniel Kroening (80 papers)
  8. Maria Christakis (20 papers)
Citations (7)