CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation (2504.15254v1)

Published 21 Apr 2025 in cs.SE, cs.CL, and cs.LG

Abstract: C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually-written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The provided Rust interfaces provide explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art LLMs on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at https://github.com/anirudhkhatry/CRUST-bench.

Summary

An Analysis of CRUST-Bench: A Benchmark for C-to-Rust Transpilation

The paper “CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation” introduces a new benchmark designed to evaluate the ability of systems to transpile legacy C code into safe and idiomatic Rust. The benchmark addresses a significant gap in existing resources, providing a robust dataset aimed specifically at testing the conversion of complex, real-world C projects into Rust, a language renowned for its memory safety guarantees. Through comprehensive evaluation using various models, the paper also provides insights into the current state and limitations of automated code translation techniques.

The authors present CRUST-Bench as a meticulously curated dataset comprising 100 open-source C repositories. Each repository is not merely a collection of isolated functions but a complete codebase, thereby capturing the intricacies and challenges inherent in translating multi-file projects with cross-file dependencies. These repositories are complemented by manually crafted Rust interfaces that represent safe and idiomatic adaptations of the original C signatures and types, along with a suite of test cases designed to ensure that the translated Rust code is functionally equivalent to its C counterpart.
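
To make the task format concrete, the following is a minimal, hypothetical sketch of such a pairing. The `Stack` type, its methods, and the test are invented for illustration and are not taken from the benchmark; in CRUST-Bench, the interface signatures and tests are provided, and the system under evaluation must supply safe Rust implementations that pass them.

```rust
// Illustrative sketch of a CRUST-Bench-style task (names invented for
// exposition). The original C repository might expose something like:
//
//     int stack_push(Stack *s, int value);
//     int stack_pop(Stack *s, int *out);
//
// The hand-written interface specifies safe, idiomatic Rust signatures, and
// the accompanying test is what a candidate transpilation must pass.

pub struct Stack {
    items: Vec<i32>,
}

impl Stack {
    pub fn new() -> Self {
        Stack { items: Vec::new() }
    }

    /// Safe counterpart of the C `stack_push`: ownership replaces the raw pointer.
    pub fn push(&mut self, value: i32) {
        self.items.push(value);
    }

    /// Safe counterpart of the C `stack_pop`: `Option` replaces the out-parameter.
    pub fn pop(&mut self) -> Option<i32> {
        self.items.pop()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn push_then_pop_round_trips() {
        let mut s = Stack::new();
        s.push(42);
        assert_eq!(s.pop(), Some(42));
        assert_eq!(s.pop(), None);
    }
}
```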

One of the critical contributions of this benchmark is its focus on memory safety, an essential aspect of Rust that eliminates many common vulnerabilities found in C. The interfaces provided within CRUST-Bench are engineered to enforce Rust's ownership model, guiding transpilers towards generating code that adheres to Rust's strict safety guarantees. This contrasts with prior benchmarks that permit the generation of unsafe Rust code, often failing to exploit Rust's full potential for safety.
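
A small, purely illustrative example of how such an interface steers the output toward safe Rust (the `scale` function below is invented, not part of the benchmark): a C routine that writes through a raw pointer and a separate length parameter is specified as a function over a mutable slice, so bounds and aliasing are enforced by the compiler rather than by caller discipline, and no `unsafe` block is needed.

```rust
// Hypothetical mapping: the C signature
//
//     void scale(double *xs, size_t n, double k);
//
// becomes a safe function over a mutable slice in the Rust interface.
pub fn scale(xs: &mut [f64], k: f64) {
    for x in xs.iter_mut() {
        *x *= k;
    }
}

fn main() {
    let mut v = vec![1.0, 2.0, 3.0];
    scale(&mut v, 2.0);
    assert_eq!(v, vec![2.0, 4.0, 6.0]);
}
```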

In evaluating existing code generation models, the authors test several state-of-the-art LLMs, including OpenAI's GPT-4o and Claude 3.7 Sonnet. The results indicate that transpiling code from C to safe Rust remains a formidable challenge: the best-performing model, OpenAI o1, solves only 15 of the 100 tasks in a single-shot setting. This starkly highlights the difficulty of the task, particularly given the stringent requirement that the generated Rust stay within the language's safe subset.

The paper also explores iterative repair approaches, wherein models attempt to correct errors based on compiler feedback. The two repair strategies—Compiler repair and Test repair—each show improvements over single-shot performance, although they introduce their own challenges. Compiler repair, which focuses solely on resolving compilation issues, significantly boosts compilation success rates, while Test repair, which incorporates feedback from test failures, yields further gains in functional correctness. However, Test repair demonstrates a trade-off between fixing logical errors and potentially degrading build stability.
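
As a rough sketch of how a compiler-feedback loop of this kind can be driven (this is not the authors' released harness; the `cargo` invocation and the stubbed model call are assumptions made for illustration):

```rust
use std::process::Command;

/// Placeholder for the model call: a real harness would send the current Rust
/// sources and the diagnostics to an LLM and return revised sources. Here it
/// returns the input unchanged so the sketch compiles on its own.
fn propose_fix(sources: String, _diagnostics: &str) -> String {
    sources
}

/// Minimal compiler-repair loop, assuming a cargo project on disk: rebuild,
/// and if compilation fails, hand the diagnostics back to the model for
/// another attempt. Test repair would extend the same loop with the output of
/// `cargo test` once the code compiles.
fn compiler_repair(mut sources: String, manifest_path: &str, max_rounds: usize) -> String {
    for _ in 0..max_rounds {
        let out = Command::new("cargo")
            .args(["build", "--manifest-path", manifest_path])
            .output()
            .expect("failed to invoke cargo");
        if out.status.success() {
            break; // candidate compiles; stop repairing
        }
        let diagnostics = String::from_utf8_lossy(&out.stderr).into_owned();
        sources = propose_fix(sources, &diagnostics);
        // Writing `sources` back to the project before the next build is elided.
    }
    sources
}

fn main() {
    let fixed = compiler_repair(String::from("// candidate Rust sources"), "Cargo.toml", 3);
    println!("{fixed}");
}
```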

A key observation from the paper emerges from its error analysis. Even sophisticated models frequently struggle with Rust's borrowing semantics and with type mismatches, difficulties compounded by Rust's requirement for precise reasoning about ownership and memory safety. These findings underscore the potential benefits of integrating static analysis techniques and Rust-specific domain knowledge into model training and fine-tuning, which could lead to better handling of these complex language features.
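
A generic illustration of the kind of borrowing pitfall a too-literal translation runs into (this pattern is not drawn from the paper's error data): C freely allows a buffer to be read through one pointer while being mutated through another, whereas the borrow checker rejects the equivalent Rust unless the borrow is ended first.

```rust
// A pointer-style translation that keeps a reference into `buf` alive across
// a mutation is rejected by rustc (error[E0502]); copying the value out first
// compiles cleanly.
fn append_and_report(buf: &mut Vec<u8>) {
    // let first = &buf[0]; buf.push(0); println!("{first}"); // rejected by rustc
    let first = buf[0]; // take a copy instead of keeping a borrow alive
    buf.push(0);
    println!("first byte was {first}");
}

fn main() {
    let mut buf = vec![1, 2, 3];
    append_and_report(&mut buf);
}
```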

From a practical standpoint, the implications of this research are significant for developers aiming to migrate legacy C systems to Rust. Reliable automated transpilation tools empowered by benchmarks like CRUST-Bench could reduce the overhead of code migration while enhancing software security and maintainability. Moreover, the benchmark could serve as a catalyst for further research into safer code transpilation, encouraging the development of more advanced models that can handle the nuanced aspects of both languages.

In conclusion, the introduction of CRUST-Bench represents a meaningful step forward in the evaluation of C-to-Rust transpilation systems. By establishing a rigorous testing ground that mirrors realistic development scenarios, the authors provide valuable insights into the current capabilities and limitations of automated translation models. The benchmark sets a new standard for future research in this area, heralding advancements in tools that could streamline the transition from C to Rust, harnessing the latter's strengths for safer and more robust software development.
