RTCE: RoundTrip Code Evaluation Framework
- RTCE is a benchmarking framework that tests large language models’ ability to perform exact round-trip code evaluations using bijective, execution-free transformations.
- It employs round-trip consistency and bijection-based tasks on canonical compression algorithms to expose the limits of current Code-LLMs’ internal coherence.
- RTCE uses metrics such as exact match and edit similarity to quantify mechanistic reasoning, highlighting challenges like unsolved Huffman Coding inversion.
RoundTripCodeEval (RTCE) is a benchmarking framework for evaluating the mechanistic code reasoning and invertibility capacities of LLMs through execution-free, bijection-based round-trip code evaluation tasks. By explicitly assessing whether models preserve a strict one-to-one mapping between encoding and decoding operations over canonical compression algorithms, RTCE exposes the limits of current Code-LLMs’ consistency and internal coherence, surfacing phenomena that standard code generation or I/O prediction benchmarks fail to capture (Maveli et al., 19 Jan 2026).
1. Foundational Principles: Round-Trip Consistency and Bijection
RTCE is predicated on the exact recovery property inherent to bijections: for a bijective encoder E: X → C from the input domain X to the code domain C, with decoder D = E⁻¹, true round-trip consistency demands

D(E(x)) = x for all x ∈ X, and E(D(c)) = c for all c ∈ C.
RTCE evaluates whether an LLM can simulate both E and D, including in the inversion direction, without ever executing code, requiring that its predictions satisfy this property exactly, not merely up to functional or semantic equivalence.
Traditional code evaluation frameworks focus on I/O mapping or execution reasoning, constraining assessment to forward prediction. RTCE, by contrast, tests bidirectional reasoning: the ability to mentally invert transformations and verify full-cycle fidelity. This paradigm exposes superficial pattern-matching and memorization, demanding instead mechanistic understanding of algorithmic bijections.
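The round-trip property can be made concrete with a toy Run-Length Encoding pair (a minimal sketch for illustration, restricted to digit-free inputs so the mapping stays invertible; this is not RTCE's reference implementation):

```python
def rle_encode(s: str) -> str:
    """Encode runs as <char><count>, e.g. 'aaab' -> 'a3b1' (toy E)."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(f"{s[i]}{j - i}")
        i = j
    return "".join(out)

def rle_decode(c: str) -> str:
    """Invert rle_encode (toy D), assuming single-char symbols and digit counts."""
    out, i = [], 0
    while i < len(c):
        ch = c[i]
        i += 1
        j = i
        while j < len(c) and c[j].isdigit():
            j += 1
        out.append(ch * int(c[i:j]))
        i = j
    return "".join(out)

def round_trip_consistent(x: str) -> bool:
    # D(E(x)) == x: the exact-recovery property RTCE demands of the model's
    # *predictions*, never of executed code.
    return rle_decode(rle_encode(x)) == x

print(round_trip_consistent("aaabccccd"))  # True
```

RTCE asks the model to produce what these functions would return, then scores the prediction against the reference artifact.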
2. RTCE Benchmark Task Suite
The RTCE benchmark consists of four execution-reasoning tasks centered on canonical compression schemes: Run-Length Encoding (RLE), LZW, Arithmetic Encoding (AE), and Huffman Coding. For each scheme, reference encoder (E) and decoder (D) functions are provided. The four tasks are formalized as follows:
| Task | Input Provided | Function Access | Required Prediction |
|---|---|---|---|
| Output Prediction | input x | E | output c = E(x) |
| Input Prediction | code c | E (must invert E) | input x with E(x) = c |
| Output Prediction-Inv | code c | D | output x = D(c) |
| Input Prediction-Inv | input x | D (must invert D) | code c with D(c) = x |
The two Input Prediction tasks require the model to mentally invert the provided function, testing its ability to “walk back” nontrivial algorithms. All tasks are single-step; composed, they form the full round-trip x → E(x) = c → D(c) = x (Maveli et al., 19 Jan 2026).
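One way to picture the task suite is how a single (input, encoder, decoder) triple expands into four instances. The sketch below is hypothetical; the field names are illustrative, not the released schema:

```python
def make_tasks(x, encode):
    """Expand one input under a bijective encoder into RTCE's four tasks."""
    c = encode(x)
    return [
        # Forward simulation of the encoder E.
        {"task": "Output Prediction",     "given": x, "function": "E", "target": c},
        # Given E and its output, recover the input (mentally invert E).
        {"task": "Input Prediction",      "given": c, "function": "E", "target": x},
        # Forward simulation of the decoder D.
        {"task": "Output Prediction-Inv", "given": c, "function": "D", "target": x},
        # Given D and its output, recover the code (mentally invert D).
        {"task": "Input Prediction-Inv",  "given": x, "function": "D", "target": c},
    ]

# String reversal as a toy self-inverse bijection.
tasks = make_tasks("abc", lambda s: s[::-1])
print([t["task"] for t in tasks])
```

The model sees only the function source and the "given" field; the "target" field is the reference artifact used for scoring.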
3. Metrics, Evaluation, and Differentiation
RTCE employs a suite of metrics to diagnose specific failure modes and gauge bidirectional fidelity:
- Exact Match (EM): demands strict equivalence of predicted and ground-truth codes or reconstructions.
- Edit Similarity (ES): Quantifies normalized Levenshtein similarity, useful for identifying “near misses.”
- Pass@5: Probability that at least one of the top-5 sampled completions exactly matches the reference.
All evaluation is execution-free: LLMs are never permitted to run or step through code; outputs are compared to reference artifacts (strings, tuples, bit-packed values) (Maveli et al., 19 Jan 2026).
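Minimal implementations of these metrics might look as follows (a sketch; the paper's exact normalization and the unbiased pass@k estimator may differ):

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 iff the prediction is character-for-character identical."""
    return float(pred == ref)

def edit_similarity(pred: str, ref: str) -> float:
    """1 - Levenshtein(pred, ref) / max(len(pred), len(ref))."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))          # standard DP over edit distance
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (pred[i - 1] != ref[j - 1]))
        prev = cur
    return 1.0 - prev[n] / max(m, n)

def pass_at_5(samples: list, ref: str) -> float:
    """1.0 if any of up to 5 sampled completions exactly matches."""
    return float(any(s == ref for s in samples[:5]))
```

Edit similarity is what separates "near misses" (e.g., a single off-by-one in a run length) from wholesale failures that EM alone cannot distinguish.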
This differentiates RTCE from:
- I/O-prediction (forward-only) benchmarks which allow massive leeway for overfitting,
- Execution-reasoning tests, which do not probe invertibility,
- Natural-language round-trips (e.g., code → comment → code) that assess semantic but not syntactic or algorithmic fidelity.
The bijection test inherent to RTCE enforces that only perfect, mechanistic reasoning can guarantee success.
4. Experimental Setup and Quantitative Findings
A broad sweep of 20+ open-source Code-LLMs was evaluated, from 1B to 70B parameters, including Llama-3, Phi-3, Qwen2.5, CodeLlama, Mistral, and others. Three principal protocols were tested:
- Zero-Shot Prompting: Models receive the reference function and an explicit “self-consistency” instruction for reasoning.
- Self-Reflection / Multi-Turn Revision: The model critiques and edits its own outputs via two JSON-guided rounds, terminating on EM.
- Supervised Fine-Tuning: QwQ-32B was fine-tuned (LoRA, rank 8, 3 epochs) on natural language execution trace chains derived from reference solutions.
Performance Hierarchy and Error Analysis
Empirically, forward Output Prediction (given x and E, predict c = E(x)) is the easiest; Output Prediction-Inv, Input Prediction, and Input Prediction-Inv increase in difficulty, with inversion of the decoder (Input Prediction-Inv) being hardest.
- Small models: near-zero EM on all tasks.
- Mid-sized models: moderate ES on RLE (up to 50%), but nearly zero EM for LZW, Arithmetic, and Huffman.
- Larger models: up to 57% EM (RLE) and 30% (LZW), with AE lower still and 0% on Huffman.
- 70B: minor gains, but Huffman remains wholly unsolved (EM = 0%).
Self-reflection yields modest absolute gains (e.g., AE zero-shot EM 0% → two-round EM 8%) but does not close the gap on hard tasks. Fine-tuning on execution traces boosts Pass@5 (RLE: 57%→80%, LZW: 24%→78%, AE: 15%→84%) but fails to render models invertible on Huffman (Maveli et al., 19 Jan 2026).
5. Pathologies and Key Insights
RTCE surfaces new classes of mechanistic reasoning failures:
- Models largely memorize or pattern-match forward transformations, lacking the internal representations needed for reliable inversion.
- Inversion tasks reveal pronounced asymmetries: models can compress but not reconstruct, or vice versa.
- Huffman Coding, requiring tree-based reasoning and bit-level fidelity, is unsolved at all current scales, with EM = 0% even at 70B parameters.
- Self-reflection corrects superficial errors (e.g., off-by-one issues) but cannot induce the algorithmic state-tracking required for true bijection.
- Even extensive fine-tuning on step-by-step execution traces fails to guarantee round-trip consistency, pointing to a dissociation between procedural recall and algorithmic coherence.
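A toy Huffman coder makes the state-tracking burden concrete: inverting the decoder requires reconstructing which prefix-free bit pattern maps to which symbol, symbol by symbol (an illustrative sketch only, not RTCE's reference implementation):

```python
import heapq
from collections import Counter

def build_codes(text: str) -> dict:
    """Build a Huffman code table; leaves are chars, internal nodes are pairs."""
    heap = [(f, i, ch) for i, (ch, f) in enumerate(sorted(Counter(text).items()))]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (a, b)))
        i += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix or "0"   # degenerate single-symbol alphabet
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

def encode(text: str, codes: dict) -> str:
    return "".join(codes[ch] for ch in text)

def decode(bits: str, codes: dict) -> str:
    inv = {v: k for k, v in codes.items()}
    out, buf = [], ""
    for b in bits:              # prefix-free property: emit on first match
        buf += b
        if buf in inv:
            out.append(inv[buf])
            buf = ""
    return "".join(out)

codes = build_codes("abracadabra")
assert decode(encode("abracadabra", codes), codes) == "abracadabra"
```

Every emitted symbol depends on an accumulated bit buffer and the full code table, which is exactly the kind of running algorithmic state that, per the findings above, current models fail to track mentally.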
These findings reveal a fundamental deficiency in current Code-LLMs’ ability to reason mechanistically over algorithmic bijections, a gap not evident in forward-only benchmark settings (Maveli et al., 19 Jan 2026).
6. Dataset Construction, Reference Implementations, and Release Strategy
The RTCE dataset consists of 1,000 synthetic instances (250 per algorithm), spanning:
- Patterned strings (palindromes, pangrams, repetition, and noise),
- Structured logs (e.g., Apache access records),
- YAML-like configs (Helm, Docker, Ansible),
- Tabular CSV/TSV data (heterogeneous, sparse repeat patterns).
Each algorithm’s data family is stored as raw files and unified JSON Lines, with metadata, reference input-output pairs, and algorithm identifiers. Reference encoder and decoder implementations are included.
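As an illustration of the JSON Lines layout described above, one record might look like the following (field names are hypothetical, not the released schema):

```python
import json

# Hypothetical shape of one RTCE dataset record: algorithm identifier,
# data-family metadata, and a reference input/output pair.
record = {
    "algorithm": "rle",
    "family": "patterned_strings",
    "input": "aaabccccd",
    "encoded": "a3b1c4d1",
    "meta": {"length": 9, "source": "synthetic"},
}
line = json.dumps(record)
print(line)
```

Storing one self-contained record per line lets the inference harness stream instances without loading the full dataset.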
The release plan provides open access to tasks, prompt templates, an inference harness (vLLM), and fine-tuning scripts under an MIT license. Directory structure includes /data/{ae,lzw,rle,huffman}/, /prompts/, /eval/, and /finetune/ (Maveli et al., 19 Jan 2026).
7. RTCE in the Context of Broader RTC Methodology
RTCE is a rigorous extension of Round-Trip Correctness (RTC) methodology (Allamanis et al., 2024, Sharma, 2024), which generally evaluates whether forward and backward model passes (e.g., code↔natural language or patch generation↔review) preserve semantic identity. RTCE distinguishes itself by enabling execution-free, algorithm-specific, exact-match evaluation in the code domain, enforcing stricter bijection constraints than RTC formulations that rely on test-based or semantic-similarity metrics. While prior RTC work has shown high empirical correlation with pass@1 and downstream accuracy on code synthesis, RTCE demonstrates that true bidirectional code understanding is a substantially more challenging task, requiring advances beyond current scaling, prompting, or SFT strategies (Maveli et al., 19 Jan 2026, Allamanis et al., 2024, Sharma, 2024).
8. Future Directions and Open Problems
RTCE highlights open research directions:
- Developing Code-LLMs with robust mechanistic invertibility and internal consistency,
- Closing the large performance gaps on complex, tree-based or bit-level algorithms,
- Integrating execution-free bijection diagnostics as a standard metric for code understanding,
- Scaling dataset complexity and breadth (multi-module, multi-step transformations).
A plausible implication is that improvements in algorithmic state-tracking and symbolic manipulation will be necessary to achieve nonzero RTCE scores on the hardest tasks, especially Huffman Coding. RTCE thus serves as a diagnostic and developmental benchmark for trustworthy code-oriented LLMs and the broader evaluation of round-trip reasoning in complex symbolic domains (Maveli et al., 19 Jan 2026).