RTCE: RoundTrip Code Evaluation Framework
- RTCE is a benchmarking framework that tests large language models’ ability to perform exact round-trip code evaluations using bijective, execution-free transformations.
- It employs round-trip consistency and bijection-based tasks on canonical compression algorithms to expose the limits of current Code-LLMs’ internal coherence.
- RTCE uses metrics such as exact match and edit similarity to quantify mechanistic reasoning, highlighting challenges like unsolved Huffman Coding inversion.
RoundTripCodeEval (RTCE) is a benchmarking framework for evaluating the mechanistic code reasoning and invertibility capacities of LLMs through execution-free, bijection-based round-trip code evaluation tasks. By explicitly assessing whether models preserve a strict one-to-one mapping between encoding and decoding operations over canonical compression algorithms, RTCE exposes the limits of current Code-LLMs’ consistency and internal coherence, surfacing phenomena that standard code generation or I/O prediction benchmarks fail to capture (Maveli et al., 19 Jan 2026).
1. Foundational Principles: Round-Trip Consistency and Bijection
RTCE is predicated on the exact recovery property inherent to bijections: for a bijective encoder E: X → C from the input domain X to the code domain C, with decoder D = E⁻¹, true round-trip consistency demands

D(E(x)) = x for all x ∈ X, and E(D(c)) = c for all c ∈ C.
RTCE evaluates whether an LLM can simulate both E and D, including in the inversion direction, without ever executing code, requiring that its predictions satisfy this property exactly, not merely up to functional or semantic equivalence.
Traditional code evaluation frameworks focus on I/O mapping or execution reasoning, constraining assessment to forward prediction. RTCE, by contrast, tests bidirectional reasoning: the ability to mentally invert transformations and verify full-cycle fidelity. This paradigm exposes superficial pattern-matching and memorization, demanding instead mechanistic understanding of algorithmic bijections.
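The round-trip property can be made concrete with a toy Run-Length Encoding pair (a minimal sketch for illustration, restricted to digit-free inputs so the mapping stays invertible; this is not RTCE's reference implementation):

```python
def rle_encode(s: str) -> str:
    """Encode runs as <char><count>, e.g. 'aaab' -> 'a3b1' (toy E)."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(f"{s[i]}{j - i}")
        i = j
    return "".join(out)

def rle_decode(c: str) -> str:
    """Invert rle_encode (toy D), assuming single-char symbols and digit counts."""
    out, i = [], 0
    while i < len(c):
        ch = c[i]
        i += 1
        j = i
        while j < len(c) and c[j].isdigit():
            j += 1
        out.append(ch * int(c[i:j]))
        i = j
    return "".join(out)

def round_trip_consistent(x: str) -> bool:
    # D(E(x)) == x: the exact-recovery property RTCE demands of the model's
    # *predictions*, never of executed code.
    return rle_decode(rle_encode(x)) == x

print(round_trip_consistent("aaabccccd"))  # True
```

RTCE asks the model to produce what these functions would return, then scores the prediction against the reference artifact.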
2. RTCE Benchmark Task Suite
The RTCE benchmark consists of four execution-reasoning tasks centered on canonical compression schemes: Run-Length Encoding (RLE), LZW, Arithmetic Encoding (AE), and Huffman Coding. For each scheme, reference encoder (E) and decoder (D) functions are provided. The four tasks are formalized as follows:
| Task | Input Provided | Function Access | Required Prediction |
|---|---|---|---|
| Output Prediction | input x | E | output c = E(x) |
| Input Prediction | code c | E (must invert E) | input x with E(x) = c |
| Output Prediction-Inv | code c | D | output x = D(c) |
| Input Prediction-Inv | input x | D (must invert D) | code c with D(c) = x |
The two Input Prediction tasks require the model to mentally invert the provided function, testing its ability to “walk back” nontrivial algorithms. All tasks are single-step; composed, they form the full round-trip x → E(x) = c → D(c) = x (Maveli et al., 19 Jan 2026).
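One way to picture the task suite is how a single (input, encoder, decoder) triple expands into four instances. The sketch below is hypothetical; the field names are illustrative, not the released schema:

```python
def make_tasks(x, encode):
    """Expand one input under a bijective encoder into RTCE's four tasks."""
    c = encode(x)
    return [
        # Forward simulation of the encoder E.
        {"task": "Output Prediction",     "given": x, "function": "E", "target": c},
        # Given E and its output, recover the input (mentally invert E).
        {"task": "Input Prediction",      "given": c, "function": "E", "target": x},
        # Forward simulation of the decoder D.
        {"task": "Output Prediction-Inv", "given": c, "function": "D", "target": x},
        # Given D and its output, recover the code (mentally invert D).
        {"task": "Input Prediction-Inv",  "given": x, "function": "D", "target": c},
    ]

# String reversal as a toy self-inverse bijection.
tasks = make_tasks("abc", lambda s: s[::-1])
print([t["task"] for t in tasks])
```

The model sees only the function source and the "given" field; the "target" field is the reference artifact used for scoring.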
3. Metrics, Evaluation, and Differentiation
RTCE employs a suite of metrics to diagnose specific failure modes and gauge bidirectional fidelity:
- Exact Match (EM): demands strict equivalence of predicted and ground-truth codes or reconstructions.
- Edit Similarity (ES): Quantifies normalized Levenshtein similarity, useful for identifying “near misses.”
- Pass@5: Probability that at least one of the top-5 sampled completions exactly matches the reference.
All evaluation is execution-free: LLMs are never permitted to run or step through code; outputs are compared to reference artifacts (strings, tuples, bit-packed values) (Maveli et al., 19 Jan 2026).
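Minimal implementations of these metrics might look as follows (a sketch; the paper's exact normalization and the unbiased pass@k estimator may differ):

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 iff the prediction is character-for-character identical."""
    return float(pred == ref)

def edit_similarity(pred: str, ref: str) -> float:
    """1 - Levenshtein(pred, ref) / max(len(pred), len(ref))."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))          # standard DP over edit distance
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (pred[i - 1] != ref[j - 1]))
        prev = cur
    return 1.0 - prev[n] / max(m, n)

def pass_at_5(samples: list, ref: str) -> float:
    """1.0 if any of up to 5 sampled completions exactly matches."""
    return float(any(s == ref for s in samples[:5]))
```

Edit similarity is what separates "near misses" (e.g., a single off-by-one in a run length) from wholesale failures that EM alone cannot distinguish.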
This differentiates RTCE from:
- I/O-prediction (forward-only) benchmarks which allow massive leeway for overfitting,
- Execution-reasoning tests, which do not probe invertibility,
- Natural-language round-trips (e.g., code → comment → code) that assess semantic but not syntactic or algorithmic fidelity.
The bijection test inherent to RTCE enforces that only perfect, mechanistic reasoning can guarantee success.
4. Experimental Setup and Quantitative Findings
A broad sweep of 20+ open-source Code-LLMs was evaluated, from 1B to 70B parameters, including Llama-3, Phi-3, Qwen2.5, CodeLlama, Mistral, and others. Three principal protocols were tested:
- Zero-Shot Prompting: Models receive the reference function and an explicit “self-consistency” instruction for reasoning.
- Self-Reflection / Multi-Turn Revision: The model critiques and edits its own outputs via two JSON-guided rounds, terminating on EM.
- Supervised Fine-Tuning: QwQ-32B was fine-tuned (LoRA, rank 8, 3 epochs) on natural language execution trace chains derived from reference solutions.
Performance Hierarchy and Error Analysis
Empirically, forward Output Prediction (given x and E, predict c = E(x)) is the easiest; Output Prediction-Inv, Input Prediction, and Input Prediction-Inv increase in difficulty, with inversion of the decoder (Input Prediction-Inv) being hardest.
- Small models: near-zero EM on all tasks.
- Mid-sized models: moderate ES on RLE (up to 50%), but nearly zero EM for LZW, Arithmetic, and Huffman.
- Larger models: up to 57% EM (RLE) and 30% (LZW), with AE lower still and 0% on Huffman.
- 70B: minor gains, but Huffman remains wholly unsolved (EM = 0%).
Self-reflection yields modest absolute gains (e.g., AE zero-shot EM 0% → two-round EM 8%) but does not close the gap on hard tasks. Fine-tuning on execution traces boosts Pass@5 (RLE: 57%→80%, LZW: 24%→78%, AE: 15%→84%) but fails to render models invertible on Huffman (Maveli et al., 19 Jan 2026).
5. Pathologies and Key Insights
RTCE surfaces new classes of mechanistic reasoning failures:
- Models largely memorize or pattern-match forward transformations, lacking the internal representations needed for reliable inversion.
- Inversion tasks reveal pronounced asymmetries: models can compress but not reconstruct, or vice versa.
- Huffman Coding, requiring tree-based reasoning and bit-level fidelity, is unsolved at all current scales, with EM = 0% even at 70B parameters.
- Self-reflection corrects superficial errors (e.g., off-by-one issues) but cannot induce the algorithmic state-tracking required for true bijection.
- Even extensive fine-tuning on step-by-step execution traces fails to guarantee round-trip consistency, pointing to a dissociation between procedural recall and algorithmic coherence.
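A toy Huffman coder makes the state-tracking burden concrete: inverting the decoder requires reconstructing which prefix-free bit pattern maps to which symbol, symbol by symbol (an illustrative sketch only, not RTCE's reference implementation):

```python
import heapq
from collections import Counter

def build_codes(text: str) -> dict:
    """Build a Huffman code table; leaves are chars, internal nodes are pairs."""
    heap = [(f, i, ch) for i, (ch, f) in enumerate(sorted(Counter(text).items()))]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (a, b)))
        i += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix or "0"   # degenerate single-symbol alphabet
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

def encode(text: str, codes: dict) -> str:
    return "".join(codes[ch] for ch in text)

def decode(bits: str, codes: dict) -> str:
    inv = {v: k for k, v in codes.items()}
    out, buf = [], ""
    for b in bits:              # prefix-free property: emit on first match
        buf += b
        if buf in inv:
            out.append(inv[buf])
            buf = ""
    return "".join(out)

codes = build_codes("abracadabra")
assert decode(encode("abracadabra", codes), codes) == "abracadabra"
```

Every emitted symbol depends on an accumulated bit buffer and the full code table, which is exactly the kind of running algorithmic state that, per the findings above, current models fail to track mentally.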
These findings reveal a fundamental deficiency in current Code-LLMs’ ability to reason mechanistically over algorithmic bijections, a gap not evident in forward-only benchmark settings (Maveli et al., 19 Jan 2026).
6. Dataset Construction, Reference Implementations, and Release Strategy
The RTCE dataset consists of 1,000 synthetic instances (250 per algorithm), spanning:
- Patterned strings (palindromes, pangrams, repetition, and noise),
- Structured logs (e.g., Apache access records),
- YAML-like configs (Helm, Docker, Ansible),
- Tabular CSV/TSV data (heterogeneous, sparse repeat patterns).
Each algorithm’s data family is stored as raw files and unified JSON Lines, with metadata, reference input-output pairs, and algorithm identifiers. Reference encoder and decoder implementations are included.
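As an illustration of the JSON Lines layout described above, one record might look like the following (field names are hypothetical, not the released schema):

```python
import json

# Hypothetical shape of one RTCE dataset record: algorithm identifier,
# data-family metadata, and a reference input/output pair.
record = {
    "algorithm": "rle",
    "family": "patterned_strings",
    "input": "aaabccccd",
    "encoded": "a3b1c4d1",
    "meta": {"length": 9, "source": "synthetic"},
}
line = json.dumps(record)
print(line)
```

Storing one self-contained record per line lets the inference harness stream instances without loading the full dataset.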
The release plan provides open access to tasks, prompt templates, an inference harness (vLLM), and fine-tuning scripts under an MIT license. Directory structure includes /data/{ae,lzw,rle,huffman}/, /prompts/, /eval/, and /finetune/ (Maveli et al., 19 Jan 2026).
7. RTCE in the Context of Broader RTC Methodology
RTCE is a rigorous extension of Round-Trip Correctness (RTC) methodology (Allamanis et al., 2024, Sharma, 2024), which generally evaluates whether forward and backward model passes (e.g., code↔natural language or patch generation↔review) preserve semantic identity. RTCE distinguishes itself by enabling execution-free, algorithm-specific, exact-match evaluation in the code domain, enforcing stricter bijection constraints than RTC formulations that rely on test-based or semantic-similarity metrics. While prior RTC work has shown high empirical correlation with pass@1 and downstream accuracy on code synthesis, RTCE demonstrates that true bidirectional code understanding is a substantially more challenging task, requiring advances beyond current scaling, prompting, or SFT strategies (Maveli et al., 19 Jan 2026, Allamanis et al., 2024, Sharma, 2024).
8. Future Directions and Open Problems
RTCE highlights open research directions:
- Developing Code-LLMs with robust mechanistic invertibility and internal consistency,
- Closing the large performance gaps on complex, tree-based or bit-level algorithms,
- Integrating execution-free bijection diagnostics as a standard metric for code understanding,
- Scaling dataset complexity and breadth (multi-module, multi-step transformations).
A plausible implication is that improvements in algorithmic state-tracking and symbolic manipulation will be necessary to achieve nonzero RTCE scores on the hardest tasks, especially Huffman Coding. RTCE thus serves as a diagnostic and developmental benchmark for trustworthy code-oriented LLMs and the broader evaluation of round-trip reasoning in complex symbolic domains (Maveli et al., 19 Jan 2026).