
Re-Executability Rate (R_exec)

Updated 28 February 2026
  • Re-Executability Rate (R_exec) is a metric that measures the percentage of artifacts—such as binaries or code snippets—that correctly compile, run, and preserve their intended semantics.
  • It relies on rigorous experimental protocols involving runtime tests, compiler checks, and success criteria tailored to domains like adversarial malware, code translation, and decompilation.
  • Empirical studies demonstrate R_exec’s value in guiding model improvements and ensuring operational validity across diverse environments and optimization settings.

The Re-Executability Rate ($R_{exec}$) quantifies the fraction of artifacts—binaries, code snippets, or decompiled functions—produced in a transformation or synthesis process that not only compile or launch but also execute correctly and preserve required semantics. $R_{exec}$ has become an integral metric across domains such as adversarial malware modification, code translation by LLMs, analysis of code snippet quality, and neural-guided decompilation. It measures not just syntactic plausibility but real-world viability, acting as a bridge between theoretical advances and operational utility.

1. Formal Definitions and Mathematical Formulations

Across domains, $R_{exec}$ is defined as the ratio between the count of executable, semantically correct outputs and the total number produced:

$$R_{exec} = \frac{|\mathcal{O}_{exec}|}{|\mathcal{O}_{total}|}$$

where $|\mathcal{O}_{total}|$ is the total number of outputs (e.g., binaries, code translations, decompilations) and $|\mathcal{O}_{exec}|$ is the subset that (i) launches (e.g., compiles or runs without crashing), and (ii) achieves the target functional behavior (e.g., passes all test cases or retains the original malicious payload).

Specific instantiations include:

  • Adversarial Malware Binaries: $\mathcal{B}_{adv}$ is the set of all adversarial binaries generated, and $\mathcal{B}_{exec} \subseteq \mathcal{B}_{adv}$ the subset remaining operational, so $R_{exec} = \frac{|\mathcal{B}_{exec}|}{|\mathcal{B}_{adv}|}$ (Benkraouda et al., 2021).
  • Code Translation and Decompilation: For LLM-generated code, $R_{exec}$ is the proportion of outputs that compile and pass a comprehensive test suite, often written as $R_{exec} = \frac{N_{success}}{N}$, where $N$ is the number of translation/decompilation targets (He et al., 30 Jan 2025, Wang et al., 3 Nov 2025).
  • Code Snippet Executability: For mined code snippets, $R_{exec}$ reflects the percentage that run to completion in at least one environment variant (e.g., Python 2.7 or 3.7) without errors (Hossain et al., 2019).

All empirical studies instrument the indicator function $R_{exec} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\bigl[\mathrm{success}(i)\bigr]$, with domain-specific criteria for “success”.
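The indicator-function form above can be computed directly from per-artifact outcomes. A minimal sketch in Python; the `r_exec` helper and the two-flag encoding of "launches" and "behaves correctly" are illustrative assumptions, not any paper's actual tooling:

```python
# Minimal sketch of the indicator-function form of R_exec.
# Each outcome records whether the artifact launched and whether it
# preserved the required behavior (e.g., passed all test cases).

def r_exec(outcomes):
    """Fraction of artifacts that both run and behave correctly."""
    if not outcomes:
        return 0.0
    successes = sum(1 for launched, correct in outcomes if launched and correct)
    return successes / len(outcomes)

# Example: 5 artifacts; 3 launch and pass, 1 launches but fails its tests,
# 1 crashes outright.
outcomes = [(True, True), (True, True), (True, True), (True, False), (False, False)]
print(r_exec(outcomes))  # 0.6
```

Note that an artifact that launches but fails its tests counts as a failure, which is exactly what separates $R_{exec}$ from a mere "compiles/runs" rate.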

2. Measurement Protocols and Experimental Workflows

$R_{exec}$'s robustness as a metric is grounded in rigorous experimental procedures, typically requiring actual execution: mere syntactic analysis or static checks are insufficient.

Key procedures include:

  • Adversarial Binary Evaluation (Benkraouda et al., 2021): Each rewritten executable is run in a Windows testbed. A binary is classified as re-executable only if (a) it starts and does not crash, (b) it exhibits original malicious functionality. High-throughput, automated harnesses ensure scale and consistency.
  • Code Generation Benchmarks (He et al., 30 Jan 2025, Wang et al., 3 Nov 2025): For each candidate, the artifact must compile under standard toolchains (e.g., g++, javac, CPython) and pass all provided test cases. Multiple language pairs (C++⇄Java, etc.) and environments are covered, with success requiring full test suite compliance.
  • Large-Scale Code Snippet Analysis (Hossain et al., 2019): For each snippet, execution is attempted under both Python 2.7 and 3.7 inside Dockerized containers endowed with the 40 most common libraries. Outcomes are classified via return codes, with auto-install of missing dependencies. Timeouts and explicit error captures ensure fidelity.
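A drastically simplified harness in this spirit might classify each snippet by its exit status. The function name, timeout value, and return-code classification below are illustrative assumptions, not the papers' actual tooling:

```python
import subprocess
import sys

def classify_snippet(path, timeout_s=10):
    """Run one snippet and classify the outcome by return code.

    Returns 'success', 'error', or 'timeout'. A real harness would also
    sandbox the run (e.g., in a Docker container with pinned library
    versions) and capture stderr for finer-grained error categorization.
    """
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "success" if result.returncode == 0 else "error"
```

Running the classifier over a corpus and averaging `result == "success"` yields the observed $R_{exec}$ for that environment.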

Empirical studies emphasize that $R_{exec}$ is a direct measurement, not an inference from match-based or static metrics.

3. Empirical Results Across Domains

Empirical $R_{exec}$ values vary significantly with artifact type, task complexity, and environmental constraints. Key reported figures:

| Domain | Dataset/Family | $R_{exec}$ | Reference |
|---|---|---|---|
| Malware adversarial binary attacks | Dialplatform.B | 98.9% | (Benkraouda et al., 2021) |
| Malware adversarial binary attacks | Lolyda.AA3 | 81.8% | (Benkraouda et al., 2021) |
| LLM code translation (ExeCoder, avg.) | TransCoder-test-X | 83.0% | (He et al., 30 Jan 2025) |
| LLM code translation (GPT-4, closed-source) | TransCoder-test-X | 81.6% | (He et al., 30 Jan 2025) |
| Neural-guided decompilation (ICL4D-R) | HumanEval-Decompile | 54.3% (O0) | (Wang et al., 3 Nov 2025) |
| Neural-guided decompilation (ICL4D-R) | ExeBench | 36.2% (O2) | (Wang et al., 3 Nov 2025) |
| Stack Overflow Python snippets | SOTorrent (all) | 27.9% | (Hossain et al., 2019) |

Key trends:

  • Malware Modification: High $R_{exec}$ ($\sim 99\%$) when domain knowledge (e.g., NOP-injection at safe boundaries) is harnessed. Larger binaries (complexity, optimizer challenges) reduce $R_{exec}$ slightly.
  • Code Generation: Advanced LLMs such as ExeCoder achieve robust $R_{exec}$, consistently outperforming closed-source competitors on execution-based rather than match-based evaluation; enriching input representations boosts $R_{exec}$ further.
  • Decompilation: Even state-of-the-art LLMs struggle at high optimization levels, but retrieval-augmented in-context learning leads to 40% relative improvements over prior baselines.
  • Web-Mined Code: Less than 30% of StackOverflow Python snippets execute out-of-the-box; executability is higher for code referenced from GitHub.

4. Relationship to Other Quality Metrics and Limitations

$R_{exec}$ is distinct from metrics based on syntactic form (BLEU, CodeBLEU, Exact Match):

  • Correlation: Weak to moderate correlation with CodeBLEU (Pearson ≈ 0.4–0.6) and negligible with Exact Match (He et al., 30 Jan 2025). High match-score predictions may not compile or run.
  • Semantic Fidelity: $R_{exec}$ enforces both syntactic and behavioral requirements; code that compiles but fails tests is penalized.
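The weak coupling between match-based scores and executability can be checked directly with a standard Pearson correlation. The per-sample scores below are made up purely for illustration:

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (fabricated) per-sample CodeBLEU scores vs. binary
# execution outcomes: high match scores do not guarantee execution.
codebleu = [0.91, 0.85, 0.40, 0.77, 0.30, 0.88]
executed = [1, 0, 0, 1, 0, 1]
print(round(pearson(codebleu, executed), 3))
```

With a binary execution outcome this is the point-biserial special case of Pearson; a moderate value on such data mirrors the 0.4–0.6 range reported against CodeBLEU.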

Limitations:

  • Test Environment Scope: Results depend on the rigor and realism of the test harness (OS version, compiler flags, anti-tamper defenses).
  • Timeouts and Resource Limits: Strict time or memory limits may undercount viable outputs.
  • Narrow Input Conditions: For malware, results do not cover advanced dynamic analyses; for code, library versioning and edge-case dependency coverage may exclude some practical scenarios.
  • Specification Adherence: Passing all test cases is necessary but not always sufficient for full correctness; overfitting to observed inputs can occur.

A plausible implication is that reported $R_{exec}$ values are best interpreted relative to their task specification and test protocol, with generalization beyond the study contingent on environmental similarity.

5. Methodological Advances in Raising $R_{exec}$

Multiple strategies to improve $R_{exec}$ have been validated:

  • Augmented Input Representations: In code translation, feeding models functional summaries, abstract syntax trees (ASTs), and data-flow graphs (DFGs) as auxiliary signals incrementally increases $R_{exec}$ (+2.3 percentage points aggregate over code-only baselines) (He et al., 30 Jan 2025).
  • Progressive Curriculum Fine-Tuning: Stagewise learning, where each stage adds richer execution-related features, enables LLMs to align completions with operational semantics.
  • In-Context Example Retrieval: In neural decompilation, exposing LLMs to semantically similar, previously decompiled exemplars helps reverse optimized compiler transformations, improving robustness at optimization levels O1–O3 (Wang et al., 3 Nov 2025).
  • Attack-Site Selection in Adversarial Binaries: Limiting perturbations to NOP-equivalent instruction boundaries ensures executability is preserved even under heavy image-space modifications (Benkraouda et al., 2021).
  • Environmental Control and Dependency Management: For code snippet analysis, pre-installing popular libraries and retrying after pip install raises observable $R_{exec}$.
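The last strategy, retrying after installing a missing dependency, can be sketched as follows. The single-retry policy and the regex over stderr are simplifying assumptions for illustration:

```python
import re
import subprocess
import sys

# Matches CPython's ModuleNotFoundError message, e.g. "No module named 'requests'".
MISSING_RE = re.compile(r"No module named '([\w\.]+)'")

def run_with_dependency_retry(path, timeout_s=30):
    """Run a snippet; on a missing-module error, pip-install it once and retry.

    A real harness would do this inside an isolated container, since
    installing arbitrary packages on the host is unsafe.
    """
    for attempt in range(2):
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout_s
        )
        if result.returncode == 0:
            return True
        match = MISSING_RE.search(result.stderr)
        if match is None or attempt == 1:
            return False  # non-import failure, or the install did not help
        subprocess.run(
            [sys.executable, "-m", "pip", "install", match.group(1)],
            capture_output=True,
        )
    return False
```

Measuring $R_{exec}$ before and after enabling such a retry loop separates genuine code defects from mere environment gaps.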

A shared insight across studies is the value of explicitly modeling or injecting executability constraints, rather than relying solely on semantic plausibility.

6. Significance and Open Challenges

$R_{exec}$ is increasingly regarded as the definitive metric for operational validity in generated code, malware modification, and reverse engineering. Principal findings include:

  • Practicability: High $R_{exec}$ supports claims of real-world viability; in attacks, it denotes threat persistence, while in synthesis, it captures deployability.
  • Drift and Stasis: Longitudinal analyses show stable $R_{exec}$ in Stack Overflow code snippets over a decade, with subtle shifts toward Python 3 compatibility (Hossain et al., 2019).
  • Dataset Bias: Presence of GitHub references is statistically associated with a 7.2 percentage point boost in snippet executability, while accepted answer status is not predictive (Hossain et al., 2019).
  • Robustness across Optimizations: In decompilation, in-context methods maintain higher $R_{exec}$ under aggressive compiler optimizations relative to purely generative or rule-based baselines (Wang et al., 3 Nov 2025).
  • Research Direction: For LLMs, future models should incorporate explicit executability representations and progressive learning protocols to sustain and enhance $R_{exec}$ (He et al., 30 Jan 2025).
  • Open Challenge: Translating high RexecR_{exec} in controlled benchmarks to diverse, adversarial, or resource-constrained production environments remains an open challenge.

In summary, the Re-Executability Rate ($R_{exec}$) offers a unifying framework for gauging not just whether generated or transformed artifacts look plausible, but whether they truly function as intended when deployed—a crucial benchmark for both system security and program synthesis research (Benkraouda et al., 2021, He et al., 30 Jan 2025, Hossain et al., 2019, Wang et al., 3 Nov 2025).
