- The paper introduces CREBench, a benchmark with 432 challenges over 48 cryptographic algorithms and three obfuscation tiers to test LLM reverse engineering capabilities.
- It employs a four-tier evaluation processโalgorithm identification, key extraction, code reimplementation, and flag recoveryโdemonstrating GPT-5.4's superior performance over other models.
- Findings reveal significant limitations in LLMs compared to human experts, emphasizing challenges in strategic planning and safe autonomy in high-obfuscation scenarios.
CREBench: A Rigorous Evaluation of LLMs in Cryptographic Binary Reverse Engineering
Problem Scope and Benchmark Design
Cryptographic binary reverse engineering (RE) is a critical operation within software and security analysis pipelines, supporting vulnerability discovery, forensic investigation, and cryptanalysis where binaries offer no source-level transparency. Manual RE is expertise-intensive, especially for cryptographic routines, motivating the study of LLM-driven automation in this high-stakes task. The "CREBench: Evaluating LLMs in Cryptographic Binary Reverse Engineering" paper (2604.03750) introduces CREBench, a comprehensive benchmark for systematically evaluating LLMs' autonomous RE capabilities targeted at cryptographic binaries.
CREBench is architected with 432 programmatic challenges. It spans 48 canonical cryptographic algorithms, each factored by three insecure key usage modes (hardcoded, fragmented, and weak pseudorandom) and three obfuscation/difficulty tiers (O0, O3, Const-XOR). This exhaustive cross-product ensures the benchmark reflects practical diversity in both cryptographic primitive and obfuscation strategy.
Figure 1: CREBench covers 432 challenges across 48 ciphers, 3 insecure key modes, and 3 reverse-engineering hardship levels, within a controlled agentic evaluation framework.
Challenges replicate the standard CTF binary RE idiom: the agent is exposed to an executable and its corresponding decompilation via Ghidra, with the goal of reconstructing the cryptographic logic and inputs necessary for successful serial validation (flag recovery). Notably, all cryptographic implementations eschew direct library imports; each was manually reimplemented and validated for functional provenance, ensuring the absence of trivial signature leakage or contamination.
Evaluation Methodology
A distinguishing feature of CREBench is its four-tiered evaluation protocol:
- Algorithm Identification: Determining the cryptographic primitive (e.g., distinguishing AES, DES, ARIA, etc.) via code structure and constants.
- Key/IV Extraction: Recovering embedded cryptographic material, potentially distributed or obfuscated.
- Wrapper-level Code Reimplementation: Producing a Python program replicating the full functional mapping of input to ciphertext.
- Flag Recovery: Precisely reconstructing the valid input that satisfies the checker, subsuming all upstream tasks.
This decomposition allows granular diagnosis, distinguishing between partial and full task competence.
Eight SOTA models are evaluated: GPT-5.4, GPT-5.4-mini, GPT-5.2, o4-mini, Gemini-2.5-Pro, Claude-Sonnet-4.6, Doubao-Seed-1.8, and MiMo-V2-Pro. Pass@3 metrics are adopted, in which the best outcome over three attempts is recorded per challenge. Human expert baseline performance is also measured; the human team used unrestricted toolchains (save for LLMs) and was time-limited.
The experimental results are striking in the separation observed between frontier LLMs, lower-tier models, and human experts. GPT-5.4 achieves an average score of 64.04/100 and a flag recovery rate of 59%, with a perfect trajectory completion rate (all four subtasks, pass@3) of 41.0%โsignificantly outperforming all competing models, yet remaining 28.15 points beneath the human baseline (92.19/100).
Figure 2: Pass@3 performance breakdown across models exhibits a clear hierarchy, with GPT-5.4 dominating lower-capacity and non-OpenAI baselines especially in semantically deep tasks.
Figure 3: GPT-5.4 achieves a 41.0% perfect completion rate, with performance dropping sharply across the evaluated models.
Task-level analysis shows that while lower-tier models (e.g., MiMo-V2-Pro, o4-mini) can sometimes address algorithm identification and key recovery, they fail to generalize to wrapper reimplementation or end-to-end flag recovery. The main bottlenecks are in wrapper-level behavioral reproduction and navigating heavy obfuscation, where static signatures are destroyed.
Agentic behavioral analysis reveals several recurring failure patterns:
Optimization and obfuscation settings are demonstrably effective. Performance degrades monotonically from O0 (unoptimized), to O3 (aggressively optimized), to Const-XOR (cryptographic signatures obfuscated at runtime):


Figure 5: Average pass@3 total scores drop consistently as binary complexity increases from O0 to O3 to Const-XOR.
Furthermore, comparison to OpenAI Codex and D-CIPHER showcases Codexโs superior agentic infrastructure, yielding higher recovery rates, but overall gains remain incremental given the underlying model's limitations.
Analysis of scoring breakdown across sub-tasks demonstrates a strong positive correlation between code reimplementation and flag recovery, but low correlation between key extraction and successful completion, due to endianness errors, confusion between key/IV, and incomplete program understanding.
Figure 6: Pairwise correlations among sub-tasks highlight a high association between code correctness and flag recovery.
Implications and Outlook
The empirical findings from CREBench (2604.03750) have substantial theoretical and practical ramifications for the field of trustworthy autonomy and agentic AI in security:
- Model Limitations: Even SOTA LLMs remain fundamentally limited in long-horizon, high-obfuscation RE workloads. In particular, cognitive deadlock, local optimum bias in algorithm hypotheses, and lack of global strategic planning in tool invocation are persistent obstacles.
- Dual-Use Security: Automated cryptographic RE presents clear dual-use risks; as LLMs approach parity with human experts, malicious exploitation potential escalates, underscoring the necessity for robust safety gating.
- Benchmark Utility: CREBench provides a definitive, scalable, and rigorous testbed not only for tracking progress but also for stress-testing safety-relevant alignment in agentic architectures.
- Obfuscation as a Roadblock: Simple obfuscation (e.g., Const-XOR) dramatically suppresses LLM performance, but CREBench does not yet incorporate extensive professional obfuscation (e.g., O-LLVM, Tigress), which is noted as a direction for future research as model context windows grow.
The demand for more advanced multi-agent orchestration, improved RE-centric reasoning chains, and deeper alignment are clear future avenues. Advancements must integrate refined agent strategies that interleave static, dynamic, and symbolic analysis with more robust hypothesis revision mechanisms and safety protocols.
Conclusion
CREBench (2604.03750) constitutes the first comprehensive cryptographic binary RE benchmark targeted at LLM agents, establishing new performance baselines and exposing substantial gaps to human-level expertise. While progress in agentic program analysis is non-trivial, the state of the art remains decisively behind skilled human practitioners, particularly under aggressive obfuscation and key fragmentation. The work enables structured, high-fidelity analysis of LLM capabilities, fosters quantitative safety evaluation in dual-use domains, and charts a concrete trajectory for future research at the intersection of AI, software security, and trustworthy autonomy.