CREBench: Evaluating Large Language Models in Cryptographic Binary Reverse Engineering

Published 4 Apr 2026 in cs.CR, cs.AI, and cs.CL | (2604.03750v1)

Abstract: Reverse engineering (RE) is central to software security, particularly for cryptographic programs that handle sensitive data and are highly prone to vulnerabilities. It supports critical tasks such as vulnerability discovery and malware analysis. Despite its importance, RE remains labor-intensive and requires substantial expertise, making LLMs a potential solution for automating the process. However, their capabilities for RE remain systematically underexplored. To address this gap, we study the cryptographic binary RE capabilities of LLMs and introduce \textbf{CREBench}, a benchmark comprising 432 challenges built from 48 standard cryptographic algorithms, 3 insecure crypto key usage scenarios, and 3 difficulty levels. Each challenge follows a Capture-the-Flag (CTF) RE challenge, requiring the model to analyze the underlying cryptographic logic and recover the correct input. We design an evaluation framework comprising four sub-tasks, from algorithm identification to correct flag recovery. We evaluate eight frontier LLMs on CREBench. GPT-5.4, the best-performing model, achieves 64.03 out of 100 and recovers the flag in 59\% of challenges. We also establish a strong human expert baseline of 92.19 points, showing that humans maintain an advantage in cryptographic RE tasks. Our code and dataset are available at https://github.com/wangyu-ovo/CREBench.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper introduces CREBench, a benchmark with 432 challenges over 48 cryptographic algorithms and three obfuscation tiers to test LLM reverse engineering capabilities.
It employs a four-tier evaluation process—algorithm identification, key extraction, code reimplementation, and flag recovery—demonstrating GPT-5.4's superior performance over other models.
Findings reveal significant limitations in LLMs compared to human experts, emphasizing challenges in strategic planning and safe autonomy in high-obfuscation scenarios.

CREBench: A Rigorous Evaluation of LLMs in Cryptographic Binary Reverse Engineering

Problem Scope and Benchmark Design

Cryptographic binary reverse engineering (RE) is a critical operation within software and security analysis pipelines, supporting vulnerability discovery, forensic investigation, and cryptanalysis where binaries offer no source-level transparency. Manual RE is expertise-intensive, especially for cryptographic routines, motivating the study of LLM-driven automation in this high-stakes task. The "CREBench: Evaluating LLMs in Cryptographic Binary Reverse Engineering" paper (2604.03750) introduces CREBench, a comprehensive benchmark for systematically evaluating LLMs' autonomous RE capabilities targeted at cryptographic binaries.

CREBench is architected with 432 programmatic challenges. It spans 48 canonical cryptographic algorithms, each factored by three insecure key usage modes (hardcoded, fragmented, and weak pseudorandom) and three obfuscation/difficulty tiers (O0, O3, Const-XOR). This exhaustive cross-product ensures the benchmark reflects practical diversity in both cryptographic primitive and obfuscation strategy.

Figure 1: CREBench covers 432 challenges across 48 ciphers, 3 insecure key modes, and 3 reverse-engineering hardship levels, within a controlled agentic evaluation framework.

Challenges replicate the standard CTF binary RE idiom: the agent is exposed to an executable and its corresponding decompilation via Ghidra, with the goal of reconstructing the cryptographic logic and inputs necessary for successful serial validation (flag recovery). Notably, all cryptographic implementations eschew direct library imports; each was manually reimplemented and validated for functional provenance, ensuring the absence of trivial signature leakage or contamination.

Evaluation Methodology

A distinguishing feature of CREBench is its four-tiered evaluation protocol:

Algorithm Identification: Determining the cryptographic primitive (e.g., distinguishing AES, DES, ARIA, etc.) via code structure and constants.
Key/IV Extraction: Recovering embedded cryptographic material, potentially distributed or obfuscated.
Wrapper-level Code Reimplementation: Producing a Python program replicating the full functional mapping of input to ciphertext.
Flag Recovery: Precisely reconstructing the valid input that satisfies the checker, subsuming all upstream tasks.

This decomposition allows granular diagnosis, distinguishing between partial and full task competence.

Eight SOTA models are evaluated: GPT-5.4, GPT-5.4-mini, GPT-5.2, o4-mini, Gemini-2.5-Pro, Claude-Sonnet-4.6, Doubao-Seed-1.8, and MiMo-V2-Pro. Pass@3 metrics are adopted, in which the best outcome over three attempts is recorded per challenge. Human expert baseline performance is also measured; the human team used unrestricted toolchains (save for LLMs) and was time-limited.

Results: Model Performance and Analysis

The experimental results are striking in the separation observed between frontier LLMs, lower-tier models, and human experts. GPT-5.4 achieves an average score of 64.04/100 and a flag recovery rate of 59%, with a perfect trajectory completion rate (all four subtasks, pass@3) of 41.0%—significantly outperforming all competing models, yet remaining 28.15 points beneath the human baseline (92.19/100).

Figure 2: Pass@3 performance breakdown across models exhibits a clear hierarchy, with GPT-5.4 dominating lower-capacity and non-OpenAI baselines especially in semantically deep tasks.

Figure 3: GPT-5.4 achieves a 41.0% perfect completion rate, with performance dropping sharply across the evaluated models.

Task-level analysis shows that while lower-tier models (e.g., MiMo-V2-Pro, o4-mini) can sometimes address algorithm identification and key recovery, they fail to generalize to wrapper reimplementation or end-to-end flag recovery. The main bottlenecks are in wrapper-level behavioral reproduction and navigating heavy obfuscation, where static signatures are destroyed.

Agentic behavioral analysis reveals several recurring failure patterns:

Prototype Bias: Even advanced models overfit on familiar algorithmic prototypes (e.g., AES, DES) when structural matches are ambiguous or constants are hidden, causing erroneous overprediction and propagation of incorrect hypotheses.
Dynamic Analysis Deadlock: Models often excessively rely on GDB in failed runs, indicating a lack of strategic control in dynamic tool invocation and failing to re-synthesize high-level hypotheses from low-level evidence, resulting in deadlocked reasoning.
Insufficient Safety Refusal: Even GPT-5.4 rejects tasks based on perceived security risk in only 0.86% of attempts, reflecting weak alignment in high-risk RE contexts.
Figure 4: A successful end-to-end sample for GPT-5.4: autonomous solution of an AES-128-CBC (O0) challenge with hardcoded key in only 9 action rounds.

Optimization and obfuscation settings are demonstrably effective. Performance degrades monotonically from O0 (unoptimized), to O3 (aggressively optimized), to Const-XOR (cryptographic signatures obfuscated at runtime):

Figure 5: Average pass@3 total scores drop consistently as binary complexity increases from O0 to O3 to Const-XOR.

Furthermore, comparison to OpenAI Codex and D-CIPHER showcases Codex’s superior agentic infrastructure, yielding higher recovery rates, but overall gains remain incremental given the underlying model's limitations.

Analysis of scoring breakdown across sub-tasks demonstrates a strong positive correlation between code reimplementation and flag recovery, but low correlation between key extraction and successful completion, due to endianness errors, confusion between key/IV, and incomplete program understanding.

Figure 6: Pairwise correlations among sub-tasks highlight a high association between code correctness and flag recovery.

Implications and Outlook

The empirical findings from CREBench (2604.03750) have substantial theoretical and practical ramifications for the field of trustworthy autonomy and agentic AI in security:

Model Limitations: Even SOTA LLMs remain fundamentally limited in long-horizon, high-obfuscation RE workloads. In particular, cognitive deadlock, local optimum bias in algorithm hypotheses, and lack of global strategic planning in tool invocation are persistent obstacles.
Dual-Use Security: Automated cryptographic RE presents clear dual-use risks; as LLMs approach parity with human experts, malicious exploitation potential escalates, underscoring the necessity for robust safety gating.
Benchmark Utility: CREBench provides a definitive, scalable, and rigorous testbed not only for tracking progress but also for stress-testing safety-relevant alignment in agentic architectures.
Obfuscation as a Roadblock: Simple obfuscation (e.g., Const-XOR) dramatically suppresses LLM performance, but CREBench does not yet incorporate extensive professional obfuscation (e.g., O-LLVM, Tigress), which is noted as a direction for future research as model context windows grow.

The demand for more advanced multi-agent orchestration, improved RE-centric reasoning chains, and deeper alignment are clear future avenues. Advancements must integrate refined agent strategies that interleave static, dynamic, and symbolic analysis with more robust hypothesis revision mechanisms and safety protocols.

Conclusion

CREBench (2604.03750) constitutes the first comprehensive cryptographic binary RE benchmark targeted at LLM agents, establishing new performance baselines and exposing substantial gaps to human-level expertise. While progress in agentic program analysis is non-trivial, the state of the art remains decisively behind skilled human practitioners, particularly under aggressive obfuscation and key fragmentation. The work enables structured, high-fidelity analysis of LLM capabilities, fosters quantitative safety evaluation in dual-use domains, and charts a concrete trajectory for future research at the intersection of AI, software security, and trustworthy autonomy.

Markdown Report Issue