ClassEval-Obf Benchmark: Evaluating LLMs on Obfuscated Code

Updated 8 October 2025
  • ClassEval-Obf is an obfuscation-enhanced benchmark that isolates structural code semantics from natural language naming cues.
  • It applies techniques like alpha-renaming, ambiguous substitution, and misleading semantics to preserve functionality while suppressing linguistic shortcuts.
  • Empirical evaluations show significant performance drops in LLMs, highlighting their reliance on identifier names for intent interpretation and execution tasks.

ClassEval-Obf is an obfuscation-enhanced benchmark curated to assess the code understanding and generalization capacity of LLMs while systematically suppressing natural language naming cues in source code. Building upon the original ClassEval dataset—which is designed for evaluating class-level code reasoning, generation, and summarization—ClassEval-Obf introduces semantics-preserving obfuscation techniques, ensuring that surface-level identifier information is minimized or distorted, while program behavior remains invariant. The benchmark is motivated by empirical findings that state-of-the-art LLMs frequently exploit natural identifier names as shortcuts for both execution and intent-level reasoning tasks, rather than relying solely on structural semantics (Le et al., 3 Oct 2025).

1. Motivation and Conceptual Foundations

The development of ClassEval-Obf is grounded in the observation that source code communicates meaning through dual channels:

  • Structural semantics: syntax, control flow, data flow, and the program's formal execution behavior.
  • Human-naturalness channel: identifier names, comments, and other linguistic constructs mapping to programmer intent.

Empirical analysis demonstrated that LLMs achieve artificially high performance on benchmarks when allowed to leverage the naturalness channel. For example, in summarization tasks, identifiers act as “semantic anchors” guiding models in generating intent-level descriptions. In execution tasks, identifiers can serve as latent cues to trigger memorized code/output associations. ClassEval-Obf is designed to obviate these shortcuts, isolating the core semantic reasoning abilities of LLMs (Le et al., 3 Oct 2025).

2. Obfuscation Methodologies

ClassEval-Obf applies systematically constructed semantics-preserving obfuscations. Major strategies include:

  • Alpha-renaming: All identifiers, including class, method, and variable names, are replaced by role-labeled but semantically void placeholders (e.g., class1, method1, var1), preserving scope structure while suppressing linguistic content (a sketch follows at the end of this section).
  • Ambiguous identifier substitution: Names are replaced by syntactically legal but human-meaningless tokens (e.g., llllIII, IlllIllllIlI), frustrating any natural mapping to programmer intent.
  • Cross-domain term mapping: Domain-relevant names are substituted with terms from unrelated fields, e.g., replacing an accounting variable “balance” with a chemical name “glucagon_d6”, eliminating latent domain cues.
  • Misleading semantics: Intentional semantic inversions are introduced, as in naming an accumulator variable “compute_max” in a summing context, to actively mislead models that rely on mnemonics instead of execution structure.
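
To make the last strategy concrete, here is a toy illustration (constructed for this article, not taken from the benchmark itself) in which a summing accumulator keeps its behavior but receives names that suggest a maximum computation:

```python
# Original: names match behavior (a summing accumulator).
def sum_values(values):
    total = 0
    for v in values:
        total += v
    return total

# After misleading-semantics obfuscation: behavior is identical,
# but the names now falsely suggest a maximum computation.
def compute_max(values):
    current_max = 0
    for v in values:
        current_max += v
    return current_max

assert sum_values([1, 2, 3]) == compute_max([1, 2, 3]) == 6
```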

Obfuscations are applied deterministically across the dataset with strict guarantees that the result is semantics-preserving: unit tests and output signatures remain identical post-transformation. Formally, for a program $P$ with identifier set $I$, the obfuscated program $P'$ is produced by applying a renaming function $f$ to every $id \in I$, such that $S(P') = S(P)$ (identical semantics) while the naturalness metric $N(P')$ is minimized.
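
A minimal sketch of the alpha-renaming strategy described above, using Python's ast module. The AlphaRenamer class and its role counters are illustrative assumptions, not the benchmark's actual tooling; a production obfuscator would also need to handle imports, builtins, attribute names, and capture avoidance (see Section 5).

```python
import ast

class AlphaRenamer(ast.NodeTransformer):
    """Replace identifiers with role-labeled placeholders (class1, method1,
    var1, ...). A sketch only: imports, builtins, and attribute names are
    left untouched and would need dedicated handling."""

    def __init__(self):
        self.mapping = {}
        self.counts = {"class": 0, "method": 0, "var": 0}

    def _fresh(self, role, original):
        # Assign each original name a stable, semantically void placeholder.
        if original not in self.mapping:
            self.counts[role] += 1
            self.mapping[original] = f"{role}{self.counts[role]}"
        return self.mapping[original]

    def visit_ClassDef(self, node):
        node.name = self._fresh("class", node.name)
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._fresh("method", node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        if node.arg != "self":  # keep `self` so method semantics are preserved
            node.arg = self._fresh("var", node.arg)
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            node.id = self._fresh("var", node.id)
        elif node.id in self.mapping:  # rename uses of already-mapped names
            node.id = self.mapping[node.id]
        return node

source = """
class Accumulator:
    def add(self, amount):
        total = amount + 1
        return total
"""
tree = AlphaRenamer().visit(ast.parse(source))
print(ast.unparse(tree))  # Accumulator->class1, add->method1, amount->var1, total->var2
```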

3. Experimental Evaluation and Key Results

ClassEval-Obf was empirically validated across both intent-level and execution-level benchmarks:

  • Intent summarization: Performance drops substantially when naming cues are removed. For example, GPT-4o’s class-level summarization accuracy on ClassEval falls from 87.3% (original) to 58.7% (obfuscated), indicating heavy dependence on naming for intent interpretation (Le et al., 3 Oct 2025).
  • Execution and output prediction: Contrary to expectations that structure alone suffices, LLMs also exhibit consistent degradations post-obfuscation. For instance, GPT-4o’s Pass@1 on LiveCodeBench drops from 82.9% to 75.5% with simple alpha-renaming. This demonstrates that models sometimes leverage names even when code behavior is formally invariant under renaming.
  • Benchmark resilience: Datasets like LiveCodeBench, where variable names are sparse and short (median length ≈ 2), show only minor declines under obfuscation. A plausible implication is that task design can mediate the level of identifier leakage.

These results strongly suggest that many current LLM benchmarks systematically overestimate semantic reasoning by rewarding memorization patterns linked to identifier names.
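
For context, Pass@1 figures like those above are conventionally computed with the unbiased pass@k estimator popularized by HumanEval-style evaluation. The source does not specify the exact protocol used here, so the following is a standard reference sketch rather than the paper's own implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: the probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    passes all tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per task, 8 correct, evaluated at k = 1.
print(pass_at_k(10, 8, 1))  # 0.8
```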

4. Implications for LLM Evaluation and Development

The application of ClassEval-Obf reveals several crucial implications:

  • Benchmark reliability: By suppressing naming cues, ClassEval-Obf yields more reliable and realistic lower bounds for LLM performance on reasoning and generalization tasks.
  • Dependency on surface cues: Significant performance drops indicate that models optimize for shortcut solutions rather than deep structural reasoning—even in tasks where such reliance should be irrelevant.
  • Evaluation protocols: The use of ClassEval-Obf motivates the reporting of paired pre- and post-obfuscation metrics. This dual reporting can help researchers disentangle actual generalization from pattern matching or retrieval.
  • Model and training improvements: Training or fine-tuning LLMs using codebases where names are anonymized or intentionally misleading may encourage models to focus more on structural features.

5. Benchmark Construction and Integrity

The construction of ClassEval-Obf requires deterministic, semantics-preserving identifier replacements, whether simple alpha-renaming or more elaborate strategies. The process must ensure that:

  • All identifier replacements across code and documentation are capture-avoiding to prevent naming collisions.
  • No obfuscation introduces spurious behavior, as validated by comprehensive execution-based regression tests.
  • The naturalness metric $N(P')$ quantitatively decreases, ideally approaching that of randomly named or automatically generated identifier sets.
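
The source does not define the naturalness metric $N(\cdot)$. As a purely hypothetical proxy, one could score the fraction of identifier fragments that are recognizable English words; the vocabulary and tokenization below are illustrative assumptions, and a real metric would likely use a learned language model over identifiers:

```python
import ast

# Hypothetical naturalness proxy (not the paper's N): the fraction of
# identifier fragments that appear in a small English vocabulary.
VOCAB = {"compute", "max", "total", "balance", "amount", "add", "sum", "value"}

def naturalness(source: str) -> float:
    names = {n.id for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Name)}
    fragments = [frag for name in names for frag in name.lower().split("_") if frag]
    if not fragments:
        return 0.0
    return sum(frag in VOCAB for frag in fragments) / len(fragments)

print(naturalness("total = balance + amount"))  # 1.0 (all fragments natural)
print(naturalness("llllIII = IlllIl + var1"))   # 0.0 (no fragment natural)
```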

Obfuscation scripts are applied uniformly, and the benchmark consistently passes original test suite outcomes, ensuring experimental integrity.
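
A sketch of such an execution-based integrity check, assuming each task ships a pytest-style suite shared verbatim by the original and obfuscated versions; the directory layout and test command are assumptions, not the benchmark's documented pipeline:

```python
import subprocess

def suite_passes(workdir: str) -> bool:
    """Run the unit-test suite in `workdir`; exit code 0 means all passed."""
    proc = subprocess.run(["python", "-m", "pytest", "-q"],
                          cwd=workdir, capture_output=True, text=True)
    return proc.returncode == 0

def semantics_preserved(original_dir: str, obfuscated_dir: str) -> bool:
    # Both versions must pass the identical suite. A fuller check would
    # compare per-test outcomes and output signatures, as described above.
    return suite_passes(original_dir) and suite_passes(obfuscated_dir)

# e.g. semantics_preserved("classeval/task_001", "classeval_obf/task_001")
# (hypothetical paths for illustration)
```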

6. Future Prospects and Research Directions

Several avenues for further work arise from the deployment of ClassEval-Obf:

  • Expansion of obfuscation strategies: Developing even more dynamic or fine-grained obfuscators to challenge models at deeper levels—for instance, by obfuscating local code idiom patterns or introducing syntactic “noise” that does not alter control/data flow.
  • Integration with execution trace-based evaluation: Merging with frameworks that require the prediction of intermediate states or control flow, strengthening assessment of true semantic reasoning (Chen et al., 25 Mar 2024).
  • Compositional robustness: Examining how models deal with code where both naming and structure are simultaneously obfuscated, simulating code “in the wild.”
  • Comparison with real-world codebases: Confirming that observations generalize from synthetic benchmarks to complex, organically evolved repositories.
  • Development of evaluation suites: Advocating that new LLM evaluation protocols include both canonical (natural) and obfuscated benchmarks as standard practice, improving the diagnostic granularity of model behavior.

7. Significance within the Broader Ecosystem

ClassEval-Obf constitutes an important benchmark for the rigorous evaluation of LLMs in code-related tasks. By isolating the structural reasoning channel from naming-based shortcuts, it encourages development and analysis of models that demonstrate genuine semantic comprehension and generalization. The methodology is directly extensible to future code reasoning, summarization, and synthesis tasks, and can inform both model architecture choices and training regimes for robust, real-world code understanding. The approach complements other advances in code reasoning evaluation and has motivated critical reevaluation of how current benchmarks reflect true model capabilities (Le et al., 3 Oct 2025).

Aspect                        Before Obfuscation     After Obfuscation
Summarization accuracy        High (e.g., 87.3%)     Markedly lower (58.7%)
Execution Pass@1              High (e.g., 82.9%)     Noticeably reduced (75.5%)
Identifier length (median)    Variable/long          Short/ambiguous
Utility of naming cues        Critical               Suppressed

This evidence demonstrates the central role that naming cues play in LLM performance and underscores the need for systematic obfuscation benchmarks in future research.
