ClassEval-Obf Benchmark: Evaluating LLMs on Obfuscated Code

Updated 8 October 2025
  • ClassEval-Obf is an obfuscation-enhanced benchmark that isolates structural code semantics from natural language naming cues.
  • It applies techniques like alpha-renaming, ambiguous substitution, and misleading semantics to preserve functionality while suppressing linguistic shortcuts.
  • Empirical evaluations show significant performance drops in LLMs, highlighting their reliance on identifier names for intent interpretation and execution tasks.

ClassEval-Obf is an obfuscation-enhanced benchmark curated to assess the code understanding and generalization capacity of LLMs while systematically suppressing natural language naming cues in source code. Building upon the original ClassEval dataset—which is designed for evaluating class-level code reasoning, generation, and summarization—ClassEval-Obf introduces semantics-preserving obfuscation techniques, ensuring that surface-level identifier information is minimized or distorted, while program behavior remains invariant. The benchmark is motivated by empirical findings that state-of-the-art LLMs frequently exploit natural identifier names as shortcuts for both execution and intent-level reasoning tasks, rather than relying solely on structural semantics (Le et al., 3 Oct 2025).

1. Motivation and Conceptual Foundations

The development of ClassEval-Obf is grounded in the observation that source code communicates meaning through dual channels:

  • Structural semantics: syntax, control flow, data flow, and the program's formal execution behavior.
  • Human-naturalness channel: identifier names, comments, and other linguistic constructs mapping to programmer intent.

Empirical analysis demonstrated that LLMs achieve artificially high performance on benchmarks when allowed to leverage the naturalness channel. For example, in summarization tasks, identifiers act as “semantic anchors” guiding models in generating intent-level descriptions. In execution tasks, identifiers can serve as latent cues to trigger memorized code/output associations. ClassEval-Obf is designed to obviate these shortcuts, isolating the core semantic reasoning abilities of LLMs (Le et al., 3 Oct 2025).

2. Obfuscation Methodologies

ClassEval-Obf applies systematically constructed semantics-preserving obfuscations. Major strategies include:

  • Alpha-renaming: All identifiers, including class, method, and variable names, are replaced by role-labeled but semantically void placeholders (e.g., class1, method1, var1), preserving scope structure while suppressing linguistic content (a sketch follows at the end of this section).
  • Ambiguous identifier substitution: Names are replaced by syntactically legal but human-meaningless tokens (e.g., llllIII, IlllIllllIlI), frustrating any natural mapping to programmer intent.
  • Cross-domain term mapping: Domain-relevant names are substituted with terms from unrelated fields, e.g., replacing an accounting variable “balance” with a chemical name “glucagon_d6”, eliminating latent domain cues.
  • Misleading semantics: Intentional semantic inversions are introduced, as in naming an accumulator variable “compute_max” in a summing context, to actively mislead models that rely on mnemonics instead of execution structure.
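
To make the last strategy concrete, here is a toy illustration (constructed for this article, not taken from the benchmark itself) in which a summing accumulator keeps its behavior but receives names that suggest a maximum computation:

```python
# Original: names match behavior (a summing accumulator).
def sum_values(values):
    total = 0
    for v in values:
        total += v
    return total

# After misleading-semantics obfuscation: behavior is identical,
# but the names now falsely suggest a maximum computation.
def compute_max(values):
    current_max = 0
    for v in values:
        current_max += v
    return current_max

assert sum_values([1, 2, 3]) == compute_max([1, 2, 3]) == 6
```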

Obfuscations are applied deterministically across the dataset with strict guarantees that the result is semantics-preserving: unit tests and output signatures remain identical post-transformation. Formally, for a program $P$ with identifier set $I$, the obfuscated program $P'$ is produced by applying a renaming function $f$ to every $id \in I$, such that $S(P') = S(P)$ (identical semantics) while the naturalness metric $N(P')$ is minimized.
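
A minimal sketch of the alpha-renaming strategy described above, using Python's ast module. The AlphaRenamer class and its role counters are illustrative assumptions, not the benchmark's actual tooling; a production obfuscator would also need to handle imports, builtins, attribute names, and capture avoidance (see Section 5).

```python
import ast

class AlphaRenamer(ast.NodeTransformer):
    """Replace identifiers with role-labeled placeholders (class1, method1,
    var1, ...). A sketch only: imports, builtins, and attribute names are
    left untouched and would need dedicated handling."""

    def __init__(self):
        self.mapping = {}
        self.counts = {"class": 0, "method": 0, "var": 0}

    def _fresh(self, role, original):
        # Assign each original name a stable, semantically void placeholder.
        if original not in self.mapping:
            self.counts[role] += 1
            self.mapping[original] = f"{role}{self.counts[role]}"
        return self.mapping[original]

    def visit_ClassDef(self, node):
        node.name = self._fresh("class", node.name)
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._fresh("method", node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        if node.arg != "self":  # keep `self` so method semantics are preserved
            node.arg = self._fresh("var", node.arg)
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            node.id = self._fresh("var", node.id)
        elif node.id in self.mapping:  # rename uses of already-mapped names
            node.id = self.mapping[node.id]
        return node

source = """
class Accumulator:
    def add(self, amount):
        total = amount + 1
        return total
"""
tree = AlphaRenamer().visit(ast.parse(source))
print(ast.unparse(tree))  # Accumulator->class1, add->method1, amount->var1, total->var2
```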

3. Experimental Evaluation and Key Results

ClassEval-Obf was empirically validated across both intent-level and execution-level benchmarks:

  • Intent summarization: Performance drops substantially when naming cues are removed. For example, GPT-4o’s class-level summarization accuracy on ClassEval falls from 87.3% (original) to 58.7% (obfuscated), indicating heavy dependence on naming for intent interpretation (Le et al., 3 Oct 2025).
  • Execution and output prediction: Contrary to expectations that structure alone suffices, LLMs also exhibit consistent degradations post-obfuscation. For instance, GPT-4o’s Pass@1 on LiveCodeBench drops from 82.9% to 75.5% with simple alpha-renaming. This demonstrates that models sometimes leverage names even when code behavior is formally invariant under renaming.
  • Benchmark resilience: Datasets like LiveCodeBench, where variable names are sparse and short (median length ≈ 2), show only minor declines under obfuscation. A plausible implication is that task design can mediate the level of identifier leakage.

These results strongly suggest that many current LLM benchmarks systematically overestimate semantic reasoning by rewarding memorization patterns linked to identifier names.
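
For context, Pass@1 figures like those above are conventionally computed with the unbiased pass@k estimator popularized by HumanEval-style evaluation. The source does not specify the exact protocol used here, so the following is a standard reference sketch rather than the paper's own implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: the probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    passes all tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per task, 8 correct, evaluated at k = 1.
print(pass_at_k(10, 8, 1))  # 0.8
```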

4. Implications for LLM Evaluation and Development

The application of ClassEval-Obf reveals several crucial implications:

  • Benchmark reliability: By suppressing naming cues, ClassEval-Obf yields more reliable and realistic lower bounds for LLM performance on reasoning and generalization tasks.
  • Dependency on surface cues: Significant performance drops indicate that models optimize for shortcut solutions rather than deep structural reasoning—even in tasks where such reliance should be irrelevant.
  • Evaluation protocols: The use of ClassEval-Obf motivates the reporting of paired pre- and post-obfuscation metrics. This dual reporting can help researchers disentangle actual generalization from pattern matching or retrieval.
  • Model and training improvements: Training or fine-tuning LLMs using codebases where names are anonymized or intentionally misleading may encourage models to focus more on structural features.

5. Benchmark Construction and Integrity

The construction of ClassEval-Obf requires deterministic, semantics-preserving identifier replacements, whether simple alpha-renaming or more elaborate strategies. The process must ensure that:

  • All identifier replacements across code and documentation are capture-avoiding to prevent naming collisions.
  • No obfuscation introduces spurious behavior, as validated by comprehensive execution-based regression tests.
  • The naturalness metric $N(P')$ quantitatively decreases, ideally approaching that of randomly named or automatically generated identifier sets.
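
The source does not define the naturalness metric $N(\cdot)$. As a purely hypothetical proxy, one could score the fraction of identifier fragments that are recognizable English words; the vocabulary and tokenization below are illustrative assumptions, and a real metric would likely use a learned language model over identifiers:

```python
import ast

# Hypothetical naturalness proxy (not the paper's N): the fraction of
# identifier fragments that appear in a small English vocabulary.
VOCAB = {"compute", "max", "total", "balance", "amount", "add", "sum", "value"}

def naturalness(source: str) -> float:
    names = {n.id for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Name)}
    fragments = [frag for name in names for frag in name.lower().split("_") if frag]
    if not fragments:
        return 0.0
    return sum(frag in VOCAB for frag in fragments) / len(fragments)

print(naturalness("total = balance + amount"))  # 1.0 (all fragments natural)
print(naturalness("llllIII = IlllIl + var1"))   # 0.0 (no fragment natural)
```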

Obfuscation scripts are applied uniformly, and the benchmark consistently passes original test suite outcomes, ensuring experimental integrity.
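
A sketch of such an execution-based integrity check, assuming each task ships a pytest-style suite shared verbatim by the original and obfuscated versions; the directory layout and test command are assumptions, not the benchmark's documented pipeline:

```python
import subprocess

def suite_passes(workdir: str) -> bool:
    """Run the unit-test suite in `workdir`; exit code 0 means all passed."""
    proc = subprocess.run(["python", "-m", "pytest", "-q"],
                          cwd=workdir, capture_output=True, text=True)
    return proc.returncode == 0

def semantics_preserved(original_dir: str, obfuscated_dir: str) -> bool:
    # Both versions must pass the identical suite. A fuller check would
    # compare per-test outcomes and output signatures, as described above.
    return suite_passes(original_dir) and suite_passes(obfuscated_dir)

# e.g. semantics_preserved("classeval/task_001", "classeval_obf/task_001")
# (hypothetical paths for illustration)
```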

6. Future Prospects and Research Directions

Several avenues for further work arise from the deployment of ClassEval-Obf:

  • Expansion of obfuscation strategies: Developing even more dynamic or fine-grained obfuscators to challenge models at deeper levels—for instance, by obfuscating local code idiom patterns or introducing syntactic “noise” that does not alter control/data flow.
  • Integration with execution trace-based evaluation: Merging with frameworks that require the prediction of intermediate states or control flow, strengthening assessment of true semantic reasoning (Chen et al., 25 Mar 2024).
  • Compositional robustness: Examining how models deal with code where both naming and structure are simultaneously obfuscated, simulating code “in the wild.”
  • Comparison with real-world codebases: Confirming that observations generalize from synthetic benchmarks to complex, organically evolved repositories.
  • Development of evaluation suites: Advocating that new LLM evaluation protocols include both canonical (natural) and obfuscated benchmarks as standard practice, improving the diagnostic granularity of model behavior.

7. Significance within the Broader Ecosystem

ClassEval-Obf constitutes an important benchmark for the rigorous evaluation of LLMs in code-related tasks. By isolating the structural reasoning channel from naming-based shortcuts, it encourages development and analysis of models that demonstrate genuine semantic comprehension and generalization. The methodology is directly extensible to future code reasoning, summarization, and synthesis tasks, and can inform both model architecture choices and training regimes for robust, real-world code understanding. The approach complements other advances in code reasoning evaluation and has motivated critical reevaluation of how current benchmarks reflect true model capabilities (Le et al., 3 Oct 2025).

Aspect                        Before Obfuscation     After Obfuscation
Summarization accuracy        High (e.g., 87.3%)     Markedly lower (58.7%)
Execution Pass@1              High (e.g., 82.9%)     Noticeably reduced (75.5%)
Identifier length (median)    Variable/long          Short/ambiguous
Utility of naming cues        Critical               Suppressed

This evidence demonstrates the central role that naming cues play in LLM performance and underscores the need for systematic obfuscation benchmarks in future research.
