ClassEval Benchmark for Class-Level Code Generation
- ClassEval Benchmark is a manually curated suite for evaluating LLMs on realistic class-level code synthesis with rich context and dependency structures.
- It uses rigorous metrics such as Pass@k and dependency coverage to assess compositional correctness and maintainability in Python classes.
- The benchmark extends to translation and robustness (ClassEval-T, ClassEval-Obf), highlighting challenges in scaling from function-level to class-level evaluation.
ClassEval Benchmark
The ClassEval Benchmark is a manually curated evaluation suite designed to rigorously assess LLMs on class-level code generation, with subsequent adaptations for translation, difficulty assessment, process-aware code synthesis, and identifier-obfuscation robustness. Developed to expose the gap between function-level and class-level code modeling, ClassEval is central to investigations of LLM capacity for state, context, and multi-method dependencies in realistic Python programming tasks, and has been expanded to multi-language translation (ClassEval-T) and robustness/intent evaluation (ClassEval-Obf) (Du et al., 2023, Xue et al., 2024, Le et al., 3 Oct 2025).
1. Benchmark Construction and Dataset Design
ClassEval comprises 100 hand-crafted Python class-level code generation tasks. Each task presents a natural-language class specification, a code skeleton (imports, class name, docstring, constructor, and method signatures), and a canonical implementation; an illustrative skeleton follows the feature list below. Key features:
- Complexity and Coverage: Each class has on average 4.12 methods, a typical length of 45.7 lines, and a context size of roughly 260 tokens, several times larger than in common function-level benchmarks such as HumanEval and MBPP. Canonical solutions are validated for semantic correctness against extensive unit tests (99.7% statement and 98.2% branch coverage), with an average of 8.0 method-level tests and 33.1 class-level tests per class (Du et al., 2023).
- Problem Diversity: Sourced from domain-generalized function-level tasks, PyPI libraries, and domain expert ideation, topics span management systems, data formatting, mathematical operations, game logic, file/database handling, and simple NLP algorithms.
- Dependency Structure: Methods exhibit 65.5% field-dependence, 26% inter-method calls, and 21.7% library usage; only 14.1% are fully standalone, underscoring the context-dependent coding requirements.
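As referenced above, the task format can be sketched with a hypothetical skeleton of the kind a model receives. The class, identifiers, and docstrings below are illustrative assumptions, not an actual ClassEval item:

```python
# Hypothetical ClassEval-style task skeleton (illustrative, not a real benchmark item).
# The model is given the class docstring, constructor, and method signatures,
# and must generate the method bodies.

class BankAccount:
    """A simple bank account supporting deposits, withdrawals, and transfers.

    Methods share state (self.balance) and call one another (transfer uses
    withdraw and deposit), mirroring ClassEval's field- and method-dependency
    structure.
    """

    def __init__(self, balance: float = 0.0):
        self.balance = balance

    def deposit(self, amount: float) -> float:
        """Add amount to the balance and return the new balance."""
        pass  # body to be generated

    def withdraw(self, amount: float) -> float:
        """Subtract amount if funds suffice; otherwise raise ValueError."""
        pass  # body to be generated

    def transfer(self, other: "BankAccount", amount: float) -> None:
        """Move amount from this account into `other`."""
        pass  # body to be generated
```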
2. Evaluation Protocols and Metrics
ClassEval evaluation extends the standard function-level metrics, emphasizing compositional correctness and code semantics:
- Class-level Functional Correctness: The core metric is Pass@k, the proportion of generated classes (from k samples) that pass all prescribed unit and integration tests. The unbiased estimator is $\text{Pass@}k = \mathbb{E}\big[1 - \binom{n-c}{k}/\binom{n}{k}\big]$, where $n$ is the number of samples and $c$ the number that pass all tests (Du et al., 2023). A minimal implementation sketch follows this list.
- Method-level Pass@k: Each method is evaluated independently, allowing partial credit in some research variants.
- Dependency Coverage: for both method and field references, coverage is computed as $|R_{\text{gen}} \cap R_{\text{gt}}| / |R_{\text{gt}}|$, quantifying the fraction of ground-truth method/field references included in generated solutions.
- Quality/Structure Metrics: For process models, SonarQube is applied (issue density per 10 non-comment LOC) to report maintainability, reliability, and “clean code” (Shafin et al., 12 Nov 2025).
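As referenced above, a minimal sketch of the two core metrics follows. The Pass@k estimator uses the standard unbiased form; the dependency-coverage helper is an assumption based on the description above (fraction of ground-truth method/field references reproduced by the generation), not the benchmark's exact implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k given n samples, c of which pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def dependency_coverage(generated_refs: set, ground_truth_refs: set) -> float:
    """Assumed form: fraction of ground-truth method/field references
    that appear in the generated class."""
    if not ground_truth_refs:
        return 1.0
    return len(generated_refs & ground_truth_refs) / len(ground_truth_refs)

# Example: 10 sampled classes for one task, 3 of which pass every class-level test.
print(pass_at_k(n=10, c=3, k=1))   # 0.30
print(pass_at_k(n=10, c=3, k=5))   # ~0.92
print(dependency_coverage({"self.balance", "deposit"},
                          {"self.balance", "deposit", "withdraw"}))  # ~0.67
```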
3. Empirical Results and Observed Model Behavior
Systematic empirical evaluation reveals distinct degradation in LLM performance when scaling from isolated methods to class-level synthesis:
| Model | ClassEval Pass@1 | HumanEval Pass@1 |
|---|---|---|
| GPT-4 | 37.6% | 85.4% |
| GPT-3.5-turbo | 29.6% | 68.9% |
| WizardCoder | 12.2% | 59.8% |
| Instruct-StarCoder | 10.2% | 34.1% |
| SantaCoder (1.1B) | 8.6% | 14.6% |
- Function-level ability does not transfer: A consistent 20–50 percentage-point drop is observed in Pass@k when moving from HumanEval (function-level) to ClassEval (class-level), indicating that ability to model local context and state is a major limiting factor (Du et al., 2023).
- Strategy-Dependent Outcomes: The latest instruction-tuned and fill-in-the-middle LLMs (GPT-4/3.5) perform best with holistic (full-class, one-shot) synthesis, which maximizes class-level correctness (+6–9% Pass@5), while smaller models do better with incremental or compositional prompting.
- Common error types: AttributeErrors (field initialization), TypeErrors (structure mismatches), and KeyErrors dominate incorrect outputs, validating that dependency mismanagement, not basic syntax, is the central failure mode (Du et al., 2023).
- Process models impact: Multi-agent (Waterfall-style) role-based synthesis increases code maintainability and reduces missing code, but often reduces functional correctness by 30–40% for most models (except Claude-3.5-Haiku, which gains 9.5%). Testing increases error detection but amplifies semantic failures overall (Shafin et al., 12 Nov 2025).
4. Extensions: Translation, Difficulty, and Robustness
a. ClassEval-T: Code Translation
ClassEval-T provides parallel class-level translations in Python, Java, and C++:
- Migration protocol: Each of the 94 ClassEval classes with 386 methods is manually ported (roughly 360 person-hours) and tested with an average of 33.8 test cases per class, yielding 99.7% statement and 98.2% branch coverage in all languages.
- Evaluation: Compilation Success Rate (CSR), class- and method-level computational accuracy (CA), and dependency recall (DEP) measure translation fidelity; a sketch of metric aggregation follows this list.
- Findings: LLMs drop 30–50 pp in CA when translating realistic classes versus method-level microtasks. “Holistic” translation (entire class at once) generally outperforms min-dependency or standalone migration (Xue et al., 2024).
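The ClassEval-T metrics above can be aggregated from per-class translation results roughly as follows; the data structure and field names are assumptions for illustration, not the paper's artifacts:

```python
from dataclasses import dataclass

@dataclass
class TranslationResult:
    compiled: bool             # did the translated class compile?
    class_tests_passed: bool   # did all class-level tests pass?
    methods_passed: int        # methods passing their test suites
    methods_total: int
    deps_recalled: int         # ground-truth dependencies reproduced
    deps_total: int

def aggregate(results: list) -> dict:
    """Assumed aggregation of CSR, class-/method-level CA, and DEP."""
    n = len(results)
    return {
        "CSR": sum(r.compiled for r in results) / n,
        "CA_class": sum(r.class_tests_passed for r in results) / n,
        "CA_method": sum(r.methods_passed for r in results)
                     / sum(r.methods_total for r in results),
        "DEP": sum(r.deps_recalled for r in results)
               / sum(r.deps_total for r in results),
    }

print(aggregate([
    TranslationResult(True, True, 4, 4, 5, 6),
    TranslationResult(True, False, 2, 5, 3, 7),
]))
```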
b. TaskEval: Difficulty Assessment
TaskEval (HardEval) employs compositional difficulty metrics, combining functional correctness, CodeBLEU similarity, and prompt robustness:
- No classical IRT is used; instead, correctness and similarity aggregated over multiple prompts yield a continuous "difficulty" score (see the sketch after this list).
- Most ClassEval methods are rated easy (65% with difficulty <0.4), but topics featuring nested control flow and exception handling dominate the hard region. Single-prompt pass/fail labels have only 62% overlap with multi-prompt hardness, affirming that prompt variation is essential for fair LLM difficulty profiling (Tambon et al., 2024).
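A minimal sketch of a multi-prompt difficulty score in the spirit of TaskEval/HardEval follows; the linear blend of correctness and similarity and the weights are assumptions for illustration, not the paper's exact formula:

```python
from statistics import mean

def difficulty(pass_rates: list, similarities: list,
               w_pass: float = 0.7, w_sim: float = 0.3) -> float:
    """Assumed blend: higher score means a harder task for the model.

    pass_rates   -- functional correctness per prompt variant
    similarities -- e.g. CodeBLEU per prompt variant
    """
    competence = w_pass * mean(pass_rates) + w_sim * mean(similarities)
    return 1.0 - competence

# Three prompt variants of the same ClassEval method:
print(difficulty([0.8, 0.6, 0.7], [0.75, 0.70, 0.72]))  # ~0.29 -> "easy" (< 0.4)
```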
c. Robustness: Identifier Obfuscation (ClassEval-Obf)
ClassEval-Obf introduces semantics-preserving identifier transformations (alpha-renaming, ambiguity, cross-domain, misleading semantics) to stress-test LLMs for naming leakage:
- Obfuscation causes significant drops in summarization quality (up to 29 pp at the class level) and nontrivial decreases in execution correctness (5–10 pp on high-complexity tasks).
- Adopting multi-obfuscation ensemble metrics reduces variance in model rankings, better isolating genuine semantic reasoning and limiting shortcuts from identifier memorization (Le et al., 3 Oct 2025).
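A minimal sketch of semantics-preserving alpha-renaming in the spirit of ClassEval-Obf, using Python's ast module (Python 3.9+ for ast.unparse). This is an illustrative assumption covering only the alpha-renaming variant, not the benchmark's actual obfuscation tooling:

```python
import ast
import builtins

class AlphaRenamer(ast.NodeTransformer):
    """Rename user-defined identifiers to opaque tokens (v0, v1, ...)."""

    def __init__(self):
        self.mapping = {}                                  # original -> opaque name
        self.protected = set(dir(builtins)) | {"self"}

    def _rename(self, name: str) -> str:
        if name in self.protected or (name.startswith("__") and name.endswith("__")):
            return name                                    # keep builtins, self, dunders
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_ClassDef(self, node):
        node.name = self._rename(node.name)
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name)
        for arg in node.args.args:
            arg.arg = self._rename(arg.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node

    def visit_Attribute(self, node):
        self.generic_visit(node)
        if isinstance(node.value, ast.Name) and node.value.id == "self":
            node.attr = self._rename(node.attr)            # fields/methods on self
        return node

source = """
class MovieTicket:
    def __init__(self, price):
        self.price = price

    def total_cost(self, count):
        return self.price * count
"""
tree = AlphaRenamer().visit(ast.parse(source))
print(ast.unparse(tree))  # same behaviour, identifiers replaced by v0, v1, ...
```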
5. Comparative Analysis, Limitations, and Future Recommendations
- Function- to Class-level Challenge: The inability of most LLMs to accurately model cross-method dependencies, state, and interface consistency at scale makes ClassEval a more discriminative tool for assessing true program synthesis competence than function-level evaluations (Du et al., 2023, Xue et al., 2024).
- Translation Difficulty: LLMs exhibit high variance in cross-language class migration, with substantial performance gaps favoring Python (reflecting pretraining bias and syntactic conciseness). Failure analysis is dominated by library/API mismatches, syntax, and dependency resolution issues (Xue et al., 2024).
- Prompt Engineering: Contextual granularity directly affects synthesis outcomes. For class-level or multi-method tasks, holistic and compositional prompts must be systematically studied to match model scaling properties.
- Benchmark Gaps: ClassEval lacks mixed-language, cross-class, or graph-structured scenario coverage. Corner cases and implicit assumptions in class docstrings/test suites may under-test helper routines; prompt and test set augmentation are recommended (Tambon et al., 2024).
- Robustness and Generalization: Identifier obfuscation is essential to mitigate operator-schema memorization by LLMs, especially in intent and summarization tasks. Reporting both pre- and post-obfuscation scores is recommended for future evaluations (Le et al., 3 Oct 2025).
- Comparison with Human Performance: In analogous Java project-level settings, undergraduate students exceed LLM performance by large margins, underscoring the gap between current LLM synthesis capabilities and practical software development requirements (Cao et al., 2024).
6. Availability and Reuse
ClassEval assets are distributed with task skeletons, canonical solutions, and test suites. The GitHub repository (https://github.com/FudanSELab/ClassEval) provides reproducibility scripts for all major evaluation protocols, including support for different prompting strategies, sampling settings, and dependency analysis. ClassEval-T and ClassEval-Obf are available per respective publications. Evaluation code includes functional, dependency, and complexity reporting; ongoing work augments these resources to target comprehensive difficulty calibration and code robustness benchmarks (Du et al., 2023, Xue et al., 2024, Le et al., 3 Oct 2025).
7. Significance for the Field
The ClassEval suite establishes the reference standard for class-level code generation benchmarking, highlighting both the architectural and prompting limitations of current LLMs when scaling to more realistic software synthesis tasks. Its empirical results have identified systematic deficiencies in context modeling, code generalization, and dependency tracking—areas ripe for architectural innovation. Follow-on work (ClassEval-T, ClassEval-Obf, process-aware evaluations) has extended its impact to robustness and translation assessment, further defining the research agenda for program synthesis and automated software engineering evaluation (Du et al., 2023, Xue et al., 2024, Le et al., 3 Oct 2025, Shafin et al., 12 Nov 2025).