Evaluating LLMs on Class-Level Code Generation with ClassEval
The rapidly evolving field of LLMs has brought notable advances in code generation. Most recent studies, however, focus on function-level or statement-level generation, typified by benchmarks such as HumanEval, which do not capture the intricacies of generating structured, multi-method code such as classes. The paper "ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation" addresses this gap by introducing ClassEval, the first benchmark specifically designed to evaluate class-level code generation.
ClassEval provides a test suite and canonical implementation for 100 manually constructed class-level Python coding tasks, covering diverse topics such as management systems and game development. Built over approximately 500 person-hours, the benchmark challenges models to generate classes composed of multiple interdependent methods, a setting intended to mirror real-world software development, where code units are not isolated but interact with one another at various levels.
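To make the task format concrete, the sketch below shows the rough shape of a class-level task and its accompanying unit test. The class, its methods, and the test are hypothetical illustrations of mine, not an actual ClassEval task; they simply exhibit the field and method dependencies the benchmark is designed to exercise.

```python
# Hypothetical sketch of a class-level task: the model would receive a skeleton
# with docstrings and must implement methods that share state (field dependency)
# and call each other (method dependency). Not an actual ClassEval task.
import unittest


class BookInventory:
    """Illustrative management-system style task: track book stock levels."""

    def __init__(self):
        self.stock = {}  # field shared by all methods below (field dependency)

    def add_book(self, title, count):
        """Add `count` copies of `title` to the inventory."""
        self.stock[title] = self.stock.get(title, 0) + count

    def has_stock(self, title, count):
        """Check whether at least `count` copies of `title` are available."""
        return self.stock.get(title, 0) >= count

    def remove_book(self, title, count):
        """Remove copies, relying on has_stock (a method dependency)."""
        if not self.has_stock(title, count):
            return False
        self.stock[title] -= count
        return True


class TestBookInventory(unittest.TestCase):
    """Tests exercise the methods together, as ClassEval's suites do per class."""

    def test_add_then_remove(self):
        inv = BookInventory()
        inv.add_book("Dune", 3)
        self.assertTrue(inv.remove_book("Dune", 2))
        self.assertEqual(inv.stock["Dune"], 1)


if __name__ == "__main__":
    unittest.main()
```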
Empirical Evaluations
The paper evaluates 11 prominent LLMs on ClassEval using three distinct generation strategies: holistic, incremental, and compositional generation. GPT-4 and GPT-3.5 performed best, yet every model degraded noticeably on class-level code compared with method-level benchmarks like HumanEval: class-level Pass@1 for the GPT models (37.0% for GPT-4 and 27.0% for GPT-3.5) was substantially lower than their method-level results, owing to the increased complexity. This substantial dip highlights how poorly function-level proficiency translates to class-level contexts.
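For reference, Pass@k here is the standard unbiased estimator commonly used in code-generation benchmarks (Chen et al., 2021): for each task, draw n samples, count the c that pass all tests, and compute 1 − C(n−c, k)/C(n, k), then average over tasks. The sketch below shows the calculation with made-up sample counts, not the paper's data.

```python
# Minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# The sample counts in `results` are illustrative, not taken from the paper.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task given n samples, c of which pass all tests."""
    if n - c < k:
        return 1.0  # every possible k-subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Average per-task estimates over a benchmark (hypothetical counts).
results = [(10, 4), (10, 0), (10, 1)]  # (samples drawn, samples passing) per task
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"Pass@1 = {pass_at_1:.3f}")
```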
Among the generation strategies, holistic generation worked best for models like GPT-4 and GPT-3.5, which excel at incorporating extensive context. Other models benefited more from the incremental and compositional strategies, likely because they struggle with the long contextual instructions that holistic prompting requires. The paper also surfaces finer-grained insights, for example that models handle field dependencies more effectively than method dependencies, pointing to where future training effort might be directed.
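The practical difference between the strategies is what each generation request sees. The sketch below is my own paraphrase of their shapes under simplified assumptions (a generic `generate` completion callable and plain string concatenation), not the paper's actual prompt templates or assembly code.

```python
# Rough paraphrase of the three generation strategies; the function names and
# the `generate` callable are assumptions, not the paper's implementation.
from typing import Callable, List


def holistic(class_skeleton: str, generate: Callable[[str], str]) -> str:
    # One request: the full class skeleton; the model emits the entire class.
    return generate(class_skeleton)


def incremental(class_skeleton: str, method_sigs: List[str],
                generate: Callable[[str], str]) -> str:
    # One request per method; each prompt includes the methods generated so far.
    partial_class = class_skeleton
    for sig in method_sigs:
        partial_class += "\n" + generate(partial_class + "\n" + sig)
    return partial_class


def compositional(class_skeleton: str, method_sigs: List[str],
                  generate: Callable[[str], str]) -> str:
    # One independent request per method; the results are assembled afterwards.
    bodies = [generate(class_skeleton + "\n" + sig) for sig in method_sigs]
    return class_skeleton + "\n" + "\n".join(bodies)
```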
Implications and Future Work
This work suggests that progress in method-level code generation does not carry over to the more sophisticated task of class-level generation: a model's method-level coding ability is not an adequate indicator of how effectively it can generate class-level code. The results also offer valuable guidance on which generation strategies are likely to benefit particular kinds of LLMs, depending on their capability profiles.
The ClassEval benchmark opens avenues for developing more robust LLMs that can handle complex coding tasks involving multiple interdependencies within a class. Future research could explore architectures that better process long contextual inputs and better integrate interdependent code structures; progress in these areas may yield models that perform class-level generation as adeptly as simpler code-generation tasks. More broadly, the paper underscores the importance of benchmark diversification in fully assessing LLM capabilities and steering model development toward practical application domains.