Evaluating the Non-Functional Capabilities of Code LLMs: Insights from NoFunEval Benchmark
The paper "NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness" undertakes a comprehensive exploration of LLMs' (LM) proficiency in addressing non-functional requirements in code generation and editing tasks. This paper ventures beyond conventional assessments that predominantly focus on achieving functional correctness in code generation, by introducing a novel benchmark termed NoFunEval. The benchmark is meticulously engineered to evaluate LMs on non-functional attributes such as latency, resource utilization, runtime efficiency, maintainability, and security.
Key Contributions and Methodology
- Introduction of NoFunEval Benchmark: The paper proposes the NoFunEval benchmark, which encompasses tasks that reflect real-world software engineering challenges. The benchmark comprises three main tasks: NoFunEdit, which focuses on code editing to meet specific non-functional requirements; NoFunClassify, a comprehension task where models differentiate code snippets based on non-functional criteria; and HumanEvalClassify, which evaluates the ability of LMs to discern between functionally correct and incorrect code.
- Coding Concepts (CoCo) Prompting Method: To help LMs grasp and act on non-functional requirements, the paper devises a prompting method called Coding Concepts (CoCo). CoCo lets developers convey domain knowledge succinctly as short hints, which guide the model in understanding and carrying out the necessary code edits (a minimal prompt sketch follows this list).
- Comprehensive Evaluation Strategy: The paper evaluates 22 code LMs on the NoFunEval benchmark using metrics suited to each domain, including DiffBLEU for scoring generated code edits against reference edits and static-analysis tools such as CodeQL for maintainability and security (see the DiffBLEU sketch after this list).
- Insightful Results: The evaluation reveals a significant gap in current code LMs' ability to understand and implement non-functional requirements. Even on HumanEvalClassify, which only asks models to distinguish functionally correct from incorrect code, LMs show surprisingly low classification accuracy.
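To make the CoCo idea concrete, below is a minimal sketch of how such a prompt might be assembled. The template, wording, and concept hints here are illustrative assumptions, not the exact format used in the paper.

```python
# A minimal sketch of a CoCo-style prompt. The exact template and concept
# phrasing in the NoFunEval paper may differ; the concepts below are
# hypothetical examples of developer-provided domain knowledge.

def build_coco_prompt(source_code: str, requirement: str, concepts: list[str]) -> str:
    """Combine a code snippet, a non-functional requirement,
    and a short list of domain-knowledge hints ("coding concepts")."""
    concept_block = "\n".join(f"- {c}" for c in concepts)
    return (
        "You are given a code snippet and a non-functional requirement.\n"
        f"Requirement: {requirement}\n"
        "Relevant coding concepts to consider:\n"
        f"{concept_block}\n\n"
        "Code:\n"
        f"{source_code}\n\n"
        "Rewrite the code so that it satisfies the requirement while "
        "preserving its functional behavior."
    )

# Example usage with hypothetical inputs.
prompt = build_coco_prompt(
    source_code="for i in range(len(items)):\n    total += items[i]",
    requirement="Improve runtime efficiency.",
    concepts=[
        "prefer built-in aggregation over explicit loops",
        "avoid repeated attribute lookups inside hot loops",
    ],
)
print(prompt)
```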
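Likewise, the intuition behind DiffBLEU, scoring only the edited portion of the code rather than the entire file, can be sketched as follows. The helper names and the use of difflib and sacrebleu are assumptions for illustration, not the paper's implementation.

```python
# A rough sketch of a DiffBLEU-style metric: compute BLEU over the lines an
# edit introduces, rather than over the whole file. Approximation for
# illustration only; the paper's exact implementation may differ.

import difflib
import sacrebleu  # third-party package; assumed available

def added_lines(before: str, after: str) -> str:
    """Extract the lines an edit adds, using a unified diff."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    return "\n".join(
        line[1:] for line in diff
        if line.startswith("+") and not line.startswith("+++")
    )

def diff_bleu(source: str, model_edit: str, reference_edit: str) -> float:
    """BLEU (0-100) between the model's added lines and the reference's added lines."""
    hypothesis = added_lines(source, model_edit)
    reference = added_lines(source, reference_edit)
    return sacrebleu.sentence_bleu(hypothesis, [reference]).score

# Hypothetical usage: score a model edit against a reference edit of the same snippet.
src = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
ref_edit = "def total(xs):\n    return sum(xs)"
model_edit = "def total(xs):\n    return sum(x for x in xs)"
print(f"DiffBLEU-style score: {diff_bleu(src, model_edit, ref_edit):.1f}")
```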
Implications and Future Research Directions
The findings prompt a reevaluation of current training approaches and evaluation paradigms for code LMs. The paper highlights fundamental blind spots in how these models are trained, particularly around non-functional aspects of code quality that matter in real software engineering practice.
Practical Implications: The research underscores the need for training methodologies that incorporate coding criteria beyond functional correctness. It encourages AI researchers and practitioners to bring domain-specific knowledge into training regimes so that LMs perform better on real-world software engineering tasks.
Theoretical Implications and Future Work: From a theoretical standpoint, the paper opens new avenues for exploring how non-functional requirements can be quantitatively modeled and learned by AI systems. It also points to a trichotomy of LM abilities (generation, comprehension, and editing) that future models should be trained to balance. Subsequent research could extend benchmarks like NoFunEval to broader application domains and develop prompting strategies that integrate domain knowledge more seamlessly into model prompts and training.
In conclusion, the NoFunEval benchmark marks a significant step in evolving our understanding and expectations of code LMs, pushing toward models that can navigate both the functional and non-functional demands of software. This work lays a foundation for ongoing research to improve, and better assess, the practical applicability of LMs in software development environments.