Universal metric for evaluating LLM-generated code

Develop a universally accepted, holistic metric that evaluates all aspects of code generated by large language models, providing a general-purpose evaluation approach that is not tied to specific contexts or tasks.

Background

The paper assesses the quality of code generated for CAPEC and CWE mappings using GPT-4o and notes that existing evaluation metrics, such as unit tests and Pass@k, are task- or context-specific. Due to this limitation, the authors rely on manual evaluation for aspects like compilability, relevance, and readability.
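To illustrate why Pass@k is context-specific: it requires executable unit tests for each task, so it cannot score aspects such as readability or relevance. A minimal sketch of the widely used unbiased Pass@k estimator (as popularized by the HumanEval benchmark; this exact formulation is an assumption, not taken from the paper) looks like:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k.

    n: total number of samples generated for a task
    c: number of those samples that pass the task's unit tests
    k: number of samples "allowed" per task

    Note the dependence on unit-test results (c): without an
    executable test suite for the task, pass@k is undefined,
    which is exactly the task-specificity noted in the text.
    """
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n-c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 2 samples of which c = 1 passes, pass@1 is 0.5: a single randomly drawn sample passes half the time.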

They explicitly state that creating a universally accepted, holistic metric for evaluating all aspects of LLM-generated code remains an open research area, motivating the need for a comprehensive, standardized evaluation framework.

References

A universally accepted, holistic metric for evaluating all aspects of LLM-generated code is still an open research area.

From Theory to Practice: Code Generation Using LLMs for CAPEC and CWE Frameworks  (2604.02548 - Shahzad et al., 2 Apr 2026) in Section: Evaluation, Subsection: Evaluation of the generated code