Evaluating LMs’ iterative, goal-oriented code development without explicit guidance
Develop rigorous, standardized evaluation methodologies and benchmarks for assessing whether large language models can iteratively develop and refine software codebases to accomplish open-ended objectives without explicit task guidance. Such evaluations require clear protocols and metrics that reflect real-world goal pursuit rather than unit-test correctness; a minimal sketch of one such protocol follows.
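As a concrete illustration, the sketch below shows one possible shape for such a protocol: an LM repeatedly edits a codebase toward an open-ended objective, and after each round it is scored on a goal-level metric (for example, an arena win rate, in the spirit of CodeClash) rather than on a fixed test suite. All names here (`Codebase`, `refine_codebase`, `objective_score`, `evaluate`) are hypothetical placeholders for illustration, not the CodeClash API.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Codebase:
    """Hypothetical container for the evolving project under evaluation."""
    files: dict[str, str] = field(default_factory=dict)


def refine_codebase(model_name: str, codebase: Codebase, objective: str) -> Codebase:
    """Hypothetical stand-in for one LM editing round.

    A real harness would prompt the model with the objective and the
    current files, apply its proposed edits, and return the result.
    """
    codebase.files[f"round_{len(codebase.files)}.py"] = "# LM-authored edit"
    return codebase


def objective_score(codebase: Codebase) -> float:
    """Hypothetical goal-level metric in [0, 1], e.g. a tournament win rate.

    The key property: it measures pursuit of the open-ended objective,
    not whether a predefined unit-test suite passes.
    """
    return random.random()  # placeholder; a real harness would run the arena


def evaluate(model_name: str, objective: str, rounds: int = 10) -> list[float]:
    """Score the model after each refinement round.

    The trajectory of scores across rounds, not any single value, is the
    evaluation signal: it captures whether the model improves the codebase
    iteratively without explicit per-step guidance.
    """
    codebase, scores = Codebase(), []
    for _ in range(rounds):
        codebase = refine_codebase(model_name, codebase, objective)
        scores.append(objective_score(codebase))
    return scores


if __name__ == "__main__":
    print(evaluate("some-model", "win the resource-gathering arena"))
```

The design choice worth noting is that the harness never tells the model which edits to make or which tests to satisfy; it only exposes the objective and measures outcomes, which is what distinguishes goal-oriented evaluation from test-driven benchmarks.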
References
Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge.
— CodeClash: Benchmarking Goal-Oriented Software Engineering
(arXiv:2511.00839, Yang et al., 2 Nov 2025), Abstract (page 1)