Evaluating LMs’ iterative, goal-oriented code development without explicit guidance

Design rigorous, standardized evaluation methodologies and benchmarks to assess whether large language models can iteratively build and refine software codebases toward open-ended objectives without explicit task guidance, with clear protocols and metrics that reflect real-world goal pursuit rather than unit-test correctness.

Background

The paper motivates CodeClash by observing that most existing coding benchmarks evaluate models on narrowly specified, instruction-driven tasks such as bug fixing or unit-test completion. In contrast, real-world software engineering is driven by high-level objectives requiring strategic decomposition, iterative refinement, and adaptation based on feedback.

The authors explicitly note that determining how to evaluate whether LLMs can iteratively improve codebases toward open-ended goals, without explicit guidance, is an open challenge. CodeClash is introduced as an initial step toward closing this gap: it stages competitive, multi-round arenas in which codebases serve as proxies for goal pursuit. This open problem underlies both the benchmark's design and its broader research agenda.
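To make the arena framing concrete, below is a minimal, hypothetical Python sketch of a multi-round competitive evaluation loop. It is not the CodeClash implementation: the names (Agent, edit_codebase, run_match, run_arena) are illustrative assumptions, and the LM edit step and the match execution are stubbed out with placeholders.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    """One competitor: an LM-driven editor plus the codebase it maintains."""
    name: str
    codebase: str                  # stands in for a full repository
    score: float = 0.0
    feedback: list[str] = field(default_factory=list)

def edit_codebase(agent: Agent) -> None:
    """Placeholder for the LM development step: in a real harness the model
    would read prior-round feedback and revise files toward the objective,
    with no explicit task instructions."""
    last = agent.feedback[-1] if agent.feedback else "none"
    agent.codebase += f"\n# revision informed by: {last}"

def run_match(a: Agent, b: Agent) -> tuple[float, float]:
    """Placeholder competition: a real arena would execute both codebases
    against each other (e.g., as game-playing bots) and score the outcome."""
    return random.random(), random.random()

def run_arena(agents: list[Agent], rounds: int = 5) -> None:
    for rnd in range(rounds):
        # 1) Development phase: each agent revises its own codebase.
        for agent in agents:
            edit_codebase(agent)
        # 2) Competition phase: pairwise matches produce the feedback signal
        #    that drives the next round's edits.
        for i, a in enumerate(agents):
            for b in agents[i + 1:]:
                sa, sb = run_match(a, b)
                a.score += sa
                b.score += sb
                a.feedback.append(f"round {rnd}: scored {sa:.2f} vs {b.name}")
                b.feedback.append(f"round {rnd}: scored {sb:.2f} vs {a.name}")
    # The metric is cumulative competitive performance, not unit-test passes.
    for agent in sorted(agents, key=lambda ag: ag.score, reverse=True):
        print(f"{agent.name}: {agent.score:.2f}")

if __name__ == "__main__":
    run_arena([Agent("model_a", "# seed"), Agent("model_b", "# seed")])
```

The key design point this sketch captures is that the only supervision signal is competitive outcome feedback across rounds, so success reflects sustained goal pursuit rather than correctness on a fixed test suite.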

References

"Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge."

CodeClash: Benchmarking Goal-Oriented Software Engineering (arXiv:2511.00839, Yang et al., 2 Nov 2025), Abstract (page 1)