- The paper introduces CodeMonkeys, a framework that scales test-time compute for Large Language Models by iteratively editing code alongside generated tests and by sampling many solution trajectories in parallel.
- CodeMonkeys achieves a 57.4% resolution rate on SWE-bench Verified issues, demonstrating the effectiveness of scaling test-time compute for solving real-world software problems.
- The framework employs a two-stage selection strategy, voting with model-generated tests and then applying model-based selection among the top candidates, and it can also select among and outperform candidates drawn from existing external submissions.
The paper "CodeMonkeys: Scaling Test-Time Compute for Software Engineering" describes a novel methodology aimed at enhancing the capabilities of LLMs in solving real-world software engineering problems, particularly GitHub issues, as per the SWE-bench dataset. This research focuses on leveraging test-time compute, which can be scaled both serially by increasing the number of iterations per trajectory and in parallel by amplifying the number of trajectories per problem.
System Overview and Approach:
- CodeMonkeys Framework: The system iteratively edits a codebase by generating a testing script alongside a draft edit, executing it, and revising both in response to the execution feedback. Many such multi-turn trajectories are run per issue, yielding several candidate edits (see the sketch after this list).
- Serial and Parallel Compute Scaling: Serial compute is scaled by increasing the number of iterations per trajectory, during which both the tests and the codebase edit are refined. Parallel compute is scaled by sampling multiple independent edit trajectories per issue, which also amortizes up-front costs such as having the model read large portions of the codebase.
- Context Identification: The system uses a straightforward LLM-based procedure that scans files, flags content relevant to the issue, and ranks the files by relevance. This condenses the context substantially, and its cost is amortized across all edits generated for the issue.
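The sketch below illustrates how one such trajectory and the context scan might be structured. It is a simplification under assumed interfaces (`call_llm`, `run_in_sandbox`) and prompt wording, not the paper's implementation.

```python
# Rough sketch of a single CodeMonkeys-style trajectory: a testing state machine
# and an editing state machine, each iterating on execution feedback.
# `call_llm` and `run_in_sandbox` are assumed placeholder interfaces.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns generated text."""
    return "..."

def run_in_sandbox(test_script: str, edit: str) -> str:
    """Placeholder: apply the edit, execute the test script, return its output."""
    return "test output"

def identify_context(issue: str, files: dict[str, str], max_files: int = 10) -> str:
    """Relevance scan: ask the model whether each file matters, keep the top ones."""
    relevant = [path for path in files
                if "yes" in call_llm(f"Is {path} relevant to:\n{issue}").lower()]
    return "\n".join(relevant[:max_files])

def run_trajectory(issue: str, context: str, n_test_iters: int = 3, n_edit_iters: int = 5):
    # Testing state machine: draft a script that reproduces the issue, then
    # refine it based on what happens when it runs against the unmodified code.
    test_script = call_llm(f"Write a test reproducing this issue:\n{issue}\n{context}")
    for _ in range(n_test_iters):
        output = run_in_sandbox(test_script, edit="")
        test_script = call_llm(f"Revise the test given this output:\n{output}\n{test_script}")

    # Editing state machine: draft a codebase edit, then revise it based on
    # whether the generated test passes once the edit is applied.
    edit = call_llm(f"Propose an edit resolving the issue:\n{issue}\n{context}")
    for _ in range(n_edit_iters):
        output = run_in_sandbox(test_script, edit)
        edit = call_llm(f"Revise the edit given this test output:\n{output}\n{edit}")

    return edit, test_script
```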
Performance and Cost Analysis:
- Coverage and Accuracy: Coverage, the fraction of issues for which at least one candidate edit resolves the issue, reaches 69.8% on SWE-bench Verified, showing the headroom that scaling test-time compute creates. After selection, CodeMonkeys achieves an overall resolution rate of 57.4%.
- Cost Efficiency: The full pipeline consumes roughly 2,300 USD of LLM inference across the benchmark, and the paper breaks this cost down by stage: context retrieval, test generation, edit generation, and candidate selection.
Selection Strategy:
- Voting and Model-Based Selection: CodeMonkeys uses a two-stage selection approach: model-generated tests first vote over the candidate edits, and a model-based selection step then chooses among the top-voted candidates (a sketch follows this list). This two-stage strategy effectively narrows down the candidates and substantially improves the quality of the final edit selection.
- Integration of Heterogeneous Sources: The selection method also works across candidates drawn from disparate sources; when selecting among an ensemble that includes edits from existing top SWE-bench submissions, it reaches a combined score of 66.2%, outperforming each individual submission.
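A hedged sketch of this two-stage selection appears below: candidates are first ranked by how many model-generated tests they pass, and a model-based step then chooses among the top-voted edits. The interfaces and prompt are assumptions for illustration, not the paper's code.

```python
# Hedged sketch of the two-stage selection: rank candidate edits by how many
# model-generated tests they pass, then let a model pick among the top-voted.
# `passes_test` and `call_llm` are assumed placeholder interfaces.

def passes_test(test_script: str, edit: str) -> bool:
    """Placeholder: apply `edit`, run `test_script`, report whether it passes."""
    return False

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call returning generated text."""
    return "0"

def select_edit(candidate_edits: list[str], generated_tests: list[str], top_k: int = 3) -> str:
    # Stage 1: vote with model-generated tests.
    votes = {edit: sum(passes_test(test, edit) for test in generated_tests)
             for edit in candidate_edits}
    top = sorted(candidate_edits, key=lambda e: votes[e], reverse=True)[:top_k]

    # Stage 2: model-based selection among the top-voted candidates.
    prompt = "Pick the best fix by index:\n" + "\n".join(
        f"[{i}] {edit}" for i, edit in enumerate(top))
    choice = call_llm(prompt)
    try:
        return top[int(choice.strip())]
    except (ValueError, IndexError):
        return top[0]  # fall back to the top-voted candidate
```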
Challenges and Future Directions:
The research highlights several areas for improvement:
- Contextual Comprehension: Improving the initial identification of relevant files and making better use of long context windows could improve solution accuracy.
- Edit Generation and Iteration: Richer execution feedback, beyond raw test results, could make edit iteration more effective.
- Selection Technique Enhancements: Richer model-based selection methods, including leveraging repository-specific tests and interactive feedback mechanisms, might further improve final scores.
Overall, the paper underscores the substantial benefits of scaling test-time compute for software engineering tasks, outlining a practical mechanism for managing and improving LLM capabilities in realistic software development settings. The results mark notable progress in computational strategies for systematic problem-solving in software engineering.