- The paper introduces CodeMonkeys, a framework that scales test-time compute for Large Language Models by iteratively editing code alongside generated tests and by sampling many solution trajectories in parallel.
- CodeMonkeys achieves a 57.4% resolution rate on SWE-bench Verified issues, demonstrating the effectiveness of scaling test-time compute for solving real-world software problems.
- The framework employs a two-stage selection strategy, voting with model-generated tests and then applying model-based selection among the top candidates, and it can also select among and outperform candidates drawn from existing external submissions.
The paper "CodeMonkeys: Scaling Test-Time Compute for Software Engineering" describes a novel methodology aimed at enhancing the capabilities of LLMs in solving real-world software engineering problems, particularly GitHub issues, as per the SWE-bench dataset. This research focuses on leveraging test-time compute, which can be scaled both serially by increasing the number of iterations per trajectory and in parallel by amplifying the number of trajectories per problem.
System Overview and Approach:
- CodeMonkeys Framework: The system iteratively edits a codebase by generating a testing script alongside a draft edit, executing it, and revising both in response to the execution feedback. Many such multi-turn trajectories are run per issue, yielding several candidate edits (see the sketch after this list).
- Serial and Parallel Compute Scaling: Serial compute is scaled by increasing the number of iterations per trajectory, during which both the tests and the codebase edit are refined. Parallel compute is scaled by sampling multiple independent edit trajectories per issue, which also amortizes up-front costs such as having the model read large portions of the codebase.
- Context Identification: The system uses a straightforward LLM-based procedure that scans files, flags content relevant to the issue, and ranks the files by relevance. This condenses the context substantially, and its cost is amortized across all edits generated for the issue.
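The sketch below illustrates how one such trajectory and the context scan might be structured. It is a simplification under assumed interfaces (`call_llm`, `run_in_sandbox`) and prompt wording, not the paper's implementation.

```python
# Rough sketch of a single CodeMonkeys-style trajectory: a testing state machine
# and an editing state machine, each iterating on execution feedback.
# `call_llm` and `run_in_sandbox` are assumed placeholder interfaces.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns generated text."""
    return "..."

def run_in_sandbox(test_script: str, edit: str) -> str:
    """Placeholder: apply the edit, execute the test script, return its output."""
    return "test output"

def identify_context(issue: str, files: dict[str, str], max_files: int = 10) -> str:
    """Relevance scan: ask the model whether each file matters, keep the top ones."""
    relevant = [path for path in files
                if "yes" in call_llm(f"Is {path} relevant to:\n{issue}").lower()]
    return "\n".join(relevant[:max_files])

def run_trajectory(issue: str, context: str, n_test_iters: int = 3, n_edit_iters: int = 5):
    # Testing state machine: draft a script that reproduces the issue, then
    # refine it based on what happens when it runs against the unmodified code.
    test_script = call_llm(f"Write a test reproducing this issue:\n{issue}\n{context}")
    for _ in range(n_test_iters):
        output = run_in_sandbox(test_script, edit="")
        test_script = call_llm(f"Revise the test given this output:\n{output}\n{test_script}")

    # Editing state machine: draft a codebase edit, then revise it based on
    # whether the generated test passes once the edit is applied.
    edit = call_llm(f"Propose an edit resolving the issue:\n{issue}\n{context}")
    for _ in range(n_edit_iters):
        output = run_in_sandbox(test_script, edit)
        edit = call_llm(f"Revise the edit given this test output:\n{output}\n{edit}")

    return edit, test_script
```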
Performance and Cost Analysis:
- Coverage and Accuracy: Coverage, the fraction of issues for which at least one candidate edit resolves the issue, reaches 69.8% on SWE-bench Verified, showing the headroom that scaling test-time compute creates. After selection, CodeMonkeys achieves an overall resolution rate of 57.4%.
- Cost Efficiency: The full pipeline consumes roughly 2,300 USD of LLM inference across the benchmark, and the paper breaks this cost down by stage: context retrieval, test generation, edit generation, and candidate selection.
Selection Strategy:
- Voting and Model-Based Selection: CodeMonkeys uses a two-stage selection approach: model-generated tests first vote over the candidate edits, and a model-based selection step then chooses among the top-voted candidates (a sketch follows this list). This two-stage strategy effectively narrows down the candidates and substantially improves the quality of the final edit selection.
- Integration of Heterogeneous Sources: The selection method also works across candidates drawn from disparate sources; when selecting among an ensemble that includes edits from existing top SWE-bench submissions, it reaches a combined score of 66.2%, outperforming each individual submission.
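A hedged sketch of this two-stage selection appears below: candidates are first ranked by how many model-generated tests they pass, and a model-based step then chooses among the top-voted edits. The interfaces and prompt are assumptions for illustration, not the paper's code.

```python
# Hedged sketch of the two-stage selection: rank candidate edits by how many
# model-generated tests they pass, then let a model pick among the top-voted.
# `passes_test` and `call_llm` are assumed placeholder interfaces.

def passes_test(test_script: str, edit: str) -> bool:
    """Placeholder: apply `edit`, run `test_script`, report whether it passes."""
    return False

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call returning generated text."""
    return "0"

def select_edit(candidate_edits: list[str], generated_tests: list[str], top_k: int = 3) -> str:
    # Stage 1: vote with model-generated tests.
    votes = {edit: sum(passes_test(test, edit) for test in generated_tests)
             for edit in candidate_edits}
    top = sorted(candidate_edits, key=lambda e: votes[e], reverse=True)[:top_k]

    # Stage 2: model-based selection among the top-voted candidates.
    prompt = "Pick the best fix by index:\n" + "\n".join(
        f"[{i}] {edit}" for i, edit in enumerate(top))
    choice = call_llm(prompt)
    try:
        return top[int(choice.strip())]
    except (ValueError, IndexError):
        return top[0]  # fall back to the top-voted candidate
```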
Challenges and Future Directions:
The research highlights several areas for improvement:
- Contextual Comprehension: Improving the initial identification of relevant files and making better use of long context windows could improve solution accuracy.
- Edit Generation and Iteration: Richer execution feedback, beyond raw test results, could make edit iteration more effective.
- Selection Technique Enhancements: Richer model-based selection methods, including leveraging repository-specific tests and interactive feedback mechanisms, might further improve final scores.
Overall, the paper underscores the substantial benefits of scaling test-time compute for software engineering tasks, outlining a practical mechanism for managing and improving LLM capabilities in realistic software development settings. The results mark notable progress in computational strategies for systematic problem-solving in software engineering.