Training Software Engineering Agents and Verifiers with SWE-Gym (2412.21139v2)

Published 30 Dec 2024 in cs.SE and cs.CL

Abstract: We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train LLM based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.

Summary

  • The paper’s main contribution is the development of SWE-Gym, a platform with 2,438 Python-based real-world tasks and executable test frameworks.
  • The methodology combines supervised fine-tuning, trajectory sampling, and verifier training to raise task resolve rates by up to 19% absolute.
  • The study demonstrates that scaling training trajectories and inference compute effectively sets a new benchmark for open-weight software engineering agents.

Overview of "Training Software Engineering Agents and Verifiers with SWE-Gym"

The paper introduces SWE-Gym, an environment for training software engineering (SWE) agents on real-world tasks. It provides a structured platform both for training language model (LM)-based agents and for exploring how verifiers can improve these agents' performance in software engineering contexts. The framework addresses two limitations of existing LM-based SWE agents: their reliance on proprietary models and the lack of suitable training environments.

Key Contributions

The primary contribution of this research is the development of the SWE-Gym environment. It comprises 2,438 real-world Python-based task instances that include codebases with executable environments and defined unit tests. This setup allows for practical training and testing of SWE agents in a realistic setting, moving beyond isolated coding tasks toward comprehensive software engineering challenges.
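To make the shape of a task instance concrete, here is a minimal Python sketch. The `TaskInstance` class, its field names, and the `resolve_rate` helper are hypothetical illustrations, not SWE-Gym's actual schema. The only detail taken from the paper is that each instance pairs a codebase and executable environment with unit tests and a natural-language task.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Hypothetical container for one task; all field names are assumptions."""
    repo: str                # codebase the agent edits
    environment: str         # identifier of the executable runtime environment
    problem_statement: str   # task specified in natural language
    unit_tests: list[str] = field(default_factory=list)  # tests that decide success


def resolve_rate(outcomes: list[bool]) -> float:
    """Percentage of task instances whose unit tests pass after the agent's edit."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)
```

Under this definition, a gain "by up to 19%" in resolve rate means 19 percentage points, for example moving from 10 resolved tasks per 100 to 29.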

One of the hallmark experiments in the paper uses SWE-Gym to train SWE agents, improving resolve rates on the SWE-Bench Verified and Lite test sets by up to 19% absolute. Adding verifiers that assess agent-generated solutions yields further improvements, reaching 32.0% on SWE-Bench Verified and 26.0% on Lite, a new state of the art for open-weight SWE agents.

Technical Advancements

  • Training SWE Agents: The authors use SWE-Gym to train agents through supervised fine-tuning. They present a straightforward trajectory sampling technique; fine-tuning models like Qwen-2.5 on the sampled trajectories leads to a marked increase in task resolve rates.
  • Verifier Training: The paper also trains verifier models on agent trajectories gathered through SWE-Gym. At inference time, the agent generates multiple candidate trajectories and the verifier selects the most promising one, which yields significant gains on complex tasks.
  • Scalability: The research demonstrates that scaling trajectories in training and scaling compute at inference time consistently enhance performance. This scalability is pivotal in reaching improved resolve rates without apparent saturation, suggesting that further scaling could still yield performance gains.
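The verifier-based selection described above can be sketched as a best-of-n choice over candidate trajectories. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_best` and the toy scores are hypothetical, and the scoring function stands in for a trained verifier model.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_best(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate the verifier scores highest (best-of-n selection)."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(candidates, key=score)


# Toy usage: these scores are made up and stand in for a trained verifier.
toy_scores = {"patch_a": 0.12, "patch_b": 0.87, "patch_c": 0.45}
best = select_best(list(toy_scores), lambda name: toy_scores[name])
```

Sampling more candidates per task is one way to spend extra inference compute: a larger pool gives the verifier more chances to find a trajectory that resolves the task.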

Implications and Future Directions

The implications of this research are twofold. Practically, SWE-Gym provides a replicable and accessible platform for further development of robust SWE agents capable of addressing real-world tasks, benefiting industries reliant on scalable, automated code generation and maintenance. Theoretically, the versatility of SWE-Gym as a training benchmark supports broader studies into the interplay between SWE agents and verifiers, a promising direction for achieving more autonomous and intelligent systems.

Future developments could expand the SWE-Gym dataset and further optimize agent-verifier dynamics, enhancing performance scalability. The paper hints at employing more sophisticated policy optimization techniques like Proximal Policy Optimization (PPO) as a future step, suggesting the potential for even deeper integration of reinforcement learning methodologies.

Conclusion

In conclusion, the introduction of SWE-Gym represents a substantive contribution to the field of software engineering automation. It offers a robust foundation for further advancements in training SWE agents equipped with verifiers, underlining an opportunity to shift from proprietary model-based enhancements to more accessible, model-agnostic improvements. This work lays the groundwork for ongoing research into scalable agent training and inference methodologies, essential for the future of AI-driven software development.
