Skywork-SWE Model
- Skywork-SWE is a large language model (LLM) system and accompanying dataset for software engineering (SWE) tasks, trained via a novel, scalable pipeline built on real-world GitHub issues.
- Its fully automated data curation pipeline validates every task with runtime execution, and empirical studies on the resulting data reveal log-linear data scaling laws for SWE agent performance on benchmarks such as SWE-bench Verified.
- Released open-source, Skywork-SWE-32B achieves competitive performance on SWE-bench Verified and supplies models, data, and tooling for future SWE agent research and development.
Skywork-SWE refers to an LLM system and supporting dataset designed for software engineering (SWE) tasks, focusing on multi-turn, long-context problem-solving using real-world GitHub Python issues. The approach introduces an automated, scalable data-curation pipeline and demonstrates pronounced data scaling laws in LLM agent performance on complex SWE benchmarks, culminating in the open-source Skywork-SWE-32B model.
1. Scalable Data Curation Pipeline
Skywork-SWE addresses the central obstacle of SWE agent research: the creation of large-scale, diverse, and runtime-validated datasets. The pipeline is a fully automated, three-stage process:
- Data Collection and Pre-filtering:
- Metadata is harvested from over 151,000 open-source repositories, focusing on high-quality, actively maintained projects.
- Pull requests (PRs) that resolve issues—identified by keywords such as 'closes' or 'fixes'—and modify test-related files are extracted.
- Repositories are reverted to the relevant commit, and installation is attempted in a standardized base environment; only instances that install successfully proceed to the next stage.
- Execution-Based Validation and Environment Setup:
- A unified setup with Python 3.9 and essential packages is enforced, with test execution standardized via pytest.
- Three-layer Docker images are built per instance (base-level, environment-level, and instance-level), enabling reproducible, isolated execution.
- Each PR candidate undergoes "empty test" (test patch only) and "gold test" (test plus solution patch) validation; instances are retained only when the gold patch transitions at least one test from fail to pass, verifying an actual fix (see the validation sketch after this list).
- Agent Trajectory Generation:
- LLM agents built on strong commercial and open-source models (such as Gemini, Qwen, DeepSeek, and GPT-4.1), deployed in the OpenHands framework, iteratively solve the curated SWE tasks, generating multi-turn agent trajectories.
- Trajectories are validated via dockerized test reruns, and only those with all tests passing are selected for training.
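The core of the execution-based validation stage is the fail-to-pass check described above. The sketch below illustrates that logic; `run_tests` and `apply_patch` are hypothetical helpers (the pipeline's real interfaces are not specified here), assuming each instance runs inside its dedicated Docker container with pytest, per the standardized setup.

```python
import subprocess

def run_tests(container: str, test_ids: list[str]) -> dict[str, bool]:
    """Run pytest inside the instance container; map test id -> passed.

    Hypothetical helper: assumes the repository lives at /repo in a
    container built from the instance-level Docker image.
    """
    results = {}
    for test_id in test_ids:
        proc = subprocess.run(
            ["docker", "exec", "-w", "/repo", container, "pytest", test_id],
            capture_output=True,
        )
        results[test_id] = proc.returncode == 0
    return results

def is_valid_instance(container, test_ids, apply_patch, test_patch, gold_patch):
    """Retain an instance only if the gold patch flips >=1 test fail -> pass."""
    apply_patch(container, test_patch)   # "empty test" run: test patch only
    before = run_tests(container, test_ids)
    apply_patch(container, gold_patch)   # "gold test" run: test + solution
    after = run_tests(container, test_ids)
    # At least one test must transition from failing to passing.
    return any(not before[t] and after[t] for t in test_ids)
```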
This pipeline yields a dataset of 10,169 Python SWE task instances from 2,531 unique GitHub repositories, with each task comprising a natural language description, verified fail/pass test(s), and a dedicated runtime environment. The dataset’s temporal coverage (2013–2024) and long-tail distribution of repository sources ensure diversity and robustness.
Compared to prior datasets (SWE-bench: 2,294; SWE-bench Verified: 500), the scale and per-instance runtime validation substantially improve coverage and generalizability of SWE capabilities in LLMs.
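For concreteness, a single curated task instance can be pictured as a record like the following; the field names are hypothetical illustrations of the components just described (issue text, verified fail-to-pass tests, and a pinned runtime environment), not the dataset's actual schema.

```python
# Hypothetical shape of one curated SWE task instance (illustrative only).
instance = {
    "repo": "owner/project",                      # source GitHub repository
    "base_commit": "<commit-sha>",                # commit the repo is reverted to
    "problem_statement": "Fix crash when ...",    # natural-language issue text
    "test_patch": "diff --git a/tests/ ...",      # adds or updates the tests
    "gold_patch": "diff --git a/src/ ...",        # the verified solution
    "fail_to_pass": ["tests/test_io.py::test_roundtrip"],  # validated tests
    "docker_image": "instance-level-image:tag",   # dedicated runtime env
}
```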
2. Model Training and Scaling Law Analysis
The Skywork-SWE model is based on the Qwen2.5-Coder-32B-Instruct LLM and is fine-tuned in a supervised setting on over 8,000 filtered, high-quality multi-turn SWE agent trajectories. Training details include (a configuration sketch follows the list):
- Optimizer: AdamW (weight decay 0.01)
- Learning rate schedule: cosine decay, peak learning rate 5e-5
- Training hardware: 8 H800 GPUs across 3 epochs
- Context capability: Up to 32,768 tokens (32k context)
- Interaction limit: Up to 100 agent-environment rounds per instance (via OpenHands v0.32.0)
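A minimal sketch of such a configuration, expressed as Hugging Face TrainingArguments (an assumption for illustration; the paper's actual training stack is not reproduced here). Values mirror the list above; batch size and gradient accumulation are placeholders.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="skywork-swe-32b-sft",   # hypothetical output path
    num_train_epochs=3,                 # 3 epochs
    learning_rate=5e-5,                 # peak learning rate
    lr_scheduler_type="cosine",         # cosine decay schedule
    weight_decay=0.01,                  # AdamW weight decay
    optim="adamw_torch",                # AdamW optimizer
    bf16=True,                          # assumption: typical on H800-class GPUs
    per_device_train_batch_size=1,      # assumption: long 32k contexts
    gradient_accumulation_steps=8,      # assumption
    logging_steps=10,
)
```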
A critical finding is a robust data scaling law for SWE agent capabilities: model accuracy on SWE-bench Verified grows log-linearly with the size of the curated, high-quality training set. Empirical studies show no performance saturation at over 8,000 trajectories:

$$\text{Accuracy}(N) \approx \alpha + \beta \log N,$$

where $N$ denotes the number of validated training trajectories and $\alpha, \beta$ are fitted constants. The trend aligns with prior LLM scaling-law literature and confirms the necessity of continued, rigorous data scaling for SWE agent progress.
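To illustrate, the constants $\alpha$ and $\beta$ can be fitted by least squares on $\log N$; the data points below are placeholders, not the paper's measurements.

```python
import numpy as np

# Placeholder (N, accuracy) pairs -- NOT the paper's measurements.
n_trajectories = np.array([1000, 2000, 4000, 8000])
accuracy = np.array([0.28, 0.31, 0.35, 0.38])

# Log-linear fit: accuracy ~= alpha + beta * log(N).
beta, alpha = np.polyfit(np.log(n_trajectories), accuracy, deg=1)
print(f"alpha={alpha:.3f}, beta={beta:.3f}")

# Extrapolation (with the usual caveats) to a larger data scale.
print("predicted accuracy at N=16000:", alpha + beta * np.log(16000))
```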
3. Benchmarking and State-of-the-Art Results
Performance is evaluated on SWE-bench Verified—a 500-task, human-validated benchmark comprising real-world bug-fix issues from diverse Python repositories. The main metric is pass@1: the proportion of tasks solved by the model’s first (and only) attempt, requiring all repository tests to pass.
Key results achieved by Skywork-SWE-32B:
| Model | Params | Framework | pass@1 (%) |
|---|---|---|---|
| OpenHands + Qwen2.5-32B | 32B | OpenHands | 6.4 |
| SWE-Gym-32B | 32B | OpenHands | 20.6 |
| SWE-Dev-32B | 32B | OpenHands | 36.6 |
| SWE-smith-LM-32B | 32B | SWE-Agent | 40.2 |
| Skywork-SWE-32B (Ours) | 32B | OpenHands | 38.0 |
| Skywork-SWE-32B (+TTS) | 32B | OpenHands | 47.0 |
- Skywork-SWE-32B achieves 38.0% pass@1 on SWE-bench Verified, surpassing all previous Qwen2.5-Coder-32B-based models built on the OpenHands framework, without verifiers or multiple rollouts, and narrowing the gap with proprietary systems at this scale.
- Test-Time Scaling (TTS), which generates multiple independent agent rollouts per instance and reranks them with the OpenHands Critic, boosts pass@1 further to 47.0%, surpassing the prior state of the art for sub-32B-parameter models.
4. Technical Implementation Details
Agent Training Objective: Standard supervised learning minimizes the cross-entropy over token-wise decisions in multi-turn trajectories:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log \pi_\theta\!\left(a_t^{(i)} \mid c_t^{(i)}\right),$$

where $a_t^{(i)}$ is the agent's action at time $t$ in the $i$-th trajectory, conditioned on all prior context $c_t^{(i)}$.
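A compact PyTorch rendering of this loss, assuming tokenized trajectories in which non-action (environment/user) tokens are masked with the conventional -100 label; this illustrates the objective, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def trajectory_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over agent-action tokens in multi-turn trajectories.

    logits: (batch, seq_len, vocab) causal-LM outputs.
    labels: (batch, seq_len) token ids, with environment/user tokens set
            to -100 so only the agent's own actions are supervised.
    """
    # Shift so each position predicts the next token, as in causal LMs.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # skip masked (non-action) tokens
    )
```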
Evaluation Metric: pass@1 is the fraction of benchmark instances resolved on the model's single attempt, where an instance counts as resolved only if all of its tests pass:

$$\text{pass@1} = \frac{\#\{\text{resolved instances}\}}{\#\{\text{total instances}\}} \times 100\%.$$
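Over per-instance outcomes the metric reduces to a one-liner; the boolean-list input is an assumed format for illustration.

```python
def pass_at_1(resolved: list[bool]) -> float:
    """pass@1 as a percentage, given one boolean outcome per instance."""
    return 100.0 * sum(resolved) / len(resolved)

# e.g., 190 of 500 SWE-bench Verified instances resolved -> 38.0
print(pass_at_1([True] * 190 + [False] * 310))
```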
Test-Time Scaling: Performance is further improved at inference (a reranking sketch follows this list) by:
- Generating multiple independent agent rollouts per task.
- Selecting the highest-scored solution per task using the OpenHands Critic.
- Increasing the allowed number of agent-environment interaction rounds (up to 100), which raises resolve rates and underscores the importance of multi-turn, agentic problem-solving.
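A minimal sketch of the best-of-N reranking idea behind TTS; `run_agent` and `critic_score` are hypothetical callables (the OpenHands Critic's real interface is not reproduced here), and the rollout count is a placeholder.

```python
from typing import Callable

def test_time_scaling(
    task: str,
    run_agent: Callable[[str, int], str],       # hypothetical: (task, seed) -> patch
    critic_score: Callable[[str, str], float],  # hypothetical: (task, patch) -> score
    n_rollouts: int = 8,                        # placeholder rollout count
) -> str:
    """Best-of-N: generate independent rollouts, keep the critic's favorite."""
    candidates = [run_agent(task, seed) for seed in range(n_rollouts)]
    return max(candidates, key=lambda patch: critic_score(task, patch))
```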
5. Release and Impact
Skywork-SWE-32B and the curated dataset are released open-source with:
- Model weights and code
- The full dataset of 10,169 dockerized SWE tasks
- Recipes, setup, and environment scripts
The release provides a benchmark for open-source LLM research in software engineering, narrowing the gap between open and proprietary models. The data curation and agentic evaluation frameworks serve as best-practice blueprints for future efforts to scale datasets, handle complex environments, and evaluate multi-turn LLMs.
The infrastructure supports future directions including:
- Generalization to multiple programming languages (e.g., Multi-SWE-bench)
- Reinforcement learning in programmatic environments
- Handling longer contexts (e.g., 128k tokens)
- Community-wide best practices for scalable SWE agent development and evaluation
6. Significance and Research Directions
Skywork-SWE establishes a new reference point for the practical training and evaluation of LLM-based SWE agents. Its contributions include:
- Demonstrating that log-linear data scaling laws strongly influence SWE agent performance, underscoring the continued need for more and better-curated data.
- Defining robust per-task and per-instance runtime validation standards for dataset quality and reproducibility.
- Providing comprehensive open-source resources to spur innovation and rigorous benchmarking within the research community.
- Offering insights into agentic LLM evaluation, scaling, and dataset diversity that are likely to inform the next generation of software engineering automation systems.
These developments foster broader research collaboration and set the standard for rigorous, realistic benchmark construction in the field of autonomous software engineering.