
RepoForge: Autonomous SWE Agent Pipeline

Updated 2 June 2026
  • RepoForge is an autonomous, fully integrated pipeline that automates data generation, curation, fine-tuning, reinforcement learning, and evaluation for SWE agents.
  • It overcomes SWE bottlenecks by combining modular components like RepoForge Foundry, SPICE-based labeling, and Ray-powered evaluation to optimize storage, latency, and labeling cost.
  • The RepoForge-8B-Agent achieves SOTA performance on SWE benchmarks, demonstrating significant improvements over non-thinking models and ensuring scalable, efficient agent training.

RepoForge is an autonomous, end-to-end pipeline for generating, evaluating, and training software engineering (SWE) LLM agents at scale. It targets persistent bottlenecks in the field (high storage requirements, inefficient evaluation pipelines, scarce high-quality training data, and expensive manual labeling) and delivers a scalable methodology for producing state-of-the-art fast-thinking SWE agents at the ≤8B parameter scale. Its central artifact, the RepoForge-8B-Agent, achieves a new state-of-the-art (SOTA) result of 17.4% on SWE-Bench-Verified, outperforming prior non-thinking (single-step reasoning) models of similar scale, the strongest of which reaches 9.6% under the same harness (Chen et al., 3 Aug 2025).

1. System Architecture and Pipeline Overview

RepoForge comprises a modular, five-stage pipeline:

  1. Automated Data Generation (RepoForge Foundry): Autonomous extraction and validation of real-world SWE tasks from GitHub commits.
  2. Data Curation & Labeling (SPICE): Automated difficulty assessment and filtering using the Structured Problem Instance Classification Engine.
  3. Supervised Fine-Tuning (SFT): Initial policy synthesis on high-quality trajectories via cross-entropy loss.
  4. Reinforcement Learning (RL, RepoForge-OpenHands): Bubble-free, asynchronous RL training scaffold for policy refinement.
  5. Evaluation (Ray-powered Harness): Distributed, parallelized, and storage-efficient performance measurement across thousands of tasks.

Each stage features subcomponents addressing SWE-specific bottlenecks, such as containerized sandbox management, scalable data labeling, and distributed orchestration. The following table summarizes core pipeline components and optimizations:

| Stage | Key Technology/Technique | Bottleneck Addressed |
| --- | --- | --- |
| Data Generation | RepoForge Foundry, static analysis | Data scarcity, poor reproducibility |
| Curation & Labeling | SPICE automated scoring | Labeling cost, data quality |
| SFT | Cross-entropy + rejection sampling | Low-quality initial policy |
| RL | Bubble-free RL scaffold, async I/O | RL latency, pipeline stalling |
| Evaluation | Ray, Docker caching | Evaluation latency, storage |

2. Automated Data Generation and Environment Pruning

RepoForge Foundry autonomously ingests large volumes of GitHub commits flagged as 'fix' or 'bug', reconstructs build graphs via hybrid static analysis and heuristics, and emits validated Dockerfile+patch pairs. The process yields 7,304 fully executable environments representing real-world SWE fixes, each validated end-to-end with pre- and post-patch test suites. Executing the golden patch in each environment produces two test sets: “FAIL_TO_PASS” (tests that fail before the patch and pass after it) and “PASS_TO_PASS” (tests that pass both before and after).
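The split into the two test sets can be sketched as follows. This is a minimal illustration, not the Foundry implementation: the function name `split_test_sets`, the dict-based result format, and the choice to treat a test that is missing before the patch as failing are all assumptions.

```python
from typing import Dict, List, Tuple


def split_test_sets(
    before: Dict[str, bool], after: Dict[str, bool]
) -> Tuple[List[str], List[str]]:
    """Classify tests by their outcome before and after the golden patch.

    `before` and `after` map test names to True (passed) or False (failed).
    Assumption: a test absent from `before` counts as failing pre-patch.
    """
    fail_to_pass = sorted(
        name for name, ok in after.items() if ok and not before.get(name, False)
    )
    pass_to_pass = sorted(
        name for name, ok in after.items() if ok and before.get(name, False)
    )
    return fail_to_pass, pass_to_pass


# Hypothetical test names, for illustration only.
before = {"test_parse": False, "test_io": True, "test_flaky": True}
after = {"test_parse": True, "test_io": True, "test_flaky": False}
fail_to_pass, pass_to_pass = split_test_sets(before, after)
```

Tests that fail after the patch (here `test_flaky`) land in neither set, so they do not contribute to validation.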

Dependency Analysis and Pruning achieves a 14× reduction in per-instance storage (1.4GB original images to 102MB pruned), supporting 7,304 tasks with only 937 optimized images and reducing peak RL disk usage from ~1TB to ~70GB. The approach merges container images sharing overlapping dependencies, rebasing and stripping redundant layers while retaining only compilers, runtimes, and frameworks required for each instance. Minimal base images (∼50MB) are produced and further stripped of caches and debug symbols, yielding substantial storage efficiency.
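To make the image-sharing idea concrete, the sketch below groups tasks whose dependency sets are identical so that each group can share one image, and then reproduces the ratios quoted above. It is a simplification: the text describes merging images with overlapping dependencies, whereas this example only merges exact matches, and the task IDs and dependency names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List


def group_by_dependencies(
    task_deps: Dict[str, FrozenSet[str]]
) -> Dict[FrozenSet[str], List[str]]:
    """Map each distinct dependency set to the tasks that can share its image."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for task_id, deps in task_deps.items():
        groups[deps].append(task_id)
    return dict(groups)


# Hypothetical tasks and dependencies, for illustration only.
groups = group_by_dependencies(
    {
        "task-1": frozenset({"python3.10", "numpy"}),
        "task-2": frozenset({"numpy", "python3.10"}),
        "task-3": frozenset({"python3.8"}),
    }
)

# Ratios implied by the figures in the text (taking 1.4 GB as 1400 MB).
storage_reduction = 1400 / 102   # about 13.7, reported as 14x
tasks_per_image = 7304 / 937     # about 7.8 tasks per optimized image
```

In this toy input, three tasks collapse to two images; the reported figures correspond to roughly eight tasks per optimized image on average.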

3. SPICE-Based Difficulty Assessment and Curation

Structured Problem Instance Classification Engine (SPICE) evaluates each data instance across four normalized dimensions: complexity (cyclomatic via static analysis), structure (graph-based repo metrics), coverage (test instrumentation), and patterns (patch diff clustering). The aggregate instance difficulty is

$$D(i) = \sum_{k=1}^{4} w_k S_k(i), \quad \sum_{k=1}^{4} w_k = 1$$

with tasks retained for SFT if $D(i) \leq 0.25$.
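The weighted score and the retention rule can be illustrated with a short sketch. The equal weights and the example scores are assumptions chosen for illustration; the source does not state the actual weights.

```python
from typing import Dict

DIMENSIONS = ("complexity", "structure", "coverage", "patterns")


def difficulty(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Compute D(i) as the weighted sum of the four normalized scores."""
    if abs(sum(weights[k] for k in DIMENSIONS) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[k] * scores[k] for k in DIMENSIONS)


def keep_for_sft(
    scores: Dict[str, float], weights: Dict[str, float], threshold: float = 0.25
) -> bool:
    """Retain an instance for SFT when D(i) <= threshold."""
    return difficulty(scores, weights) <= threshold


# Assumed equal weights and made-up scores in [0, 1].
weights = {k: 0.25 for k in DIMENSIONS}
easy = {"complexity": 0.2, "structure": 0.1, "coverage": 0.4, "patterns": 0.1}
hard = {"complexity": 0.9, "structure": 0.6, "coverage": 0.5, "patterns": 0.8}

easy_score = difficulty(easy, weights)   # 0.25 * 0.8 = 0.2
hard_score = difficulty(hard, weights)   # 0.25 * 2.8 = 0.7
```

With these inputs the first instance falls under the 0.25 threshold and is kept, while the second is filtered out.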

SPICE provides an automated curation mechanism that removes the need for costly manual annotation, cutting labeling cost by a reported factor of 19,600 (manual: ~$15K per 1K labels; SPICE: ~$0.75 per 1K labels; these rounded figures imply a ratio of roughly 20,000). In evaluation against human annotators, SPICE agreement is 87.3% for clarity and 68.5% for coverage. From the 7,304 raw tasks, the pipeline distilled 1,202 SFT-quality instances for supervised training.

4. Distributed Evaluation Harness

RepoForge’s Ray-powered evaluation harness utilizes a master–worker actor paradigm, deploying up to 64 Ray actors for concurrent containerized execution. Each actor handles image pull, test execution, log streaming, and reward emission in a non-blocking manner (aiodocker, aiohttp), overlapping build, install, and test phases for throughput maximization. A centralized image cache on NFS/OSS further accelerates image builds (mean: 537s → 58s) and test evaluation (mean: 56s → 17s), resulting in >70% speedup and 96% of instances completed within 120 seconds.
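The concurrency pattern described above can be approximated with the standard library alone. The sketch below uses `asyncio` with a semaphore as a stand-in for Ray actors and aiodocker calls, so it shows only the bounded, non-blocking scheduling idea; the function names and the placeholder reward are assumptions, and no containers are actually started.

```python
import asyncio
from typing import Dict, List

MAX_CONCURRENCY = 64  # mirrors the "up to 64" actors mentioned in the text


async def evaluate_instance(
    instance_id: str, limiter: asyncio.Semaphore
) -> Dict[str, object]:
    """Evaluate one instance while holding one of the concurrency slots."""
    async with limiter:
        await asyncio.sleep(0)  # placeholder for the image pull
        await asyncio.sleep(0)  # placeholder for test execution and log streaming
        return {"instance": instance_id, "reward": 1}  # placeholder reward


async def evaluate_all(instance_ids: List[str]) -> List[Dict[str, object]]:
    """Run all evaluations concurrently, at most MAX_CONCURRENCY at a time."""
    limiter = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = (evaluate_instance(i, limiter) for i in instance_ids)
    return await asyncio.gather(*tasks)


results = asyncio.run(evaluate_all([f"task-{n}" for n in range(200)]))
```

Because each phase yields control while waiting, one instance's pull can overlap with another's test run, which is the effect the harness relies on for throughput.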

Collected evaluation metrics include:

  • Execution success rate (container uptime)
  • PASS_TO_PASS and FAIL_TO_PASS outcomes
  • Wall-clock latency
  • Resource utilization (CPU, disk I/O)
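A possible way to aggregate these metrics per run is sketched below. The record layout and field names are hypothetical, resource utilization is omitted for brevity, and the 120-second cutoff is taken from the completion figure quoted earlier.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRecord:
    container_ok: bool      # container started and stayed up
    fail_to_pass_ok: bool   # all FAIL_TO_PASS tests passed
    pass_to_pass_ok: bool   # all PASS_TO_PASS tests passed
    latency_s: float        # wall-clock latency in seconds


def summarize(records: List[EvalRecord]) -> Dict[str, float]:
    """Aggregate per-instance records into run-level metrics."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    resolved = sum(
        r.container_ok and r.fail_to_pass_ok and r.pass_to_pass_ok for r in records
    )
    return {
        "execution_success_rate": sum(r.container_ok for r in records) / n,
        "resolved_rate": resolved / n,
        "mean_latency_s": sum(r.latency_s for r in records) / n,
        "within_120s": sum(r.latency_s <= 120 for r in records) / n,
    }


# Made-up records, for illustration only.
summary = summarize(
    [
        EvalRecord(True, True, True, 17.0),
        EvalRecord(True, False, True, 40.0),
        EvalRecord(False, False, False, 130.0),
        EvalRecord(True, True, True, 13.0),
    ]
)
```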

5. Agent Training Methodology

RepoForge employs a multi-stage training regime:

Supervised Fine-Tuning (SFT)

SFT is performed on the SPICE-filtered 1,202 tasks, each with eight candidate patches. The loss is standard cross-entropy over tool-invocation tokens (including the 'finish' token), with rejection sampling to retain only candidates passing both FAIL_TO_PASS and PASS_TO_PASS test suites. Training runs for a single epoch with batch size 4, learning rate $1 \times 10^{-5}$, and a maximum sequence length of 2048 tokens.
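The rejection-sampling step amounts to a filter over candidate trajectories. The sketch below assumes a simple record type with hypothetical field names; it is not the authors' data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    task_id: str
    trajectory: str          # serialized tool-call trajectory
    fail_to_pass_ok: bool    # all FAIL_TO_PASS tests passed
    pass_to_pass_ok: bool    # all PASS_TO_PASS tests passed


def rejection_sample(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that pass both test suites."""
    return [c for c in candidates if c.fail_to_pass_ok and c.pass_to_pass_ok]


# Made-up candidates, for illustration only.
pool = [
    Candidate("task-1", "traj-a", True, True),
    Candidate("task-1", "traj-b", True, False),   # breaks existing behavior
    Candidate("task-1", "traj-c", False, True),   # does not fix the bug
]
accepted = rejection_sample(pool)
```

Only trajectories that both fix the target failure and preserve previously passing tests would reach the SFT set.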

Reinforcement Learning (RL)

The RL phase leverages the RepoForge-OpenHands scaffold, an asynchronous, bubble-free architecture that eliminates rollout pipeline idle time via producer-consumer queues and fully async I/O. RL uses PPO-style updates with KL-penalization (temperature 0.5, λ = 1e-3, trajectory max length 35), operating over 160 SFT-solved seeds for 40 RL steps × 2 epochs.
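The producer-consumer idea behind the bubble-free scaffold can be sketched with `asyncio.Queue`. This is an illustrative pattern only: the worker count, batch size, and string placeholders for trajectories are assumptions, and the actual scaffold's interfaces are not described in the source.

```python
import asyncio
from typing import List, Optional


async def rollout_worker(seed_queue: asyncio.Queue, trajectory_queue: asyncio.Queue) -> None:
    """Producer: turn seeds into trajectories until a None sentinel arrives."""
    while True:
        seed: Optional[str] = await seed_queue.get()
        if seed is None:
            break
        await asyncio.sleep(0)  # placeholder for an environment rollout
        await trajectory_queue.put(f"trajectory-for-{seed}")


async def collect_batches(
    seeds: List[str], num_workers: int = 4, batch_size: int = 8
) -> List[List[str]]:
    """Consumer: form training batches as soon as trajectories are available."""
    seed_queue: asyncio.Queue = asyncio.Queue()
    trajectory_queue: asyncio.Queue = asyncio.Queue()
    for seed in seeds:
        seed_queue.put_nowait(seed)
    for _ in range(num_workers):
        seed_queue.put_nowait(None)  # one stop sentinel per worker
    workers = [
        asyncio.create_task(rollout_worker(seed_queue, trajectory_queue))
        for _ in range(num_workers)
    ]
    batches: List[List[str]] = []
    batch: List[str] = []
    for _ in range(len(seeds)):
        batch.append(await trajectory_queue.get())
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    await asyncio.gather(*workers)
    return batches


batches = asyncio.run(collect_batches([f"seed-{i}" for i in range(160)]))
```

The consumer never waits for all rollouts to finish before forming a batch, which is the property that removes idle "bubbles" between rollout and update phases.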

Rollouts invoke three tools: execute_bash(cmd), str_replace_editor(file, old, new), and finish(). Reward is sparse, with $r(\tau) = 1$ only if all tests pass at the end of the trajectory and $0$ otherwise. RL is warm-started with the SFT policy, freezing the tokenizer and base layers initially and then progressively unfreezing for adaptation.
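The sparse reward can be written as a small function over final test outcomes. The dict-based input and the choice to return 0 when no test results are available are assumptions made for this sketch.

```python
from typing import Dict

# Tool names as listed in the text.
TOOLS = ("execute_bash", "str_replace_editor", "finish")


def sparse_reward(final_test_results: Dict[str, bool]) -> int:
    """Return 1 only if every test passed at the end of the trajectory.

    Assumption: an empty result set (for example, a crashed container)
    yields 0 rather than a vacuous 1.
    """
    if not final_test_results:
        return 0
    return int(all(final_test_results.values()))


# Hypothetical test names, for illustration only.
reward_solved = sparse_reward({"test_parse": True, "test_io": True})
reward_partial = sparse_reward({"test_parse": True, "test_io": False})
reward_empty = sparse_reward({})
```

Because partial progress earns nothing under this scheme, a policy that rarely solves tasks receives almost no learning signal, which is consistent with the text's emphasis on warm-starting RL from the SFT policy.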

6. Experimental Results and Comparative Analysis

RepoForge-8B-Agent (SFT+RL) achieves 17.4% accuracy using the RepoForge OpenHands harness and 16.4% on the official OpenHands harness for SWE-Bench-Verified evaluation. An SFT-only baseline achieves 12.7% and 10.5%, respectively. Compared to prior ≤8B, non-thinking SOTA agents:

| Agent | RepoForge OpenHands harness (%) | Official OpenHands harness (%) |
| --- | --- | --- |
| RepoForge-8B | 17.4 | 16.4 |
| R2EGym-7B | 9.6 | 10.2 |
| Qwen3-8B | 5.0 | 4.2 |
| Seed-Coder-8B | 9.2 | 4.6 |

Ablation studies show a +4.7 percentage point gain from RL over SFT-only, and a +1.8 percentage point improvement due to harness optimization. Increased SFT data size led to overfitting and reduced generalization accuracy.

RepoForge effectively addresses key bottlenecks: storage usage (1.4GB→0.102GB per image), evaluation latency (56s→17s), labeling cost ($15,000→$0.75 per 1K labels), pipeline throughput (3× increase via bubble elimination), and data scarcity (500 manual→7,304 auto-generated tasks).

7. Key Insights, Limitations, and Future Directions

RepoForge demonstrates that integrated, end-to-end infrastructure and data curation are critical factors in achieving SOTA with sub-8B LLMs on realistic SWE benchmarks. SPICE-based task selection yields a fourfold SFT gain by prioritizing data quality over quantity. SFT warm-start is essential in sparse-reward RL contexts. Model size imposes overfitting concerns requiring conservative dataset sizing, and performance is highly sensitive to infrastructure-level choices.

Planned extensions to RepoForge include multi-language task support (Java, C++, Rust), integration into continuous learning loops (deploy → collect fixes → retrain), hierarchical RL for long-horizon code planning, and embedding into production CI/CD workflows for developer assistance. The system paves the way for modular, low-cost, and scalable SWE agent training pipelines suitable for industrial application (Chen et al., 3 Aug 2025).
