RepoForge-8B-Agent: Scalable SWE Automation

Updated 6 August 2025
  • RepoForge-8B-Agent is an 8B-parameter non-thinking language model designed for automated code repair and patching, achieving 17.4% resolution on SWE-Bench-Verified.
  • It employs an end-to-end pipeline that integrates autonomous data curation, automated SPICE labeling, and containerized evaluation to ensure reproducible and cost-effective performance.
  • By leveraging reinforcement learning with a distributed Ray-powered harness, the system delivers significant speedups and reduces storage overhead while maintaining high evaluation throughput.

RepoForge-8B-Agent is an open-source, 8-billion-parameter software engineering (SWE) agent that defines a new state of the art for ≤8B non-thinking LLMs on repository-level automation benchmarks. Developed as the flagship outcome of the RepoForge project, RepoForge-8B-Agent is the result of a deeply optimized, end-to-end pipeline combining scalable data curation, automated label generation, distributed containerized evaluation infrastructure, and reinforcement learning (RL) at scale. It is specifically designed for code repair, automated patching, and general software engineering tasks with a focus on efficiency, reproducibility, and rapidly deployable frameworks, achieving 17.4% on SWE-Bench-Verified with substantial advances in infrastructure cost reduction and evaluation throughput (Chen et al., 3 Aug 2025).

1. Overview and Performance

RepoForge-8B-Agent is trained as an 8B-parameter "non-thinking" LLM – that is, it does not leverage the explicit chain-of-thought, scratchpad, or multi-turn reasoning workflows typical of larger "thinking" models. Despite this, it achieves 17.4% resolution on SWE-Bench-Verified when evaluated under the RepoForge-OpenHands evaluation harness, and 16.4% under the official OpenHands framework. Within its model class (≤8B parameters, non-thinking), this establishes new state-of-the-art performance, competing with or surpassing much larger architectures under comparable training and evaluation regimes.

Model                  | SWE-Bench-Verified (%) | Model Size (Parameters)
RepoForge-8B-Agent     | 17.4                   | 8B
Prior 8B non-thinking  | <17.4                  | 8B
Typical large models   | >17.4                  | >30B

This demonstrates that small models, if appropriately trained and deployed, can "punch above their weight" in demanding repository-level automation tasks.

2. Autonomous Data Curation and Environment Generation

A core innovation of RepoForge-8B-Agent is its integration within the RepoForge Foundry's end-to-end, fully autonomous data curation pipeline. The system ingests real-world GitHub repositories by extracting authentic code and commit data, inferring dependency graphs, and synthesizing executable Docker environments entirely without manual intervention. The pipeline employs a multi-agent ReAct framework (editor's term) to:

  • Scan repositories for code and contextual clues.
  • Build and validate Dockerized test environments, passing both FAIL_TO_PASS (bug → fix) and PASS_TO_PASS (consistency) tests.
  • Automatically merge, deduplicate, and strip runtime images of unnecessary components.

This results in a repository of 7,304 validated, executable environments for training and evaluation, where each instance mirrors a real, practical code patching task. The pipeline reduces the risk of overfitting to contrived data and ensures that performance metrics reflect actual SWE challenges.
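The FAIL_TO_PASS / PASS_TO_PASS gate described above can be sketched as follows. The paper does not publish the Foundry's internal interfaces, so the result dictionaries and the helper name here are illustrative assumptions, not the real pipeline API:

```python
def validate_instance(before: dict, after: dict,
                      fail_to_pass: list, pass_to_pass: list) -> bool:
    """Decide whether a candidate environment is a valid (bug -> fix) instance.

    `before`/`after` map test ids to pass (True) / fail (False) results
    collected in the buggy and patched container states, respectively.
    All names are hypothetical; they mirror the FAIL_TO_PASS /
    PASS_TO_PASS criteria described in the text.
    """
    return (
        all(not before[t] for t in fail_to_pass)  # bug reproduces pre-patch
        and all(after[t] for t in fail_to_pass)   # patch resolves it
        and all(before[t] for t in pass_to_pass)  # no pre-existing breakage
        and all(after[t] for t in pass_to_pass)   # patch causes no regressions
    )
```

An instance is admitted only when the designated tests flip from failing to passing while the regression tests stay green in both states.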

3. Storage, Evaluation, and Distributed Harness

RepoForge directly addresses the storage and evaluation infeasibility that plagues container-based code agent benchmarks. Through "Image Dependency Pruning" and "Minimal Runtime Environments," shared dependencies are automatically merged and unused layers stripped, reducing average Docker image size from 1.4 GB to 102 MB—a 14× reduction. As formalized:

\mathrm{Image\_Size\_Optimized} = \frac{1.4~\mathrm{GB}}{14} \approx 102~\mathrm{MB}
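The saving comes from storing shared dependency layers once rather than per image. A minimal, stdlib-only sketch of that accounting (layer names and sizes are invented for illustration; real pruning operates on Docker's content-addressed layers):

```python
def naive_storage(images: dict[str, set[str]], layer_mb: dict[str, int]) -> int:
    """Total storage (MB) if every image keeps private copies of its layers."""
    return sum(layer_mb[layer] for layers in images.values() for layer in layers)

def deduplicated_storage(images: dict[str, set[str]], layer_mb: dict[str, int]) -> int:
    """Total storage (MB) when shared layers are merged and stored once."""
    shared = set().union(*images.values())
    return sum(layer_mb[layer] for layer in shared)

# Two hypothetical task images sharing a base OS layer and a Python layer.
images = {"task_a": {"base", "python", "repo_a"},
          "task_b": {"base", "python", "repo_b"}}
layer_mb = {"base": 500, "python": 300, "repo_a": 100, "repo_b": 100}
```

With thousands of environments sharing the same base layers, per-image duplication dominates the naive total, which is why merging plus stripping unused layers yields an order-of-magnitude reduction.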

Evaluation throughput is amplified by a Ray-powered distributed harness, employing asynchronous parallel execution through Ray actors and aiodocker. Streaming-style builds and fully asynchronous, non-blocking I/O enable more than a 70% increase in evaluation speed, with average evaluation latency reduced from several minutes to as little as 75 seconds. RL rollouts further gain up to 3× throughput, with up to 80–100 containers in parallel per host.
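The paper's harness uses Ray actors with aiodocker; the bounded-concurrency pattern it relies on can be illustrated with the standard library alone. The container launch is stubbed out, and `max_containers` mirrors the 80–100 per-host figure:

```python
import asyncio

async def evaluate_instance(instance_id: int, sem: asyncio.Semaphore):
    """Evaluate one benchmark instance under a per-host concurrency cap.

    Stand-in for the real harness: instead of launching a container via
    aiodocker and awaiting its test run, we simply yield to the event loop.
    """
    async with sem:
        await asyncio.sleep(0)  # non-blocking wait replaces container I/O
        return instance_id, "evaluated"

async def run_harness(instance_ids, max_containers: int = 100):
    """Fan out evaluations concurrently, never exceeding the container cap."""
    sem = asyncio.Semaphore(max_containers)
    return await asyncio.gather(
        *(evaluate_instance(i, sem) for i in instance_ids)
    )

results = asyncio.run(run_harness(range(8)))
```

Because all waiting is non-blocking, a host stays saturated with work even while individual containers build or run tests, which is the source of the throughput gain described above.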

4. Automated Labeling with SPICE

Manual labeling of software engineering environments is prohibitively expensive and inconsistent at scale. RepoForge integrates SPICE (Structured Problem Instance Classification Engine), an automated labeling technique that quantifies task difficulty along four axes: code complexity, repository structure, test coverage, and solution archetype. SPICE labeling achieves rapid, objective task scoring at a cost approximately 19,000× lower than traditional manual annotation.

Only instances below threshold scores (e.g., all criteria ≤1) are admitted to the active training corpus, and the entire labeling process is fully integrated into both supervised fine-tuning (SFT) and RL pipelines. This ensures scalable, reproducible benchmark creation and shields modeling results from subjective or noisy labeling artifacts.
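The admission rule can be sketched as below. The four axis names come from the text; the integer scale, the `threshold` default, and the class name are assumptions for illustration:

```python
from dataclasses import dataclass, astuple

@dataclass
class SpiceScore:
    """Difficulty scores along SPICE's four axes (lower = easier).
    The 0..N integer scale is an assumption, not the published rubric."""
    code_complexity: int
    repo_structure: int
    test_coverage: int
    solution_archetype: int

def admit(score: SpiceScore, threshold: int = 1) -> bool:
    """Admit an instance into the active training corpus only if every
    criterion is at or below the threshold (here: all <= 1)."""
    return all(axis <= threshold for axis in astuple(score))

# Filter a candidate pool down to the admitted corpus.
corpus = [s for s in [SpiceScore(1, 0, 1, 1), SpiceScore(2, 0, 0, 0)]
          if admit(s)]
```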

5. Reinforcement Learning and OpenHands Integration

The RepoForge-8B-Agent leverages an RL training regime built on the modified RepoForge-OpenHands framework. The RL infrastructure introduces several technical improvements:

  • Docker exec integration: Python tools execute directly in sandbox containers, eliminating the need for a persistent server daemon, reducing image size and build time by up to 80%, and delivering a 5× speedup.
  • Ray-managed remote sandboxes: Support for up to 32 concurrent agents per host, with dynamic load balancing and asynchronous I/O via aiodocker and aiohttp.
  • Dynamic producer-consumer scheduling minimizes long-tail job delays, resulting in a 1.4× boost in batch throughput.

Together, these innovations confer a 3× end-to-end speedup for multi-turn RL experimentation, enabling broader hyperparameter sweeps and more reliable convergence on challenging SWE benchmarks.
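The dynamic producer-consumer scheduling above can be sketched with an asyncio work queue: idle workers pull the next rollout as soon as they finish, so one slow job delays only its own worker rather than an entire static batch. The worker count and the sleep stub are illustrative, not the paper's configuration:

```python
import asyncio

async def rollout_worker(queue: asyncio.Queue, done: list):
    """Consume rollout jobs from the shared queue until a poison pill arrives."""
    while True:
        job = await queue.get()
        if job is None:          # poison pill: shut this worker down
            queue.task_done()
            return
        await asyncio.sleep(0)   # stand-in for a sandboxed agent rollout
        done.append(job)
        queue.task_done()

async def schedule_rollouts(jobs, n_workers: int = 4):
    """Producer-consumer scheduling: workers drain the queue dynamically."""
    queue: asyncio.Queue = asyncio.Queue()
    done: list = []
    workers = [asyncio.create_task(rollout_worker(queue, done))
               for _ in range(n_workers)]
    for job in jobs:
        queue.put_nowait(job)
    for _ in workers:
        queue.put_nowait(None)   # one poison pill per worker
    await queue.join()           # wait until every queued item is processed
    for w in workers:
        await w
    return done

completed = asyncio.run(schedule_rollouts(list(range(10))))
```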

6. Systemic Contributions and Impact

RepoForge-8B-Agent exemplifies a holistic approach to scaling SWE LLMs:

  • Addressing the high storage cost of containerized evaluation by intelligent dependency pruning while maintaining strict test reproducibility.
  • Overcoming bottlenecks in reward computation and evaluation via concurrent, asynchronous harnesses, facilitating batch RL at realistic throughputs.
  • Mitigating the scarcity and heterogeneity of training resources by automated, SPICE-driven difficulty assessment and instance filtering on real code artifacts.
  • Demonstrating, through a 17.4% SWE-Bench-Verified resolve rate, that even non-thinking 8B models are capable of state-of-the-art performance when synergistically combined with data and infrastructure innovations.

A plausible implication is that further scaling of this paradigm—particularly if extended to thinking models (with explicit chain-of-thought or scratchpad mechanisms)—may shift the performance-efficiency frontier for automated software engineering agents.

7. Future Directions and Open Research Questions

RepoForge-8B-Agent sets a template for reproducible, scalable, and infrastructure-efficient SWE agent development. Open directions include:

  • Generalization to other programming languages, repository structures, or more heterogeneous CI environments.
  • Integration with advanced multi-agent collaboration, security modules (as proposed in BlockA2A (Zou et al., 2 Aug 2025)), or distributed MAS frameworks for dynamic code repair at organizational scale.
  • Extending SPICE-based automated curriculum design, enabling progressive task selection and difficulty adaptation for RL agents in non-stationary or adversarial code landscapes.

By introducing a tightly integrated pipeline from data collection to labeling, bundling, RL, and evaluation, RepoForge-8B-Agent provides a reproducible and scalable foundation for the next generation of autonomous software engineering LLMs.