
Mini-SWE Agent: Lightweight Code Repair Agent

Updated 16 December 2025
  • Mini-SWE Agent is a lightweight, language-model-based autonomous agent that performs interactive edit–test–fix cycles using small-parameter models.
  • The architecture employs a sandboxed Docker environment and an LLM with extended context to streamline rapid command execution and feedback loops.
  • Performance benchmarks highlight high energy efficiency and scalability, while extensions enable self-reflection and dynamic tool augmentation for improved outcomes.

A Mini-SWE Agent is a lightweight, language-model-based autonomous software engineering agent designed for end-to-end code reasoning, tool invocation, and issue resolution at small or modest parameter counts. These agents, typically parameterized at 1–8B scale, instantiate the core agentic SWE paradigm—interactive edit–test–fix cycles—while minimizing orchestration and hardware requirements. Mini-SWE Agent designs are now widely used as reference baselines in software-automated issue resolution, embodied task solving, and resource-constrained deployment research. Their simplicity, transparency, and extensibility make them a locus for benchmark-driven method development and empirical comparisons across frameworks and hardware (Tripathy et al., 10 Dec 2025, Xia et al., 17 Nov 2025, Wang et al., 9 Jun 2025, Pan et al., 30 Dec 2024, Chen et al., 3 Aug 2025, Yang et al., 27 Sep 2025, Boulet et al., 24 Oct 2025).

1. Architectural Simplicity and Operational Loop

Mini-SWE Agents are defined by a minimalist execution flow centered on shell-level primitives and rapid LLM–environment feedback. The canonical architecture consists of:

  • A sandboxed Docker container exposing only a bash REPL or minimal subset of Unix tools (e.g., bash, grep, sed).
  • A small language model (SLM; e.g., Gemma-3 4B or Qwen-3 1.7B) loaded with an extended context window (often ≥32k tokens).
  • A stateless orchestration loop in which each LLM completion produces a bash command, executed stepwise, with stdout/returncode appended to the conversation context and fed back to the model.

The agent's main control sequence can be expressed as follows (a minimal Python sketch appears after the list):

  1. Initialize with the issue description and file listing.
  2. On each turn, prompt the SLM for "thought" plus bash command.
  3. Execute command in the sandbox, capture output/returncode.
  4. Append the results to the conversation.
  5. Iterate (ReAct-style) until either all tests pass, a token budget is exhausted, or a wall-clock timeout occurs.
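The control loop above can be sketched compactly. The snippet below is an illustrative reconstruction, not the published ~100-line implementation; `query_model` (an SLM completion call) and the fenced-bash reply format are assumed placeholders, and the pytest-based stop condition is likewise an assumption.

```python
import re
import subprocess

MAX_TURNS = 50  # illustrative budget; real runs also enforce token and wall-clock limits


def extract_bash_command(reply: str) -> str:
    """Take the last ```bash fenced block from the model reply (assumed output format)."""
    blocks = re.findall(r"```bash\n(.*?)```", reply, flags=re.S)
    return blocks[-1].strip() if blocks else reply.strip()


def run_in_sandbox(command: str, timeout: int = 120) -> tuple[str, int]:
    """Execute one bash command and return (combined output, return code).

    Stands in for an exec call against the sandboxed Docker container.
    """
    proc = subprocess.run(["bash", "-lc", command],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr, proc.returncode


def solve(issue: str, file_listing: str, query_model) -> list[dict]:
    # 1. Initialize with the issue description and file listing.
    messages = [{"role": "user",
                 "content": f"Issue:\n{issue}\n\nFiles:\n{file_listing}"}]
    for _ in range(MAX_TURNS):
        # 2. Prompt the SLM for a "thought" plus a bash command.
        reply = query_model(messages)
        command = extract_bash_command(reply)
        # 3. Execute the command in the sandbox, capture output and return code.
        output, code = run_in_sandbox(command)
        # 4. Append the results to the conversation and iterate (ReAct-style).
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"returncode={code}\n{output}"})
        # 5. Stop once the test command succeeds (assumed termination signal).
        if code == 0 and command.strip().startswith("pytest"):
            return messages
    return messages
```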

A distinguishing feature is the absence of hard-coded tool lists, external planning modules, or verification scaffolds: all logic is internalized by the SLM’s own completions. In (Xia et al., 17 Nov 2025), the Mini-SWE-Agent is implemented in approximately 100 lines of Python; agent extensions, such as Live-SWE-agent, add prompt-level “reflection” and tool-creation blocks without architectural elaboration.

2. Training Pipelines and Data Strategies

Mini-SWE Agent efficacy depends critically on carefully curated and filtered training data, precise trajectory sampling, and robust fine-tuning regimes:

  • Test-Case Synthesis: Training begins with conversion of real GitHub issue–patch pairs into “fail-to-pass” test cases: issues are extracted from repositories (≥5 stars, ≥3 PRs), described in minimal Gherkin syntax via LLM prompting, converted to Pytest unit tests by code-specialist LLMs, and filtered to ensure failing on unpatched code and passing on patched code (Wang et al., 9 Jun 2025).
  • Trajectory Rollout: Starting from hand-picked (or seed) successful trajectories (prompt→action→patch), mass trajectory generation is performed by prompting a base LLM agent at sampling temperature T≈0.7, top-k=50, nucleus p=0.9, to yield a diverse bank of trajectories per instance.
  • Critical Filtering: Subsampled trajectories are subject to success/failure-based acceptance, early-stopping filters (≤30 steps), and LLM-based patch critics to retain only those matching or closely resembling the golden patch.
  • Fine-Tuning: Supervised loss is computed over successful (and possibly some penalized unsuccessful) trajectories. For instance, Qwen2.5-Coder-7B is fine-tuned for 4 epochs at learning rate 1e-5, batch size 32, with linear warmup and decay, weight decay 0.1, and the AdamW optimizer (Wang et al., 9 Jun 2025); a configuration sketch follows this list.
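For concreteness, the rollout sampling settings and fine-tuning hyperparameters above map onto standard Hugging Face configuration objects roughly as follows; `max_new_tokens`, the warmup fraction, and the single-device batch layout are assumptions not stated in the source.

```python
from transformers import GenerationConfig, TrainingArguments

# Diverse trajectory rollout: sampling settings from the rollout step above.
rollout_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    max_new_tokens=1024,   # illustrative cap, not from the source
)

# Supervised fine-tuning of Qwen2.5-Coder-7B on accepted trajectories.
sft_args = TrainingArguments(
    output_dir="mini-swe-agent-sft",
    num_train_epochs=4,
    learning_rate=1e-5,
    per_device_train_batch_size=32,   # effective batch size 32 (single device assumed)
    weight_decay=0.1,
    optim="adamw_torch",              # AdamW
    lr_scheduler_type="linear",       # linear decay after warmup
    warmup_ratio=0.03,                # warmup fraction is an assumption
)
```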

Alternative training paradigms leverage RL (as in RepoForge’s “bubble-free” RL pipeline), curriculum learning ordered by task difficulty, and trajectory “mix-up” for synthetic training example interpolation (Chen et al., 3 Aug 2025).

3. Evaluation, Benchmarks, and Empirical Results

Evaluation of Mini-SWE Agents is standardized on public, human-validated task suites such as SWE-bench Verified (Wang et al., 9 Jun 2025), SWE-Gym (Pan et al., 30 Dec 2024), and domain-adapted sets for embodied controllers (Boulet et al., 24 Oct 2025). Key metrics include:

| Model/Framework | Parameter Count | Benchmark | Resolve Rate / pass@1 (%) | Source |
|---|---|---|---|---|
| Qwen2.5-Coder-7B | 7B | SWE-bench Verified | 23.4 | (Wang et al., 9 Jun 2025) |
| RepoForge-8B-Agent | 8B | SWE-bench Verified | 17.4 | (Chen et al., 3 Aug 2025) |
| Mini-SWE-Agent (Gemma-3) | 4B | SWE-bench Verified Mini | 0.0 | (Tripathy et al., 10 Dec 2025) |
| Mini-SWE-Agent (Qwen-3) | 1.7B | SWE-bench Verified Mini | 0.0 | (Tripathy et al., 10 Dec 2025) |
| GPT-5-Mini (mini-SWE) | — | SWE-bench Verified | 59.8 | (Xia et al., 17 Nov 2025) |
| Kimi-Dev Agentic SFT | 7B | SWE-bench Verified | 48.6 (pass@1) | (Yang et al., 27 Sep 2025) |

Statistics such as bootstrap confidence intervals and error bars are commonly reported. For example, Qwen2.5-Coder-7B achieves a 95% CI of [21.2%, 25.6%] for the resolve rate (Wang et al., 9 Jun 2025).
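Such intervals follow from a standard percentile bootstrap over per-instance pass/fail outcomes. The sketch below uses synthetic outcomes (117 of 500 tasks resolved, i.e., 23.4%) purely for illustration; the reported interval in the paper reflects its own instance count and resampling setup.

```python
import numpy as np

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the resolve rate over per-instance 0/1 outcomes."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    idx = rng.integers(0, len(outcomes), size=(n_boot, len(outcomes)))
    rates = outcomes[idx].mean(axis=1)          # resample instances with replacement
    lo, hi = np.percentile(rates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return outcomes.mean(), lo, hi

# Synthetic example: 500 tasks, 117 resolved (23.4% observed resolve rate).
outcomes = np.zeros(500)
outcomes[:117] = 1
print(bootstrap_ci(outcomes))   # interval width depends on the number of instances
```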

Agent design choices—such as trajectory diversity, success-based trajectory filtering, tester/patch verifiers, and curriculum strategies—significantly affect the pass rates. Larger models and increased trajectory sample sizes yield higher rates, but storage and compute overheads scale rapidly (Wang et al., 9 Jun 2025, Chen et al., 3 Aug 2025).

4. Energy, Resource, and Efficiency Considerations

Energy usage and resource efficiency of Mini-SWE Agents have been benchmarked against more complex frameworks (Tripathy et al., 10 Dec 2025):

  • Empirically, Mini-SWE Agents exhibit the lowest absolute energy consumption among SLM-powered frameworks; e.g., mean energy per run is 23.41 kJ (Gemma-3 4B) and 54.13 kJ (Qwen-3 1.7B), with corresponding run times of 3.20 min and 8.54 min.
  • However, these numbers are skewed by high rates of early failure, unproductive loops, or oversize prompt emissions—operations that terminate quickly but yield no successful patch.
  • Without loop detection, context filtering, or basic verification, repeated failed reasoning steps accumulate until budget or timeout is reached.
  • The main efficiency bottleneck for Mini-SWE Agent designs is the SLM's intrinsic reasoning capacity, not orchestration overhead. This suggests the need for lightweight loop-breaking protocols and context slicing for genuine progress in energy-constrained environments.

Best practices include adding dynamic loop-breaking (e.g., by tracking repeated command sequences), pre-execution syntax or diff checkers, and context selection strategies to restrict prompt growth (Tripathy et al., 10 Dec 2025).
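A minimal form of the loop-breaking recommendation is a sliding-window repeat detector over issued commands; the window size and repeat threshold below are illustrative assumptions, not values from the cited work.

```python
from collections import deque

class LoopBreaker:
    """Abort a run when the same command keeps reappearing in a short window."""

    def __init__(self, window: int = 6, max_repeats: int = 3):
        self.history = deque(maxlen=window)
        self.max_repeats = max_repeats

    def should_stop(self, command: str) -> bool:
        cmd = command.strip()
        self.history.append(cmd)
        # Stop when the current command has appeared max_repeats times within
        # the sliding window, a cheap proxy for an unproductive loop.
        return self.history.count(cmd) >= self.max_repeats

breaker = LoopBreaker()
for cmd in ["pytest", "sed -n '1,40p' setup.py", "pytest", "pytest"]:
    if breaker.should_stop(cmd):
        print("loop detected, aborting run")
        break
```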

5. Extensions and Self-Evolving Variants

Mini-SWE Agent scaffolds serve as the basis for advanced agentic frameworks:

  • Live-SWE-Agent extends Mini-SWE Agent by allowing dynamic, on-the-fly augmentation of available tools during a run. The model is informed that it may create first-class Python tools, which are written to disk and invoked by subsequent shell commands; a minimal illustration follows this list. Interleaved “reflection” prompts after every step encourage self-inspection (tool invocation, trajectory pruning).
  • This yields an absolute boost of 3–5 percentage points in solve rate across benchmarks, achieving 75.4% on SWE-bench Verified and 45.8% on SWE-Bench Pro with Claude 4.5, outperforming static scaffolds and matching or surpassing leading proprietary models (Xia et al., 17 Nov 2025).
  • Self-evolving approaches do not rely on external reward functions or performance evaluations during offline training: the only reward is test script status and the agent’s own trajectory reflection.
  • Empirical evidence demonstrates that such scaffold extensibility enables zero-offline-cost performance gains and improved tool discovery without requiring large-scale retraining.
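A stripped-down illustration of this tool-creation flow, with hypothetical file names, tool body, and target file, might look like the following:

```python
from pathlib import Path
import subprocess

# A hypothetical helper the model decides to create: list TODO/FIXME markers.
tool_source = '''\
import re
import sys

for path in sys.argv[1:]:
    for i, line in enumerate(open(path), 1):
        if re.search(r"TODO|FIXME", line):
            print(f"{path}:{i}: {line.rstrip()}")
'''

# The scaffold persists the tool so later shell commands can call it.
Path("tools").mkdir(exist_ok=True)
Path("tools/find_todos.py").write_text(tool_source)

# Toy target file so the example is self-contained (hypothetical content).
Path("src").mkdir(exist_ok=True)
Path("src/main.py").write_text("x = 1  # TODO: replace placeholder\n")

# A subsequent agent turn invokes the newly created tool like any other command.
subprocess.run(["python", "tools/find_todos.py", "src/main.py"], check=True)
```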

A plausible implication is that minimal agents buttressed by strong self-reflective and creation cues can match far more complex, tool-rich architectures under sufficient LLM capacity.

6. Adaptation to Embodied and Non-Standard Domains

Mini-SWE Agent frameworks have been generalized to non-traditional software engineering tasks—such as controller synthesis in embodied, MDP-based settings (Boulet et al., 24 Oct 2025):

  • The MSWEA framework leverages a two-level architecture where an LLM code-agent autonomously generates and iteratively refines Python controllers for embodied environments (e.g., Minigrid), with evaluation conducted over best@k policy performance (a best@k evaluation sketch follows this list).
  • Information access ablation studies reveal that interactive, dynamic probing of the environment recovers ∼80% of full performance, whereas static source code inspection provides only marginal gains.
  • In fully observable settings, best@5 success rates for navigation tasks reach 92%; manipulation, hazard, and memory tasks follow similar trends but reveal greater difficulty in partial observability.
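The best@k evaluation itself is straightforward. The sketch below assumes hypothetical `generate_controller` (producing one LLM-written policy) and `evaluate` (running one environment episode and returning 0/1 success) callables; it is not the MSWEA implementation.

```python
def best_at_k(generate_controller, evaluate, k: int = 5, episodes: int = 20) -> float:
    """best@k: sample k candidate controllers and report the best success rate.

    generate_controller: hypothetical callable producing one LLM-written policy.
    evaluate: hypothetical callable running one episode, returning 0/1 success.
    """
    scores = []
    for _ in range(k):
        controller = generate_controller()
        success = sum(evaluate(controller) for _ in range(episodes)) / episodes
        scores.append(success)
    return max(scores)
```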

This demonstrates that the minimalist agentic approach, originally focused on code edits and bash orchestration, transfers well to reinforcement learning–like controller synthesis domains when equipped with appropriate shell-level environment access.

7. Practical Implementation Guidelines and Limitations

Key steps and recommendations for building a Mini-SWE Agent include:

  • Leverage open-source LLMs (Qwen-2.5, Llama-2, Gemma) at 7B scale with FP16 or 8-bit quantization (a loading sketch follows this list).
  • Use curated subsets of benchmarks (e.g., SWE-Gym Lite) for rapid iteration and fine-tuning.
  • Prefer trajectory-level rejection sampling, with high-quality test-case curation for labeled supervision.
  • Incrementally extend the scaffold only when bottlenecks (e.g., excessive looping, false positives) are empirically evident.
  • Monitor limitations: SLMs without specialist pretraining may fail in complex reasoning tasks; absence of agent-aware controls leads to cycles and prompt overflow; fine-tuning data quality and diversity are critical for robust generalization.
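As a starting point, a ~7B open-weights model can be loaded with 8-bit quantization via Hugging Face Transformers and bitsandbytes; the model choice and settings below are illustrative, not prescribed by the cited papers.

```python
# Requires: transformers, accelerate, bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"   # illustrative choice of 7B-scale model
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_cfg,
    device_map="auto",   # shard across available GPUs / offload to CPU as needed
)
```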

Recent research indicates that the pressing frontier for Mini-SWE Agent advancement lies in composing hybrid approaches (agentic plus agentless training, as in Kimi-Dev (Yang et al., 27 Sep 2025)), integrating verifier-based inference, and enabling continuous live adaptation.

