Mini-SWE-Agent: Compact Agent Architecture
- Mini-SWE-Agent is a lightweight software engineering agent that autonomously analyzes, reasons about, and modifies codebases for tasks like bug fixing and controller synthesis.
- It features a modular design with an LLM controller, tool libraries, evaluative harnesses, and safety modules for efficient, iterative self-improvement.
- The approach integrates feedback-driven diff patching and dynamic exploration, setting new baselines on SWE benchmarks and embodied task environments.
A Mini-SWE-Agent is a compact software engineering agent architecture, typically based on an open or lightweight LLM, designed to autonomously analyze, reason about, and modify codebases for real-world tasks such as automated bug fixing, test-case generation, code navigation, or, in the embodied setting, controller generation for environments like Minigrid. The term encompasses a family of systems optimized for resource efficiency, modularity, and ease of deployment, often benchmarked against standard datasets such as SWE-Bench Verified, LiveCodeBench, or real-world embodied benchmarks. Mini-SWE-Agents have become widely adopted reference baselines for efficient reasoning-oriented automation in software engineering research.
1. Architectural Principles and Core Design
Mini-SWE-Agents are constructed around a modular set of subsystems, each mapping to a distinct function in an agentic workflow:
- LLM Controller: Invokes the underlying LLM, handles context window management, prompt injection, and output parsing.
- Tool/Action Library: Provides atomic and compositional tools for file/directory operations, command execution, code editing (e.g., via diff patches), and access to sub-agent skills (such as reasoning or verification).
- Test & Evaluation Harness: Runs candidate agent code or patches against problem-specific test suites, providing feedback for self-improvement or RL signals.
- Orchestrator: Coordinates iterative self-improvement, manages archives of agent snapshots and utilities, selects policy candidates, and applies improvement operators based on observed utilities or meta-reflection.
- Safety and Overseer Modules: Guard against runaway code edits, resource exhaustion, or undesirable and pathological actions.
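These module boundaries can be made concrete with a minimal interface sketch. The Python below is illustrative only (the class names, attributes, and defaults are assumptions rather than the actual Mini-SWE-Agent API), but it shows how the LLM controller, tool library, evaluation harness, and overseer reduce to small, composable components; the orchestrator corresponds to the self-improvement loop formalized below.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Tool(Protocol):
    """An atomic action the agent can invoke (file read, shell command, diff edit, ...)."""
    name: str
    def run(self, **kwargs) -> str: ...


@dataclass
class LLMController:
    """Wraps the underlying LLM: prompt assembly, context-window truncation, output parsing."""
    complete: Callable[[str], str]        # text-in, text-out call into the model
    max_context_chars: int = 60_000

    def step(self, history: list[str]) -> str:
        prompt = "\n".join(history)[-self.max_context_chars:]  # naive right-truncation
        return self.complete(prompt)


@dataclass
class EvaluationHarness:
    """Runs a candidate patch against task-specific tests and returns a scalar utility."""
    run_tests: Callable[[str], float]     # patch -> fraction of tests passing

    def utility(self, patch: str) -> float:
        return self.run_tests(patch)


@dataclass
class Overseer:
    """Safety guard: caps action budgets and vetoes pathological commands."""
    max_actions: int = 100
    forbidden: tuple = ("rm -rf", "shutdown")

    def allow(self, action: str, step: int) -> bool:
        return step < self.max_actions and not any(f in action for f in self.forbidden)
```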
As a representative formalization, the Mini-SWE-Agent self-improvement loop (Robeyns et al., 21 Apr 2025) is given by the following pseudocode:
```python
C = base_agent_code  # initialize
archive = []
for t in range(max_iters):
    A = build_agent(C)
    U_current = evaluate_on_Benchmarks(A)
    archive.append((C, U_current))
    prompt = build_reflection_prompt(archive, C)
    ΔC_candidates = LLM.generate_patches(prompt)
    best_ΔU = float("-inf"); best_patch = None
    for ΔC in ΔC_candidates:
        C_trial = apply_patch(C, ΔC)
        A_trial = build_agent(C_trial)
        U_trial = evaluate_on_Benchmarks(A_trial)
        ΔU = U_trial - U_current
        if ΔU > best_ΔU:
            best_ΔU = ΔU; best_patch = ΔC
    if best_ΔU > 0:
        C = apply_patch(C, best_patch)
    else:
        break
return C
```
2. Methods for Controller and Patch Generation
Mini-SWE-Agents generate code solutions in an iterative, tool-augmented reasoning framework, with methods tuned to task domain:
- Embodied Controller Generation (Boulet et al., 24 Oct 2025): For tasks like Minigrid, the agent orchestrates code synthesis to output a stateless or stateful controller, typically a Python function act(obs) → action (an illustrative controller sketch appears at the end of this section). The reasoning loop alternates between plan generation, code synthesis, static code analysis (if code access is permitted), dynamic exploration (if interactive probes are allowed), and feedback-driven code refinement. The agent operates under strict budget or trial constraints, optimizing for controller correctness and efficiency.
- Software Patch Generation (Pan et al., 30 Dec 2024; Robeyns et al., 21 Apr 2025; Chen et al., 3 Aug 2025; Wang et al., 9 Jun 2025; Yang et al., 27 Sep 2025): Agents read task specifications, traverse repositories via a file/action tool interface, and propose code edits using atomic or diff-based editors. Edits are validated by executing regression test suites or synthetic checks, with feedback fueling rollout-based fine-tuning or RL.
- Verifier Assistance: Trajectories (sequences of actions and observations) are scored post hoc by a verifier model, and best-of-N sampling over the scored candidates is used to boost the overall solve rate (cf. best@K/pass@K metrics).
The characteristic inference workflow is exemplified as:
```python
def best_of_n(agent, verifier, task, N=8):
    trajectories = []
    for i in range(N):
        traj = agent.rollout(task, temperature=0.5, max_turns=30)
        score = verifier.score(traj)
        trajectories.append((score, traj))
    best = max(trajectories, key=lambda x: x[0])[1]
    return best  # apply best patch
```
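For the embodied controller-generation modality described above, the artifact the agent ultimately emits is itself a small program. The following is a hedged sketch of the kind of stateless Minigrid controller a Mini-SWE-Agent might produce; the observation layout and action indices follow common Minigrid conventions, and the goal-seeking heuristic is purely illustrative, not taken from the cited work.

```python
import numpy as np

GOAL_OBJECT_ID = 8  # Minigrid object index for the goal tile (assumed convention)

def act(obs) -> int:
    """Stateless controller: turn until the goal is roughly ahead, then move forward."""
    grid = np.asarray(obs["image"])              # egocentric (view_w, view_h, 3) grid
    goal_cells = np.argwhere(grid[..., 0] == GOAL_OBJECT_ID)
    if goal_cells.size == 0:
        return 0                                 # goal not visible: turn left and rescan
    x, _ = goal_cells[0]
    center_x = grid.shape[0] // 2
    if x < center_x:
        return 0                                 # goal to the left: turn left
    if x > center_x:
        return 1                                 # goal to the right: turn right
    return 2                                     # goal roughly ahead: move forward
```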
3. Training Pipelines and Data Curation
The effectiveness of a Mini-SWE-Agent is highly sensitive to pretraining and fine-tuning pipelines, including:
- Supervised Fine-Tuning (SFT): On curated datasets of successful agent trajectories or code diffs. Filtering and weighting trajectories (e.g., via best@K or LLM-based scoring) are critical. SFT is typically performed with batch sizes (4–64), context windows (10k–64k tokens), AdamW or similar optimizers, and, if necessary, LoRA adapters to reduce update footprint (Pan et al., 30 Dec 2024, Chen et al., 3 Aug 2025, Wang et al., 9 Jun 2025, Yang et al., 27 Sep 2025).
- Reinforcement Learning (RL): Fine-tunes the agent in environments with stepwise execution and outcome-driven reward (passing tests). RL implementations leverage asynchronous rollouts, direct Docker-exec integration, and dynamic producer–consumer queues to maximize throughput and minimize evaluation latency (Chen et al., 3 Aug 2025).
- Test-Case and Evaluation Infrastructure: Automated LLM-based test-case synthesis is employed (Wang et al., 9 Jun 2025), often using multi-phase Gherkin-style descriptions and code skeletons.
- Data Filtering and Evaluation: Difficulty filtering (e.g., SPICE scores), trajectory deduplication, and noise suppression with secondary LLMs for post-hoc trajectory validation are standard.
- Scaling Laws: Performance (resolve rate R) increases approximately logarithmically with the number of curated trajectories N, i.e., R ≈ a·log N + b for fitted constants a and b.
For example, scaling N from 574 to 16,639 raises R from 13.0% to 22.8% for a 7B agent (Wang et al., 9 Jun 2025); a worked fit of this form is sketched below.
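The log-linear form above is an assumption consistent with the reported trend; the snippet below simply fits a and b to the two published data points and illustrates the implied extrapolation (not a result from the paper).

```python
import math

# Reported data points (Wang et al., 9 Jun 2025): (trajectories N, resolve rate R in %)
points = [(574, 13.0), (16_639, 22.8)]

# Solve R = a * ln(N) + b from the two points (assumes the log-linear form above).
(n1, r1), (n2, r2) = points
a = (r2 - r1) / (math.log(n2) - math.log(n1))
b = r1 - a * math.log(n1)
print(f"a ≈ {a:.2f}, b ≈ {b:.2f}")

# Illustrative extrapolation only: predicted resolve rate at N = 50,000 trajectories.
print(f"R(50k) ≈ {a * math.log(50_000) + b:.1f}%")
```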
Data curation pipelines for small models may involve tighter difficulty filters, smaller image budgets, and a focus on highly reliable synthetic rollouts (Chen et al., 3 Aug 2025).
4. Baseline Performance and Scaling Behavior
Mini-SWE-Agents serve as pragmatic baselines across standard SWE agent benchmarks and embodied tasks:
| Model or Setting | SWE-Bench Verified (%) | LiveCodeBench (%) | Minigrid (Navigation best@5) | Comments |
|---|---|---|---|---|
| 7B SFT + Verifier (Pan et al., 30 Dec 2024) | 13–14 | — | — | Best@8 sampling, 491 SFT trajectories |
| RepoForge-8B (Chen et al., 3 Aug 2025) | 17.4 | — | — | RL-polished, SFT warm-start, ≤8B |
| SWE-Dev-7B (Wang et al., 9 Jun 2025) | 23.4 | — | — | Data scaling law, ring-attn |
| Minigrid MSWEA (Boulet et al., 24 Oct 2025) | — | — | 0.91 (Code+Explore) | Fully obs, 20 tasks, best@5 |
| Kimi-Dev Mini (Yang et al., 27 Sep 2025) | 48.6 pass@1 (72B) | — | — | SFT agentic adaptation |
- Embodied Controller Generation (MSWEA) shows that:
- Test-only (no access): overall best@5 ≈ 0.13,
- Code-only: ≈0.17 (+4 points),
- Explore-only: ≈0.77 (+60 points),
- Code+Explore: ≈0.81 (+4 more points).
- Dynamic exploration delivers the dominant performance boost over static code analysis. Partial observability reduces success to ≈0.12 even with full access (Boulet et al., 24 Oct 2025).
- Traditional SWE Tasks: Mini-SWE-Agents achieve 13–23% on SWE-Bench Verified with 7–8B models using curated SFT data; RL or scaling to 14–32B yields 20–36%, and 72B with sophisticated agentic adaptation attains ≈48.6% (Pan et al., 30 Dec 2024, Wang et al., 9 Jun 2025, Chen et al., 3 Aug 2025, Yang et al., 27 Sep 2025).
- Resource/Compute Scaling: With 1–2 GPUs, it is possible to train or run 7–14B models with context windows of 10–16k tokens. Models at or below 1B parameters can, with tuned pipelines, attain 10–12% on SWE-Bench Verified using filtered and pruned workflows (Chen et al., 3 Aug 2025).
5. Specialized Modalities: Embodied Task Reasoning
Mini-SWE-Agents adapted for embodied controller synthesis (notably Minigrid environments) demonstrate unique architectural and algorithmic requirements (Boulet et al., 24 Oct 2025):
- Two-Level Agency: The "code-agent" designs a controller that itself acts as an agent within the simulated environment, implementing an act(obs) interface.
- Information Discovery Modalities:
- Static Code Analysis: Symbol and docstring access for environment source code, extracting minimal configuration and encoding necessary for correct code synthesis.
- Dynamic Exploration: Probe scripts that empirically discover transition and reward mechanics, e.g., probing action mappings, interaction rules, or boundary conditions (a minimal probe sketch follows this list).
- Access Conditions: Systematic ablation of code/exploration privileges reveals that static code analysis alone is marginally beneficial (+4–5 points), while dynamic exploration is overwhelmingly important (+60 points). Combination leads only to minor synergistic gains.
- Evaluation Regimes: best@k metrics reflect strong sampling benefits but reveal plateaus for particularly difficult or memory-dependent (partially observable) tasks.
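As a concrete illustration of the dynamic-exploration modality, the sketch below probes an environment's action mapping by replaying each action from an identical start state. The Gymnasium/Minigrid calls used are standard, but the environment name and probing strategy are illustrative assumptions.

```python
import gymnasium as gym
import minigrid  # noqa: F401  (import registers the MiniGrid-* environments)

env = gym.make("MiniGrid-Empty-5x5-v0")

for action in range(env.action_space.n):
    env.reset(seed=0)                            # identical start state for every probe
    pos0, dir0 = tuple(env.unwrapped.agent_pos), env.unwrapped.agent_dir
    _, reward, terminated, truncated, _ = env.step(action)
    pos1, dir1 = tuple(env.unwrapped.agent_pos), env.unwrapped.agent_dir
    print(f"action {action}: pos {pos0} -> {pos1}, dir {dir0} -> {dir1}, "
          f"reward {reward}, done {terminated or truncated}")
```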
6. Practical Recommendations, Limitations, and Future Research
Implementation Guidance:
- For practical deployments, restrict edits to diff-based operations; utilize tool APIs with strict output parsing to control context expansion and reduce latency or cost per action (Robeyns et al., 21 Apr 2025).
- Orchestrate agentic self-improvement with verifiable benchmarks, and include safety and audit trails for deployment robustness.
- Apply curriculum learning or task selection when scaling to large, heterogeneous benchmarks (Boulet et al., 24 Oct 2025).
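A minimal sketch of the diff-only edit discipline recommended above is given below. The search/replace patch format and the exact validation rules are assumptions; the point is that the tool parses edits strictly and rejects any that do not match the target file unambiguously.

```python
from pathlib import Path


def apply_search_replace(path: str, search: str, replace: str) -> str:
    """Apply a single search/replace edit; refuse ambiguous or non-matching edits."""
    text = Path(path).read_text()
    count = text.count(search)
    if count == 0:
        return f"REJECTED: search block not found in {path}"
    if count > 1:
        return f"REJECTED: search block matches {count} locations in {path}; make it unique"
    Path(path).write_text(text.replace(search, replace, 1))
    return f"OK: edited {path}"
```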
Recognized Limitations:
- Lack of generalization to richer, more complex (e.g., 3D, continuous-control) domains for embodied tasks.
- No accumulation of transferable skills across tasks in current baselines.
- In embodied settings, unconstrained exploration scripts may reflect impractical or unsafe assumptions for certain real-world deployments.
Future Directions:
- Establishment of hierarchical skill libraries (modular sub-controller composition),
- Use of adaptive curricula (e.g., OMNI, MAGELLAN) for sequencing,
- Integration of explicit memory and inference modules for partial observability,
- Extension to more challenging, visually-rich, or multi-modal environments to anchor generalization (Boulet et al., 24 Oct 2025).
Taken collectively, Mini-SWE-Agents serve as a reproducible, computationally accessible reference for efficient LLM-based software engineering automation, advancing empirical and theoretical understanding of reasoning under real-world constraints and providing a scalable foundation for future research in embodied and agentic AI.