SWE-Agent Framework & SWE-Search
- SWE-Agent Framework is an autonomous system that uses LLMs, integrated toolchains, and structured decision processes to automate complex software workflows on repositories.
- It features a multi-agent architecture—with an Action Agent, Value Agent, and Discriminator Agent—that employs MCTS and iterative self-feedback to optimize code modifications.
- Empirical results demonstrate a 23% relative improvement on Pass@1 benchmarks, highlighting the benefits of systematic search and collaborative solution selection.
A Software Engineering Agent ("SWE-Agent") framework refers to any autonomous system that leverages LLMs, integrated toolchains, and structured decision processes to automate complex software engineering workflows on source repositories. Among these, the SWE-Search framework stands out for its principled, non-linear, and self-improving architecture for repository-level code tasks, featuring multi-agent collaboration, Monte Carlo Tree Search (MCTS), iterative value-based feedback, and multi-agent debate mechanisms. SWE-Search establishes a new paradigm by coupling systematic search and dynamic self-critique, yielding substantial improvements over traditional sequential LLM agents on repository-scale software engineering benchmarks (Antoniades et al., 2024).
1. Architectural Foundations and Multi-Agent Design
SWE-Search is structured as a unified, multi-agent system composed of three specialized modules:
- SWE-Agent (Action Agent): Maintains a flexible state-space over a git-style commit tree. It can plan, search, edit, and execute tests in arbitrary order, invoking MCTS at every decision point to select from a high-level action set (e.g., “search for relevant files,” “edit this function,” “write a test”).
- Value Agent: For any state-action pair $(s, a)$, together with the full prior trajectory, it returns both a scalar utility estimate $\bar{v}(s)$ (quantifying downstream reward projection) and a qualitative natural-language explanation $v_{\text{expl}}(s)$ (rationalizing the numeric value and highlighting reasoning for next steps).
- Discriminator Agent: At termination (MCTS leaf nodes or on exceeding episode limits), up to five final patch candidates are passed to this agent, which runs a three-round structured debate. Each sub-agent presents and critiques patches, after which a judge agent selects the final solution.
This modular orchestration underpins a full decision pipeline: the Action Agent plans, the Value Agent evaluates and critiques, and the Discriminator Agent adjudicates among plausible candidate solutions.
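The three-module pipeline can be sketched as minimal Python interfaces. This is an illustrative sketch only: the class names, method signatures, and the simple top-5 shortlist are assumptions for exposition, not the framework's actual API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Evaluation:
    """Hybrid output of the Value Agent: a scalar estimate plus a critique."""
    value: float       # projected downstream reward for this state
    explanation: str   # natural-language rationale guiding next steps


class ActionAgent(Protocol):
    def propose_actions(self, state: str) -> list[str]: ...


class ValueAgent(Protocol):
    def evaluate(self, state: str, action: str, trajectory: list[str]) -> Evaluation: ...


class DiscriminatorAgent(Protocol):
    def debate(self, candidates: list[str]) -> str: ...


def pipeline(action: ActionAgent, value: ValueAgent, judge: DiscriminatorAgent,
             state: str) -> str:
    """One decision step: the Action Agent plans, the Value Agent scores,
    and the Discriminator Agent adjudicates among the top candidates."""
    candidates = action.propose_actions(state)
    scored = [(value.evaluate(state, a, []).value, a) for a in candidates]
    scored.sort(reverse=True)
    finalists = [a for _, a in scored[:5]]  # up to five finalists enter the debate
    return judge.debate(finalists)
```

In the real system each protocol is backed by LLM calls; here they are plain callables so the control flow is visible.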
2. MCTS Planning Integration and Decision Loop
MCTS is employed at each high-level decision point to balance exploration of alternative code modification pathways with exploitation of promising trajectories.
Tree Construction:
- Nodes encode repository states (obtained via git-like commit objects).
- Edges are high-level actionable commands.
A depth-augmented variant of Upper Confidence Bound for Trees (UCT) guides node selection:

$$\mathrm{UCT}(s) = \bar{v}(s) + c\,\sqrt{\frac{\ln N(\mathrm{parent}(s))}{N(s)}} + f_{\mathrm{depth}}(d(s))$$

where $\bar{v}(s)$ is the estimated value from the Value Agent, $N(\cdot)$ counts node visits, $c$ is the exploration constant ($1.41$ by default), and the depth-dependent term $f_{\mathrm{depth}}(d(s))$ modulates the utility based on node depth $d(s)$ to prefer meaningful expansions and avoid pathological deepening or premature cutoff.
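A concrete sketch of depth-augmented UCT scoring follows. The linear depth term used here (a mild bonus near the root tapering to a penalty near `max_depth`) is an illustrative assumption, not the paper's published formula; only the exploit/explore structure and the $c = 1.41$ default come from the text above.

```python
import math


def uct_score(value: float, visits: int, parent_visits: int,
              depth: int, c: float = 1.41,
              depth_bonus: float = 0.2, max_depth: int = 20) -> float:
    """Depth-augmented UCT: exploitation + exploration + depth modulation."""
    if visits == 0:
        return float("inf")  # unvisited children are always explored first
    exploit = value
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    # Illustrative depth term: encourage developing a line of work, but
    # taper toward a penalty as the node approaches the depth cap.
    depth_term = depth_bonus * (1 - 2 * depth / max_depth)
    return exploit + explore + depth_term
```

With this shape, rarely visited children gain a large exploration bonus, and very deep nodes are slightly discouraged, which is the qualitative behavior the depth augmentation is meant to produce.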
Workflow Synopsis (Pseudocode):
- Start at the root with initial state $s_0$.
- Select child nodes using UCT until a terminal or expandable node is reached.
- Expand up to max_expansions new actions from the Action Agent and instantiate children.
- Evaluate expanded leaf nodes using the Value Agent's estimate $\bar{v}(s)$.
- Backpropagate up the tree along the ancestor path, updating visit and cumulative value counts.
- On completion of the allotted iterations, return the child of the root with the most visits as the agent's chosen next action.
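The loop above can be condensed into a short, runnable MCTS skeleton. The `propose` and `evaluate` callables stand in for the Action and Value Agents; their signatures, and the `Node` layout, are assumptions for illustration.

```python
import math


class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.total = [], 0, 0.0


def uct(node, c=1.41):
    """Plain UCT score for an already-instantiated child node."""
    if node.visits == 0:
        return float("inf")
    return (node.total / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))


def search(root_state, propose, evaluate, iterations=100, max_expansions=5):
    """Select -> expand -> evaluate -> backpropagate; return the action
    (child state) with the most visits at the root."""
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # Selection/expansion: descend by UCT, expanding the first node
        # that still has untried actions and room for more children.
        while True:
            tried = {c.state for c in node.children}
            untried = [a for a in propose(node.state) if a not in tried]
            if untried and len(node.children) < max_expansions:
                child = Node(untried[0], parent=node)
                node.children.append(child)
                node = child
                break
            if not node.children:
                break  # terminal state: nothing left to expand
            node = max(node.children, key=uct)
        # Evaluation by the (stand-in) Value Agent, then backpropagation.
        reward = evaluate(node.state)
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state
```

In SWE-Search the states would be repository commits and the actions high-level commands; here strings keep the sketch self-contained.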
3. Hybrid Value Function and Self-Feedback
The hybrid value function $V(s) = (\bar{v}(s), v_{\text{expl}}(s))$ is central to SWE-Search's efficacy:
- $\bar{v}(s)$ quantifies expected future reward (e.g., all tests passing), shaping MCTS utility propagation.
- $v_{\text{expl}}(s)$ is a qualitative, natural-language explanation of the action's expected impact and potential shortcomings.
This hybrid signal not only supports transparent selection but also enables iterative, context-sensitive correction: during expansion, $v_{\text{expl}}(s)$ may identify missing test cases or erroneous assumptions, prompting targeted re-expansion of earlier nodes.
The overall agent objective is to maximize expected trajectory reward,

$$\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\left[R(\tau)\right],$$

where $R(\tau)$ is a sparse pass/fail reward (1 if all tests pass, 0 otherwise) with optional shaping for high-potential edits.
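The sparse reward with optional shaping can be written directly; the fractional-pass shaping used below is an assumed form of the "optional shaping" mentioned above, not a detail from the source.

```python
def sparse_reward(test_results: list[bool], shaping: float = 0.0) -> float:
    """Sparse pass/fail reward: 1.0 iff every test passes.

    With shaping > 0, a partially passing suite earns a small fraction
    of credit (an assumed shaping scheme for illustration).
    """
    if test_results and all(test_results):
        return 1.0
    frac = sum(test_results) / len(test_results) if test_results else 0.0
    return shaping * frac
```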
4. Iterative Self-Improvement and Hindsight Expansion
Iterative self-feedback is operationalized via the Value Agent's natural-language critiques. If a trajectory fails due to untested edge cases or misapplied logic, the critique $v_{\text{expl}}$ prompts the parent node to formulate new targeted actions (e.g., writing an additional test), update its prompt with clarified instructions, and recursively revisit the search space.
This “hindsight expansion” avoids blind rollouts and closely mirrors the iterative test/code/debug cycles of experienced human engineers. The process continues until either the revised patch passes all tests or search/inference budgets are exhausted.
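The test/critique/revise cycle reduces to a small control loop. All four callables are assumed stand-ins (in the real system they would be LLM-backed agent calls and containerized test runs), and the budget cap mirrors the search/inference limits described above.

```python
def hindsight_loop(apply_patch, run_tests, critique, revise, budget=5):
    """Iterate patch -> test -> critique -> revise until the tests pass
    or the revision budget is exhausted."""
    patch = apply_patch()
    for _ in range(budget):
        if run_tests(patch):
            return patch          # revised patch passes all tests
        feedback = critique(patch)        # natural-language v_expl critique
        patch = revise(patch, feedback)   # targeted re-expansion
    return patch                  # budget exhausted; return best effort
```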
5. Collaborative Solution Selection: Multi-Agent Debate
Upon completion of the MCTS phase, SWE-Search often possesses multiple high-quality candidate patches. To robustly select a final fix, these candidates undergo multi-agent debate:
- Up to five agents each present an advocacy statement for their candidate patch, citing code changes and test outcomes.
- For three rounds, agents critique competitors’ patches, highlighting weaknesses or overlooked issues.
- A “judge agent” applies a final assessment and selects the most compellingly defended patch.
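The debate protocol's control flow can be sketched as below. The `advocate`, `critique`, and `judge` callables are assumed stand-ins for LLM sub-agents; the transcript accumulation is an illustrative simplification of the structured exchange.

```python
def debate_select(candidates, advocate, critique, judge, rounds=3):
    """Multi-agent debate sketch: advocates present, critics respond for
    a fixed number of rounds, then a judge picks from the transcript."""
    transcript = [advocate(c) for c in candidates]   # advocacy statements
    for _ in range(rounds):
        for attacker in candidates:
            for target in candidates:
                if attacker != target:               # critique competitors only
                    transcript.append(critique(attacker, target))
    return judge(candidates, transcript)
```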
Empirically, the debate protocol increases final patch correctness from approximately 73% (Value Agent’s top prediction) to 84%, indicating substantial benefit from collaborative adversarial selection.
6. Hyperparameters, Compute Scaling, and Empirical Results
Key MCTS-related hyperparameters (default values):
- $c = 1.41$ (UCT exploration parameter)
- max_expansions per node = 5
- max_iterations = 100
- provide_feedback = True (enables hindsight re-expansion)
- max_depth = 20
- value_function_temperature = 0.2
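These defaults can be collected into one configuration object. The field names below mirror the list above, but the dataclass itself is an illustrative sketch, not the framework's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass
class SearchConfig:
    """Default MCTS search settings (illustrative schema)."""
    c: float = 1.41                  # UCT exploration constant
    max_expansions: int = 5          # children instantiated per node
    max_iterations: int = 100        # MCTS iteration budget
    provide_feedback: bool = True    # enables hindsight re-expansion
    max_depth: int = 20              # depth cap on the search tree
    value_function_temperature: float = 0.2  # sampling temp for Value Agent
```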
Performance scaling is a core aspect. SWE-Search yields a consistent ≈23% relative improvement in Pass@1 (top-1 success rate) on the SWE-bench benchmark, across five tested models including GPT-4o, Qwen2.5, and Llama-3.1, compared to structurally similar agents lacking MCTS. As the number of MCTS iterations increases (up to 100), resolved issues accumulate steadily, underscoring robust inference-time scaling—improvement is achieved not through addition of model parameters or external data, but through more intensive, reflective search (Antoniades et al., 2024).
7. Repository-Level Implementation and Practical Considerations
SWE-Search is instantiated over real-world codebases with the following engineering strategies:
- State Serialization: Each search node maps to a git commit; the environment can efficiently backtrack by “git checkout” to previous nodes.
- Prompt Sharding: Repository context is split into semantic “spans” (imports, functions, classes) to bound LLM prompt sizes; the Action Agent can dynamically request new spans as required.
- Isolated Testing Environments: Test execution is containerized (Docker or Kubernetes); the agent can run `pytest` or similar but is unaware of which tests fail, matching real-world engineering workflows.
- System Foundation: SWE-Search builds atop the moatless-tools open-source framework, extending it from a finite-state agent to a full tree search system.
This careful environment design ensures scalability to large repositories, support for backtracking, and interpretability of agent decisions and trajectories.
In summary, SWE-Search advances the field of LLM-based software engineering automation by integrating multi-agent tree search, hybrid quantitative/qualitative value estimation, iterative self-correction, and collaborative debate selection. By replacing predominantly linear, myopic agentic workflows with systematic search and self-improvement, SWE-Search achieves higher repository-level task success without additional model training or parameter growth, providing a new blueprint for generalizable, robust software agent frameworks (Antoniades et al., 2024).