Trae Agent: Repository-Level Issue Resolution
- Trae Agent is an LLM-based, agent-oriented ensemble reasoning system that resolves repository-level software issues by searching for optimal code patches.
- It employs a modular architecture with dedicated sub-agents for patch generation, hierarchical pruning, and rigorous selection to enhance accuracy and scalability.
- Evaluations on the SWE-bench Verified benchmark show a Pass@1 improvement with a mean uplift of 10.22 percentage points over the compared ensemble baselines.
Trae Agent is an LLM-based, agent-oriented ensemble-reasoning system for repository-level software issue resolution. It treats the problem as an optimal solution search over candidate code patches, using modular sub-agents for candidate generation, hierarchical pruning, and selection. It targets two limitations of prior prompting-based ensemble methods: they do not scale ensemble search efficiently, and they lack deep, cross-file understanding of the repository. On the SWE-bench Verified benchmark it reports a state-of-the-art Pass@1 of 75.20% and a mean uplift of 10.22 percentage points over competing baselines (Team et al., 31 Jul 2025).
1. Formal Problem Setup and Solution Space
Trae Agent frames repository-level software issue resolution as an optimal solution search:
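- Given: a codebase $C$, a natural-language software issue $I$, and a test suite $T$.
- Patch generation: candidate patches are generated as an ensemble $P = \{p_1, \dots, p_N\}$, with each $p_i$ generated from $(C, I)$.
- Selection objective: Find $p^* \in P$ such that applying $p^*$ to $C$ maximizes the probability that the patched codebase $C \oplus p^*$ passes the test suite $T$.
- Evaluation Metric: $\mathrm{Pass@1}$ is defined as
$$\mathrm{Pass@1} = \mathbb{E}_{P}\left[\mathbb{1}\left[\, C \oplus S(P) \text{ passes } T \,\right]\right]$$
where $S$ is the selection function, $C \oplus p$ denotes $C$ with patch $p$ applied, and the expectation is over sampled patch ensembles.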
The solution space is bounded in practice by the ensemble size $N$ (the number of candidate patches sampled per issue), which is also how scalability is managed at inference time.
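As a rough illustration of the metric, the sketch below estimates Pass@1 as the fraction of issue instances whose single selected patch passes the tests. The names `pass_at_1`, `select`, and `passes_tests` are hypothetical, the data is a toy example rather than benchmark output, and a real check would apply each patch to its own codebase and run the test suite.

```python
from typing import Callable, Sequence

Patch = str  # assumption: a patch is represented as plain text


def pass_at_1(
    instances: Sequence[Sequence[Patch]],
    select: Callable[[Sequence[Patch]], Patch],
    passes_tests: Callable[[Patch], bool],
) -> float:
    """Fraction of instances whose one selected patch passes the test suite."""
    if not instances:
        raise ValueError("need at least one instance")
    resolved = sum(1 for candidates in instances if passes_tests(select(candidates)))
    return resolved / len(instances)


# Toy ensembles: one list of candidate patches per issue (illustrative only).
ensembles = [["good-a", "bad-a"], ["bad-b", "bad-c"], ["bad-d", "good-b"]]


def is_good(patch: Patch) -> bool:
    """Stand-in for running the test suite on the patched codebase."""
    return patch.startswith("good")


def pick_first(candidates: Sequence[Patch]) -> Patch:
    """Naive selector: always take the first candidate."""
    return candidates[0]


def pick_oracle(candidates: Sequence[Patch]) -> Patch:
    """Oracle selector: take a passing candidate whenever one exists."""
    return next((p for p in candidates if is_good(p)), candidates[0])


naive_score = pass_at_1(ensembles, pick_first, is_good)
oracle_score = pass_at_1(ensembles, pick_oracle, is_good)
```

The gap between the naive and the oracle selector on the same ensembles is the headroom that a better selection function can recover.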
2. Modular Agent Architecture
Trae Agent comprises three interconnected modular sub-agents:
- Generation Agent ("Coder Agent"):
  - Inputs: the codebase and the issue description.
  - Uses LLMs coupled with a toolset (file-edit, bash, sequential-thinking, done-signal) to sequentially (1) analyze the issue, (2) localize the relevant code, (3) reproduce the fault, (4) diagnose it, (5) synthesize a candidate patch, (6) re-test, and (7) compose a commit summary.
  - Operates at a high sampling temperature and optionally round-robins across multiple LLM backbones (Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4.1), yielding diverse solution candidates.
- Pruning Agent:
  - Deduplicates patches by normalizing them (stripping whitespace and comments via unidiff) and removing syntactically or semantically equivalent forms.
  - Applies regression testing: it extracts existing passing regression tests, refines them with an LLM-based tester agent, and discards any patch that fails them.
  - Delivers a reduced ensemble (typically 30% smaller) that is empirically likely to retain the correct solution.
- Selection Agent ("Selector Agent"):
  - Consumes the pruned ensemble and operates with repository-level read-only access and a test executor.
  - Implements iterative repository context enrichment over up to 30 prompt-tool rounds: static review (import graphs, dependency mapping, context diffs) and dynamic verification (on-the-fly unit test generation and execution).
  - Each candidate is voted for or against, and voting stops early once a majority is reached. The final selection is determined by majority consensus across multiple LLM voter runs.
This architecture allows cascading reductions in candidate space, precise fault localization, and robust selection in complex multi-file repositories.
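The pruning and voting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: patches are plain strings normalized with a crude regex (the source mentions unidiff, which is not used here), the regression check and the voter are caller-supplied stand-ins, and the early-stopping threshold (a strict majority of `max_rounds`) is an assumption. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import Callable, List, Sequence


def normalize(patch: str) -> str:
    """Crude normal form: drop '#' comments and all whitespace on every line."""
    kept_lines = []
    for line in patch.splitlines():
        code = line.split("#", 1)[0]  # naive: also cuts at '#' inside strings
        code = re.sub(r"\s+", "", code)
        if code:
            kept_lines.append(code)
    return "\n".join(kept_lines)


def prune(patches: Sequence[str], passes_regression: Callable[[str], bool]) -> List[str]:
    """Keep the first patch of each normal form, then drop regression failures."""
    seen = set()
    kept = []
    for patch in patches:
        key = normalize(patch)
        if key in seen:
            continue  # duplicate up to whitespace and comments
        seen.add(key)
        if passes_regression(patch):
            kept.append(patch)
    return kept


def majority_vote(
    candidates: Sequence[str],
    vote: Callable[[Sequence[str], int], int],
    max_rounds: int,
) -> str:
    """Collect votes (candidate indices); stop once one index has a strict majority."""
    if not candidates or max_rounds < 1:
        raise ValueError("need candidates and at least one voting round")
    tally = Counter()
    for round_index in range(max_rounds):
        choice = vote(candidates, round_index)
        tally[choice] += 1
        if tally[choice] > max_rounds // 2:  # assumed early-stopping rule
            return candidates[choice]
    return candidates[tally.most_common(1)[0][0]]


# Toy run: four raw patches, one duplicate, one regression failure.
patches = ["+x = 1  # fix", "+x=1", "+x = 2", "+x = 3"]
kept = prune(patches, lambda p: "3" not in p)

scripted_votes = [1, 0, 1, 1, 0]  # stand-in for independent LLM voter runs
calls = []


def scripted_voter(cands: Sequence[str], round_index: int) -> int:
    calls.append(round_index)
    return scripted_votes[round_index]


winner = majority_vote(kept, scripted_voter, max_rounds=5)
```

In this toy run the fifth voter call is never made, because the second candidate already holds three of five possible votes after the fourth round.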
3. Test-Time Scaling and Resource Adjustability
Trae Agent enables test-time scaling without model retraining:
- Ensemble size $N$ can be tuned, with accuracy increasing linearly as $N$ grows within the evaluated range.
- Diversity is supplied by adjusting the generation agent's sampling temperature and by interleaving different LLM backbones.
- Beam-search variants can optionally be implemented, although the primary experiments use sampling-based diversity.
- Empirically, prompting-only baselines show non-monotonic scaling, peaking at an intermediate ensemble size, while Trae Agent sustains monotonic accuracy gains with increasing $N$.
This test-time scaling property directly translates to controllable accuracy-cost trade-offs.
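A minimal sketch of how ensemble size and backbone interleaving might be exposed as inference-time knobs is shown below. The function `generate_ensemble`, the `sample_patch` callback, and the default temperature are assumptions for illustration; the backbone strings are plain labels taken from the text, not API model identifiers.

```python
from itertools import cycle, islice
from typing import Callable, List, Sequence, Tuple


def generate_ensemble(
    issue: str,
    backbones: Sequence[str],
    sample_patch: Callable[[str, str, float], str],
    n: int,
    temperature: float = 1.0,  # assumed value; the text only says "high"
) -> List[Tuple[str, str]]:
    """Sample n candidate patches, rotating through the backbones round-robin."""
    if n < 1 or not backbones:
        raise ValueError("need n >= 1 and at least one backbone")
    ensemble = []
    for backbone in islice(cycle(backbones), n):
        ensemble.append((backbone, sample_patch(backbone, issue, temperature)))
    return ensemble


def fake_sampler(backbone: str, issue: str, temperature: float) -> str:
    """Stand-in for a real LLM call; returns a labeled placeholder patch."""
    return f"patch from {backbone} for {issue!r} at T={temperature}"


backbones = ["Gemini 2.5 Pro", "Claude 3.7 Sonnet", "GPT-4.1"]
ensemble = generate_ensemble("issue-123", backbones, fake_sampler, n=5)
order = [backbone for backbone, _ in ensemble]
```

Raising `n` increases the number of generation runs, and therefore cost, roughly in proportion, which is the accuracy-cost trade-off the section describes.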
4. Repository-Level Contextual Reasoning
Trae Agent’s selection agent achieves repository-level understanding through:
- Cross-file static analysis using a File-Editing tool, traversing import chains and transitive dependency graphs (parsed via AST/imports).
- Asynchronous summarization (the "lakeview" summarizer), which maintains a succinct, dynamically updated LLM memory.
- Dynamic analysis that generates and executes new unit tests on each candidate, capturing behavioral traces.
- Repeated, structured prompt-tool interactions (up to 30 rounds) that check whether each patch is semantically and behaviorally consistent with the target issue.
This capability is central to scaling issue resolution from file-level tasks to realistic, multi-file, dependency-rich repositories.
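The idea of traversing import chains parsed from the AST can be illustrated with Python's standard `ast` module. This is a sketch under simplifying assumptions: the repository is modeled as a dictionary from module name to source text, relative imports are ignored, and the helper names `direct_imports` and `transitive_dependencies` are hypothetical rather than taken from the project.

```python
import ast
from typing import Dict, Set


def direct_imports(source: str) -> Set[str]:
    """Module names imported by absolute import statements in the source."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module)
    return found


def transitive_dependencies(module: str, sources: Dict[str, str]) -> Set[str]:
    """In-repository modules reachable from `module` through import chains."""
    seen: Set[str] = set()
    stack = [module]
    while stack:
        current = stack.pop()
        for dep in direct_imports(sources.get(current, "")):
            if dep in sources and dep not in seen:  # skip external packages
                seen.add(dep)
                stack.append(dep)
    seen.discard(module)  # a cycle may lead back to the starting module
    return seen


# Toy repository with an import cycle (models -> app) and an external import (os).
repo = {
    "app": "import services\n",
    "services": "from models import User\nimport os\n",
    "models": "import app\n",
}
deps = transitive_dependencies("app", repo)
```

A selector could use such a dependency set to decide which files to read before judging whether a patch to one module is consistent with its callers.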
5. Experimental Results and Ablation Studies
Evaluations were conducted on the SWE-bench Verified benchmark (500 curated GitHub issues):
- Backbones: Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4.1; default reports use Claude 3.7 Sonnet.
- Metric: Pass@1 for repository-level fixes.
- Key Outcomes:
- Trae Agent (N=3) achieves 66.40% Pass@1, outperforming the best ensemble baseline (Augment + Pruning, 64.33%) by +2.07 points.
- Across all baselines (including DeiBase, with/without pruning), mean uplift is +10.22 pp.
- “Mixture” mode (round-robin 3 LLMs): 65.67% (oracle upper bound: 73.40%).
- Public leaderboard: 75.20% Pass@1 (first place).
- Ablations (Claude 3.7): No pruning (–6.80 pp), no deduplication (–2.73 pp), no regression test filtering (–3.32 pp), prompt-only selector (–4.00 pp), no selector voting (–2.80 pp).
- Monotonicity: Pass@1 increases with ensemble size $N$ for Trae Agent, whereas prompting baselines saturate or regress at higher $N$.
- Correlation: After pruning, smaller ensembles (i.e., less redundant, more curated) correlate strongly with higher Pass@1, as measured by the Pearson correlation coefficient.
6. Open-Source Availability and Reproducibility
- Full implementation, including multi-LLM support, tool wrappers, agent prompts, and evaluation scripts, is open-sourced at https://github.com/bytedance/trae-agent.
- Containerized (Docker) scripts support end-to-end experiment re-runs and leaderboard reproduction.
- Modular codebase enables external adaptation and benchmarking.
7. Significance and Implications
Trae Agent is presented as the first system to instantiate agent-based ensemble reasoning for end-to-end, repository-level software issue resolution. Its hierarchical decomposition into generation, pruning, and selection agents enables efficient search over large solution spaces, deep repository understanding, and a test-time scaling interface, yielding robust improvements over prior ensemble prompting frameworks. Key implications are:
- The hierarchy and diversity management enable sustained accuracy benefits as ensemble size increases.
- Agent-based modularization facilitates extensibility (e.g., new pruning criteria, alternative static/dynamic analyses).
- Trae Agent’s method provides an empirical foundation for further systematic investigation of agentic decomposition and ensemble reasoning in software engineering tasks.
A plausible implication is that similar agent-based, modular ensemble architectures could generalize to other complex, open-ended reasoning domains requiring both semantic synthesis and behavioral validation (Team et al., 31 Jul 2025).