SWE-Effi: Resource Effectiveness Metrics
- SWE-Effi is a framework that quantifies software AI agent effectiveness under resource constraints using normalized, AUC-based metrics.
- It benchmarks agents by tracking tokens, cost, CPU, and inference time to re-rank leaderboards and expose trade-offs in efficiency.
- Key findings reveal token snowball effects and budget trade-offs, emphasizing the synergy between agent scaffolds and LLM backends.
SWE-Effi refers to the class of metrics, benchmarks, and methodologies that jointly quantify the effectiveness of software AI agents under resource constraints, particularly in the context of LLM-driven code agents and software engineering (SWE) tasks. As LLM agents for repository-level code modifications mature, traditional leaderboards focused exclusively on resolution or correctness metrics have proven insufficient for assessing agents’ real-world utility. SWE-Effi frameworks address this gap by integrating both outcome quality (e.g., resolve/pass@1 rates) and multidimensional resource consumption (tokens, time, cost)—enabling principled, reproducible evaluation of agentic systems under practical deployment scenarios (Fan et al., 11 Sep 2025).
1. Formalization of Resource Effectiveness Metrics
SWE-Effi introduces the concept of “resource effectiveness” (RE), a normalized, area-under-curve (AUC)-based family of metrics that jointly consider (a) the cumulative resolve rate as a function of incrementally expended resource and (b) a fixed resource cap relevant to real-world deployments. Let $N$ be the number of issues, $x$ a resource (tokens, cost, CPU or inference time), $R(x)$ the cumulative fraction of the $N$ issues resolved using at most $x$ units, and $B$ the budget cap. The RE metric for a given resource is:

$$\mathrm{RE} = \frac{1}{B}\int_0^B R(x)\,dx$$

Reported values lie in $[0, 1]$ (often expressed as percentages). Four principal SWE-Effi metrics are defined:
- Effectiveness under Token Budget (EuTB): $x$ = total tokens, with $B$ a token budget
- Effectiveness under Cost Budget (EuCB): $x$ = dollar cost, with $B$ a dollar budget
- Effectiveness under CPU Time Budget (EuCTB): $x$ = CPU seconds, with $B$ a CPU-time budget
- Effectiveness under Inference Time Budget (EuITB): $x$ = normalized LLM inference seconds, with $B$ an inference-time budget
This formulation inherently weights early successes more: quickly solved issues contribute greater area under the curve, while late or no solutions contribute little. A linear regression normalizes inference time to remove hardware variance (Fan et al., 11 Sep 2025).
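As a minimal sketch, the RE definition above can be computed from per-issue logs of resource consumption and outcome; the issue data below is hypothetical:

```python
def resource_effectiveness(consumed, resolved, budget):
    """AUC of the cumulative resolve-rate curve R(x), normalized by budget B.

    consumed[i] -- resource units spent on issue i (tokens, dollars, seconds)
    resolved[i] -- True if issue i was resolved
    budget      -- cap B; resolutions costing more than B contribute nothing
    """
    n = len(consumed)
    # Resource levels at which R(x) steps up: successful issues within budget.
    steps = sorted(c for c, ok in zip(consumed, resolved) if ok and c <= budget)
    area = 0.0
    for k, x in enumerate(steps):
        # After the (k+1)-th success, R(x) = (k+1)/n until the next step (or B).
        nxt = steps[k + 1] if k + 1 < len(steps) else budget
        area += (k + 1) / n * (nxt - x)
    return area / budget  # value in [0, 1]

tokens = [10_000, 40_000, 250_000, 900_000]
solved = [True, True, False, True]
# The third issue fails and the fourth exceeds the cap, so neither adds area.
print(resource_effectiveness(tokens, solved, budget=500_000))
```

Note how the cheap early successes dominate the score: a solution arriving near the budget cap contributes almost no area, matching the early-success weighting described above.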
2. Benchmarking Methodologies and Experimental Design
SWE-Effi’s empirical analysis uses realistically sized GitHub issue pools (e.g., 50 stratified SWE-bench-Verified issues) and evaluates a spectrum of agentic scaffolds (AutoCodeRover, OpenHands, SWE-Agent, Agentless Pipelines) paired with diverse LLM backends (e.g., GPT-4o-mini, Llama, Qwen3-32B). For each agent/model configuration:
- Per-issue resource consumption is meticulously tracked (tokens, wall-time, inference time, dollar cost).
- Each effectiveness metric (EuTB, EuCB, EuCTB, EuITB) is computed up to the corresponding resource cap.
- Leaderboards are re-ranked for each dimension, exposing trade-offs and cases where token-efficiency diverges from time- or cost-efficiency.
The evaluation strictly controls for agent parallelism and disables pipeline optimizations to ensure fair comparison of intrinsic scaffold–model synergies (Fan et al., 11 Sep 2025).
3. Key Findings and Interpretive Analysis
Multi-dimensional Rankings
SWE-Effi’s metrics substantially alter leaderboard rankings relative to raw resolve rates. For example, Agentless+Qwen3-32B achieves EuTB = 46.7% and EuCB = 47.1%—the highest observed—while AutoCodeRover+Qwen3 leads in EuCTB (37.9%). Some configurations (e.g., Agentless+Qwen3) that rank highest in token/cost efficiency display a relative drop in time efficiency, reflecting divergent per-token inference speeds.
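The re-ranking effect can be illustrated with a toy leaderboard. The EuTB (46.7%) and EuCTB (37.9%) entries echo the figures above; all other numbers are placeholders, not reported results:

```python
# Illustrative leaderboard: (configuration, resolve rate, EuTB, EuCTB).
leaderboard = [
    ("Agentless+Qwen3-32B",     0.42, 0.467, 0.310),
    ("AutoCodeRover+Qwen3-32B", 0.40, 0.350, 0.379),
    ("SWE-Agent+GPT-4o-mini",   0.44, 0.220, 0.250),
]

def ranked(by):
    """Order configurations by one metric column, best first."""
    cols = {"resolve": 1, "EuTB": 2, "EuCTB": 3}
    return [row[0] for row in sorted(leaderboard, key=lambda r: -r[cols[by]])]

print(ranked("resolve"))  # quality-only ordering
print(ranked("EuTB"))     # token-budget ordering: a different leader
print(ranked("EuCTB"))    # CPU-time ordering: yet another leader
```

Each resource axis yields a different leader, which is exactly the divergence SWE-Effi is designed to expose.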
Token Snowball and Expensive Failures
“Token snowball” denotes the cumulative prompt-size growth from naively appending each LLM call’s text. This causes:
- Quadratic ($O(T^2)$) growth in cumulative token use over $T$ agent turns.
- Higher latency per call and worsened model attention.
- “Expensive failure” patterns where unresolved issues consume 4× more resources before timing out (e.g., 8.8M vs. 1.8M tokens in SWE-Agent+GPT-4o-mini) (Fan et al., 11 Sep 2025).
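The snowball dynamic can be sketched in a few lines; the per-turn sizes below are illustrative, and the fixed-size summary stands in for any compressed-memory scheme:

```python
def cumulative_tokens(turns, per_turn, summarize=False, summary_size=200):
    """Total tokens sent across all LLM calls in an agent run.

    Without summarization the context carries every prior turn, so call i
    sends ~i * per_turn tokens and the total grows quadratically in turns.
    With a fixed-size summary, each call sends a bounded context instead.
    """
    total, context = 0, 0
    for _ in range(turns):
        total += context + per_turn  # tokens sent on this call
        context = summary_size if summarize else context + per_turn
    return total

snowball = cumulative_tokens(turns=40, per_turn=500)
compact = cumulative_tokens(turns=40, per_turn=500, summarize=True)
print(snowball, compact, round(snowball / compact, 1))
```

With these toy numbers the naive transcript costs over an order of magnitude more tokens than the summarized context, which is the mechanism behind the “expensive failure” pattern: long unresolved runs are exactly the ones that accumulate the largest prompts.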
Budget Trade-offs
Explicit trade-offs are observed: configurations that are highly efficient under a token or dollar budget (EuTB, EuCB) are not always the fastest (EuITB, EuCTB), and vice versa. This matters when choosing between deployment regimes that prioritize response latency (e.g., RL rollouts) and those driven by monetary budgets.
4. Integration with Related Efficiency and Experience-Reuse Frameworks
SWE-Effi’s metrics have shaped the evaluation of adjacent methodologies aimed at efficiency improvement:
- Experience-driven reuse: Frameworks like SWE-ContextBench, EET, and SWE-Replay anchor their claims of “efficiency gains” in SWE-Effi-style metrics. For example, SWE-ContextBench measures time/cost efficiency as ratios relative to a baseline, demonstrating double-digit improvements (up to 60% runtime reduction on the hardest decile) when using succinct, well-selected experience summaries (Zhu et al., 9 Feb 2026).
- Profiling and self-optimization: EffiLearner formalizes execution-time and memory-usage reductions (e.g., 87.1% and 90.8%, respectively) for LLM-generated code. This self-optimization loop aligns with the SWE-Effi agenda by evidencing substantial resource reduction without sacrificing correctness (Huang et al., 2024).
- Context management: SWE-Pruner offers context compression (23–54% token reduction) while maintaining accuracy, thereby improving token- and cost-efficiency as measured by SWE-Effi metrics (Wang et al., 23 Jan 2026).
- Early termination and futility detection: EET applies structured experience and confidence-based thresholds to curtail unproductive agentic iterations, achieving 19–55% cost reductions at negligible performance loss, as reflected in resource-normalized resolve rates (Guo et al., 9 Jan 2026).
5. Methodological Implications and Recommendations
SWE-Effi’s multidimensional effectiveness metrics drive several actionable agent design and evaluation principles:
- Optimizing for Synergy: Effectiveness emerges from interaction effects between LLM backend and agent scaffold, rather than from either in isolation.
- Memory abstraction: Replacing transcript accumulation with salient-fact or compressed state extraction mitigates token snowball, directly boosting EuTB and EuCB.
- Budget-aware planning: Incorporating futility signals and early-abort mechanisms (as in EET) prevents excessive resource consumption on unsolvable problems.
- RL fine-tuning: Lightweight scaffolds with high EuITB/EuCTB are preferable for scalable RL, as resource-inefficient scaffolds slow down trajectory generation and inflate cost.
- Open, extensible leaderboards: SWE-Effi enables cost-aware community benchmarking, accommodating new models, scaffolds, and resource axes.
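The budget-aware planning principle above can be sketched as a simple abort rule. The threshold values and the scalar progress signal are assumptions for illustration, not the EET mechanism:

```python
def should_abort(tokens_spent, token_budget, recent_progress, min_progress=0.05):
    """Abort when most of the budget is gone and recent turns show no progress.

    recent_progress -- hypothetical scalar in [0, 1], e.g. the fraction of
    failing tests fixed over the last few turns; 0.0 means no movement.
    """
    budget_fraction = tokens_spent / token_budget
    return budget_fraction > 0.5 and recent_progress < min_progress

print(should_abort(600_000, 1_000_000, recent_progress=0.0))  # stuck and costly
print(should_abort(100_000, 1_000_000, recent_progress=0.0))  # still cheap to try
```

Cutting off runs like the first case directly attacks the expensive-failure pattern: the resources saved on unsolvable issues translate into higher EuTB/EuCB without affecting the resolve rate.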
6. Broader Context and Limitations
SWE-Effi is domain-agnostic in principle, but all reported benchmarks to date pertain to SWE-bench-derived repositories and Python code. Metrics based on token, cost, and time budgets assume relatively uniform per-token pricing and comparable hardware environments; normalization formulas attempt to address this but may require refinement as hardware/model diversity increases. A limitation is that SWE-Effi does not currently integrate memory or energy consumption directly—important dimensions highlighted in studies of agentic energy efficiency (SWEnergy) (Tripathy et al., 10 Dec 2025). A plausible implication is that extending SWE-Effi to blended metrics incorporating energy or memory would yield more holistic agent assessments.
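One way the blended-metric extension could look is a weighted geometric mean over per-axis effectiveness scores, with energy as a hypothetical additional axis. The axes, weights, and scores below are assumptions, not part of SWE-Effi as published:

```python
import math

def blended_effectiveness(scores, weights):
    """Weighted geometric mean of per-axis RE scores, each in [0, 1].

    A geometric mean penalizes configurations that collapse on any single
    axis, unlike an arithmetic average. Weights must sum to 1.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return math.exp(sum(w * math.log(max(scores[axis], 1e-9))
                        for axis, w in weights.items()))

scores = {"EuTB": 0.47, "EuCB": 0.47, "EuITB": 0.30, "energy": 0.25}
weights = {"EuTB": 0.3, "EuCB": 0.3, "EuITB": 0.2, "energy": 0.2}
print(round(blended_effectiveness(scores, weights), 3))
```

The design choice of a geometric rather than arithmetic mean reflects the limitation noted above: an agent that is token-efficient but energy-profligate should not be able to average its way to a high blended score.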
7. Future Directions
Planned directions for SWE-Effi and its associated ecosystem include:
- Scaling to full benchmarks: Moving from 50-task subsets to the comprehensive SWE-bench pool.
- Fine-grained resource axes: Incorporating additional constraints (e.g., GPU memory, energy) as first-class dimensions.
- Adaptive trade-off tuning: Designing agentic policies informed by online SWE-Effi metric tracking, dynamically balancing resource dimensions in deployment.
- Automated futility learning: Embedding reinforcement or meta-learning to signal early termination based on evolving cost–effectiveness curves.
- Community extensibility: Open data, code, and leaderboard platforms enable collaborative agent design under standardized SWE-Effi constraints.
SWE-Effi provides a rigorous, multidimensional standard for judging software agent effectiveness under resource constraints, unifying advances in experience reuse, context management, code optimization, and agent design under a reproducible, quantitative framework (Fan et al., 11 Sep 2025).