- The paper presents Effective Feedback Compute (EFC) as the primary scaling coordinate, reporting fits with R² up to 0.99, well above those obtained from raw compute measures.
- By isolating feedback quality from expenditure, the study shows that improving per-event EFC can dramatically boost success rates from 0.27 to 0.90 in controlled experiments.
- The research underscores that both harness efficiency and task-demand normalization are critical for optimizing agent system design and ensuring robust performance across diverse tasks.
Scaling Laws for Agent Harnesses via Effective Feedback Compute
Introduction and Motivation
The paper "Scaling Laws for Agent Harnesses via Effective Feedback Compute" (2605.29682) advances the discourse on test-time scaling of agentic systems by identifying a new fundamental scaling variable—Effective Feedback Compute (EFC)—for agent harnesses. Traditional analyses of scaling in interactive LLM systems often rely on proxies of computation such as tokens generated, tool calls, or wall time. These proxies fail to capture the actual flow of problem-solving: whether the agent’s interactions yield informative, valid, non-redundant, and retained feedback that drives iterative refinement and ultimately determines solution success. The study systematically isolates and formalizes EFC as the principal quantity predicting harness-level performance across diverse agent tasks and system architectures.
EFC is introduced as a trace-level scalar that credits only those feedback events which (1) reveal task-relevant information, (2) are grounded in reliable evidence, (3) are non-redundant with respect to prior trajectory states, and (4) materially update the agent’s internal plan, memory, or candidate solution. This selectivity distinguishes EFC from aggregate raw expenditure by emphasizing interaction quality and downstream utility over quantity.
The event-level EFC assignment for each feedback event $e_t$ is governed by the product of bounded informativeness, validity, non-redundancy, and memory-retention factors, $\mathrm{EFC}_t = k \cdot I_t V_t R_t M_t$, where $k$ is a scaling constant. The run-level EFC aggregates these event scores, and a task-demand normalization $\mathrm{EFC}/D_{\text{task}}$ allows direct comparison across tasks of varying feedback requirements. In real execution traces, the authors introduce Non-Redundant Stable EFC (NRS-EFC) to discount redundant or ephemeral observations, reflecting only durable, impactful feedback.
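The event-level product and its run-level aggregation can be illustrated with a minimal Python sketch. The class and function names are hypothetical. Two details are assumptions rather than statements from the paper: each factor is treated as lying in $[0, 1]$, and the run-level aggregate is taken to be a plain sum of event scores.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FeedbackEvent:
    """One feedback event e_t; each factor is assumed to lie in [0, 1]."""
    informativeness: float  # I_t
    validity: float         # V_t
    non_redundancy: float   # R_t
    retention: float        # M_t


def _clamp01(x: float) -> float:
    """Clamp a factor into [0, 1] so every event score stays bounded."""
    return max(0.0, min(1.0, x))


def event_efc(event: FeedbackEvent, k: float = 1.0) -> float:
    """EFC_t = k * I_t * V_t * R_t * M_t."""
    return k * (
        _clamp01(event.informativeness)
        * _clamp01(event.validity)
        * _clamp01(event.non_redundancy)
        * _clamp01(event.retention)
    )


def run_efc(events: Iterable[FeedbackEvent], k: float = 1.0) -> float:
    """Run-level EFC, assumed here to be the sum of event scores."""
    return sum(event_efc(e, k) for e in events)


def normalized_efc(events: Iterable[FeedbackEvent], d_task: float, k: float = 1.0) -> float:
    """Task-normalized coordinate EFC / D_task."""
    if d_task <= 0:
        raise ValueError("d_task must be positive")
    return run_efc(events, k) / d_task


if __name__ == "__main__":
    trace = [
        FeedbackEvent(1.0, 1.0, 1.0, 1.0),  # fully effective event
        FeedbackEvent(0.5, 1.0, 1.0, 0.5),  # partly informative, partly retained
        FeedbackEvent(0.9, 0.8, 0.0, 1.0),  # fully redundant, so it earns no credit
    ]
    print(run_efc(trace))              # 1.25
    print(normalized_efc(trace, 2.5))  # 0.5
```

Because the score is a product, any single factor at zero (for example a fully redundant observation) removes the event's credit entirely, which is the selectivity the definition is meant to capture.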
Experimental Methodology
A spectrum of task domains is considered: synthetic tasks with precise oracle feedback, semi-realistic executable code tasks, and real benchmark slices (including HumanEval, Terminal-Bench, and SWE-bench Verified), culminating in prospective validation on unseen real traces. Multiple harnesses, ranging from direct-answer through checklist-verify, routed-tools, and stateful-memory to deep closed-loop mechanisms, are paired with state-of-the-art base models. Throughout, repeated runs and model sweeps control for stochasticity and ensure statistical robustness.
Multiple scalar predictors are compared: raw tokens, tool calls, wall time, operations, raw cost, and the strongly multivariate System-level Agent System scaling baseline (SAS). The core metric is the explanatory power (R² and MAE) of each coordinate with respect to empirical failure rates under a unified power-model scaling law.
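To make the comparison concrete, the following sketch fits a power law to failure rates for a single scalar coordinate and scores it with R² and MAE. It is an illustration under stated assumptions, not the paper's fitting procedure: the functional form `failure ~= a * x**(-b)`, the log-log least-squares fit, and all function names are hypothetical choices made here.

```python
import math
from typing import List, Sequence, Tuple


def fit_power_law(x: Sequence[float], failure: Sequence[float]) -> Tuple[float, float]:
    """Fit failure ~= a * x**(-b) by ordinary least squares in log-log space."""
    n = len(x)
    if n != len(failure) or n < 2:
        raise ValueError("need at least two paired observations")
    if any(v <= 0 for v in x) or any(f <= 0 for f in failure):
        raise ValueError("log-log fit requires strictly positive values")
    lx = [math.log(v) for v in x]
    ly = [math.log(f) for f in failure]
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((u - mx) * (w - my) for u, w in zip(lx, ly)) / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)


def predict(x: Sequence[float], a: float, b: float) -> List[float]:
    """Predicted failure rates under the fitted power law."""
    return [a * v ** (-b) for v in x]


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination on the failure-rate scale."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


def mae(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error on the failure-rate scale."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
```

Running the same fit once per candidate coordinate (tokens, tool calls, EFC, and so on) and comparing the resulting R² and MAE values is one way to reproduce the kind of comparison described above.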
Empirical Findings
EFC vs. Raw Compute
Across all domains, EFC-based coordinates, particularly task-normalized EFC ($\mathrm{EFC}/D_{\text{task}}$), show a substantially stronger fit to observed failure rates than any raw-compute baseline or even the multivariate system predictor (SAS). For example, on synthetic tasks, raw tokens and tool calls reach $R^2$ of $0.33$ and $0.42$ respectively, SAS achieves $0.88$, Oracle-EFC and Estimated-EFC both yield $0.94$, and Oracle-EFC/$D_{\text{task}}$ saturates near the top of the range (the paper reports $R^2$ up to $0.99$). This near-perfect fit is consistent across held-out and prospective validation sets.
Isolating Feedback Quality
A strictly controlled matched-budget intervention demonstrates that when raw cost and tool calls are fixed, merely improving feedback quality (i.e., raising per-event EFC) results in dramatic increases in success rate, from $0.27$ to $0.90$. This eliminates the confound that apparent improvements arise simply from greater expenditure, underscoring that how computation is spent, not how much, is causative.
Trace-Time Estimation and Transfer
Estimated-EFC, computed using only trace-observable proxies (checker events, plan updates, memory references), recovers the oracle EFC signal even before task outcomes are known. Task normalization further closes the gap between estimated and oracle coordinates, nearly eliminating the mismatch.
Harness Design and Task Demand
Decomposition reveals two orthogonal governing factors:
- Harness Efficiency: This measures the conversion rate from raw compute to effective feedback. Empirically, this conversion is sensitive to harness modules (router quality, verifier strength, memory fidelity) and robustly predicts success.
- Task Demand ($D_{\text{task}}$): The absolute EFC required for high task success depends on the feedback complexity inherent in each task (reasoning steps, tool selection entropy, state tracking, observation noise, verifier visibility). Calibrated normalization is necessary for mixed task families, with learned weights outperforming fixed designs in transfer.
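The two factors can be combined into a single normalized coordinate, as in the sketch below. All of it is illustrative: the ratio form of efficiency follows the "conversion rate" description, while the linear demand model, the feature names, and the weights are assumptions introduced here and are not taken from the paper.

```python
from typing import Mapping


def harness_efficiency(run_efc: float, raw_compute: float) -> float:
    """Effective feedback obtained per unit of raw compute."""
    if raw_compute <= 0:
        raise ValueError("raw_compute must be positive")
    return run_efc / raw_compute


def task_demand(
    features: Mapping[str, float],
    weights: Mapping[str, float],
    bias: float = 0.0,
) -> float:
    """Hypothetical linear demand model: bias + sum of weight * feature."""
    missing = set(weights) - set(features)
    if missing:
        raise KeyError(f"missing task features: {sorted(missing)}")
    demand = bias + sum(weights[name] * features[name] for name in weights)
    if demand <= 0:
        raise ValueError("task demand must be positive")
    return demand


def normalized_coordinate(raw_compute: float, efficiency: float, demand: float) -> float:
    """EFC / D_task rewritten as efficiency * raw_compute / demand."""
    return efficiency * raw_compute / demand


if __name__ == "__main__":
    eff = harness_efficiency(run_efc=12.0, raw_compute=400.0)
    demand = task_demand(
        features={"reasoning_steps": 6.0, "tool_entropy": 1.5},
        weights={"reasoning_steps": 0.5, "tool_entropy": 2.0},
    )
    print(normalized_coordinate(400.0, eff, demand))
```

Under this reading, a harness change moves the efficiency term and a task change moves the demand term, so the same raw budget can land at very different points on the normalized coordinate.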
Generalization and Prospective Validation
The EFC framework retains its predictive advantage on entirely held-out axes (task family, harness, model, and combined splits). On real, heterogeneous benchmarks, NRS-EFC remains the strongest predictor: raw compute variables show the weakest fit, SAS achieves a moderate fit, and NRS-EFC/$D_{\text{task}}$ the strongest. Slice-specific efficiency analysis reveals that harness performance is non-uniform across task types, evidencing a substantial interaction between harness design and task structure. In a prospective batch of unseen real traces, the prespecified NRS-EFC/$D_{\text{task}}$ coordinate again achieves the highest held-out fit.
Theoretical Implications and Future Directions
This research fundamentally reframes agentic scaling: test-time scaling laws for agent harnesses are not governed by aggregate computation but by the rate and sufficiency of durable, actionable feedback. It establishes EFC not only as a superior predictor but as a mechanistic driver of harness performance—resolving a major ambiguity in test-time agent science.
Practically, this provides a rigorous basis for (1) targeted harness design (optimizing routing, verification, and memory mechanisms), (2) adaptive budget allocation (dynamically prioritizing effective feedback extraction), and (3) cross-task evaluation (through calibrated task-demand normalization). Theoretically, it links agent-system performance to closed-loop feedback dynamics rather than open-loop expenditure, suggesting that future advances should focus less on unconditional scaling and more on improving agent-environment signal transduction and retention.
Future research should extend EFC estimation to broader, open-ended environments and explore its integration as an optimization target for online harness adaptation. Refining the scaling coordinate to better reflect real-world ambiguity and incorporating more sophisticated task demand predictors may further improve generalization and interpretability.
Conclusion
This work identifies Effective Feedback Compute as the principal scaling coordinate for agent harnesses and separates the contributions of feedback efficiency and task demand. The results challenge the sufficiency of raw compute and token-based predictors, providing a formal basis for analyzing and improving agentic systems via the mechanism of effective feedback accumulation and exploitation. The EFC paradigm is poised to significantly inform future agent-system scaling research by shifting the focus from expenditure-centric to feedback-centric test-time scaling laws.
Reference:
"Scaling Laws for Agent Harnesses via Effective Feedback Compute" (2605.29682)