Scaling Laws for Agent Harnesses via Effective Feedback Compute

Published 28 May 2026 in cs.CL | (2605.29682v1)

Abstract: Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure -- tokens, tool calls, operations, wall time, or cost -- which does not distinguish useful feedback from redundant or unstable interaction. We introduce Effective Feedback Compute (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^2=0.33$ and $0.42$), SAS reaches $0.88$, while Oracle-EFC and Estimated-EFC reach $0.94$ and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises success from $0.27$ to $0.90$ while raw cost and tool calls are fixed. On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^2=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^2=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.

Authors (5)

Summary

The paper proposes Effective Feedback Compute (EFC) as the primary scaling coordinate for agent harnesses; in controlled scaling, task-normalized Oracle-EFC reaches R² = 0.99, versus 0.33 and 0.42 for raw tokens and tool calls.
By isolating feedback quality from expenditure, the study shows that raising per-event EFC at a fixed raw cost and tool-call budget lifts success rates from 0.27 to 0.90 in matched-budget experiments.
The research underscores that both harness efficiency and task-demand normalization are critical for optimizing agent system design and ensuring robust performance across diverse tasks.

Introduction and Motivation

The paper "Scaling Laws for Agent Harnesses via Effective Feedback Compute" (2605.29682) proposes a new scaling variable for test-time scaling of agent harnesses: Effective Feedback Compute (EFC). Existing analyses of interactive LLM systems usually parameterize scaling by raw expenditure, such as tokens generated, tool calls, or wall time. These proxies do not capture whether the agent's interactions yield informative, valid, non-redundant, and retained feedback, which is what drives iterative refinement and ultimately determines whether a solution succeeds. The study formalizes EFC and evaluates it as the principal predictor of harness-level performance across diverse agent tasks and system architectures.

Definitions and Formalization of EFC

EFC is introduced as a trace-level scalar that credits only those feedback events which (1) reveal task-relevant information, (2) are grounded in reliable evidence, (3) are non-redundant with respect to prior trajectory states, and (4) materially update the agent’s internal plan, memory, or candidate solution. This selectivity distinguishes EFC from aggregate raw expenditure by emphasizing interaction quality and downstream utility over quantity.

The score assigned to each feedback event $e_t$ is the product of bounded informativeness, validity, non-redundancy, and memory-retention factors, $EFC_t = k \cdot I_t V_t R_t M_t$, where $k$ is a scaling constant. The run-level EFC aggregates these event scores, and a task-demand normalization $EFC/D_{task}$ allows direct comparison across tasks with different feedback requirements. For real execution traces, the authors introduce Non-Redundant Stable EFC (NRS-EFC), which discounts redundant or ephemeral observations so that only durable, impactful feedback is credited.
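The product form can be illustrated with a minimal Python sketch. It assumes that each factor lies in $[0, 1]$ (the source says only "bounded") and that the run-level aggregate is a plain sum (the source says only "aggregates"). The names `FeedbackEvent`, `event_efc`, `run_efc`, and `normalized_efc` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FeedbackEvent:
    informativeness: float  # I_t, assumed to lie in [0, 1]
    validity: float         # V_t, assumed to lie in [0, 1]
    non_redundancy: float   # R_t, assumed to lie in [0, 1]
    retention: float        # M_t, assumed to lie in [0, 1]


def event_efc(event: FeedbackEvent, k: float = 1.0) -> float:
    """EFC_t = k * I_t * V_t * R_t * M_t for a single feedback event."""
    factors = (event.informativeness, event.validity,
               event.non_redundancy, event.retention)
    if any(not 0.0 <= f <= 1.0 for f in factors):
        raise ValueError("each factor is assumed to lie in [0, 1]")
    score = k
    for f in factors:
        score *= f
    return score


def run_efc(events: Iterable[FeedbackEvent], k: float = 1.0) -> float:
    """Run-level EFC, assumed here to be the sum of event-level scores."""
    return sum(event_efc(e, k) for e in events)


def normalized_efc(events: Iterable[FeedbackEvent], d_task: float,
                   k: float = 1.0) -> float:
    """Task-normalized coordinate EFC / D_task."""
    if d_task <= 0:
        raise ValueError("d_task must be positive")
    return run_efc(events, k) / d_task
```

Because the score is a product, a single zero factor (for example, an invalid observation with $V_t = 0$) removes all credit for that event regardless of how informative it appeared.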

Experimental Methodology

The evaluation covers a spectrum of task domains: synthetic tasks with precise oracle feedback, semi-realistic executable code tasks, and real benchmark slices (including HumanEval, Terminal-Bench, and SWE-bench Verified), culminating in prospective validation on unseen real traces. Harnesses ranging from direct-answer, checklist-verify, routed-tools, and stateful-memory designs to deep closed-loop mechanisms are paired with state-of-the-art base models. Repeated runs and model sweeps control for stochasticity.

Several scalar predictors are compared: raw tokens, tool calls, wall time, operations, raw cost, and a strong multivariate baseline (SAS). Each coordinate is scored by how well it explains empirical failure rates (R² and MAE) under a shared power-law scaling model.
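The scoring procedure can be sketched as follows. The summary does not give the exact functional form or fitting method, so this example assumes a two-parameter law, failure ≈ a · x^(-b), fitted by ordinary least squares in log-log space, with R² and MAE computed on the untransformed failure rates. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float],
                  failure_rates: Sequence[float]) -> Tuple[float, float]:
    """Fit failure ~ a * x**(-b) by least squares on (log x, log failure).

    All inputs must be strictly positive. Returns (a, b).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(f) for f in failure_rates]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def r2_and_mae(xs: Sequence[float], failure_rates: Sequence[float],
               a: float, b: float) -> Tuple[float, float]:
    """R^2 and mean absolute error of the fitted law, in linear space."""
    preds = [a * x ** (-b) for x in xs]
    n = len(preds)
    mean_f = sum(failure_rates) / n
    ss_res = sum((f - p) ** 2 for f, p in zip(failure_rates, preds))
    ss_tot = sum((f - mean_f) ** 2 for f in failure_rates)
    if ss_tot == 0:
        raise ValueError("failure rates must not all be equal")
    mae = sum(abs(f - p) for f, p in zip(failure_rates, preds)) / n
    return 1.0 - ss_res / ss_tot, mae
```

Under this definition R² is negative whenever the fitted curve predicts worse than the mean failure rate, which is how a coordinate can show a "negative fit".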

Empirical Findings

EFC vs. Raw Compute

Across all domains, EFC-based coordinates, particularly task-normalized EFC ($EFC/D_{task}$), fit observed failure rates substantially better than any raw-compute baseline and better than the multivariate SAS predictor. For example, on synthetic tasks, raw tokens and tool calls reach $R^2$ of $0.33$ and $0.42$ respectively, SAS achieves $0.88$, Oracle-EFC and Estimated-EFC both reach $0.94$, and Oracle-EFC/$D_{task}$ reaches $0.99$. The advantage of EFC-based coordinates persists on held-out and prospective validation sets.

Isolating Feedback Quality

A matched-budget intervention shows that, with raw cost and tool calls held fixed, improving feedback quality alone (i.e., raising per-event EFC) increases the success rate from $0.27$ to $0.90$. This removes the confound that apparent improvements simply reflect greater expenditure: what matters is how the computation is spent, not how much.

Trace-Time Estimation and Transfer

Estimated-EFC, computed using only trace-observable proxies (checker events, plan updates, memory references), recovers the oracle EFC signal even before task outcomes are known. Task normalization further closes the gap between estimated and oracle coordinates, nearly eliminating the mismatch.
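A trace-time estimator of this kind might look like the sketch below. It is not the paper's estimator. It assumes binary proxies, with a checker event standing in for validity, a plan update for informativeness, a later memory reference for retention, and exact-match novelty for non-redundancy. `TraceEvent` and `estimated_efc` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TraceEvent:
    observation: str         # raw feedback text as logged in the trace
    checker_fired: bool      # assumed proxy for validity
    plan_updated: bool       # assumed proxy for informativeness
    memory_referenced: bool  # assumed proxy for retention


def estimated_efc(trace: Iterable[TraceEvent]) -> float:
    """Credit one unit per event that passes every proxy and is novel.

    Novelty is approximated by exact string match against observations
    seen earlier in the same trace (a crude stand-in for non-redundancy).
    """
    seen = set()
    total = 0.0
    for event in trace:
        novel = event.observation not in seen
        seen.add(event.observation)
        if (novel and event.checker_fired and event.plan_updated
                and event.memory_referenced):
            total += 1.0
    return total
```

The estimator reads only fields that exist while the run is in progress, so it can be evaluated before the task outcome is known.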

Harness Design and Task Demand

Decomposition reveals two orthogonal governing factors:

Harness Efficiency: This measures the conversion rate from raw compute to effective feedback. Empirically, this rate is sensitive to harness modules (router quality, verifier strength, memory fidelity) and robustly predicts success.
Task Demand ($D_{task}$): The absolute EFC required for high task success depends on the feedback complexity inherent in each task (reasoning steps, tool selection entropy, state tracking, observation noise, verifier visibility). Calibrated normalization is necessary for mixed task families, with learned weights outperforming fixed designs in transfer.
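
One way to read the two factors together, offered as a sketch and not as the paper's own notation: write $C_{raw}$ for the raw budget (tokens, tool calls, or cost) and $\eta$ for harness efficiency. Both symbols are assumptions introduced here. The task-normalized coordinate then factors as

$$
\frac{EFC}{D_{task}} \;=\; \underbrace{\frac{EFC}{C_{raw}}}_{\eta} \cdot \frac{C_{raw}}{D_{task}}
$$

At a fixed budget on a fixed task, the coordinate can rise only through $\eta$, which matches the matched-budget intervention. At fixed efficiency, a task with higher $D_{task}$ needs proportionally more raw budget to reach the same coordinate value.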

Generalization and Prospective Validation

The EFC framework retains its predictive advantage on held-out axes (task family, harness, model, and combined splits). On real, heterogeneous benchmarks, NRS-EFC remains the strongest predictor: raw compute variables show near-zero or negative $R^2$, SAS achieves a moderate $R^2$, and NRS-EFC/$D_{task}$ reaches $R^2 = 0.92$. Slice-specific efficiency analysis shows that harness performance is non-uniform across task types, indicating a substantial interaction between harness design and task structure. In a prospective batch of unseen real traces, the prespecified NRS-EFC/$D_{task}$ coordinate again achieves the highest held-out $R^2$ ($0.85$).

Theoretical Implications and Future Directions

This research reframes agentic scaling: the results suggest that test-time scaling of agent harnesses is governed less by aggregate computation than by the rate and sufficiency of durable, actionable feedback. EFC is positioned not only as a better predictor but, given the matched-budget interventions, as a candidate mechanistic driver of harness performance.

Practically, this provides a rigorous basis for (1) targeted harness design (optimizing routing, verification, memory mechanisms), (2) adaptive budget allocation (dynamically prioritizing effective feedback extraction), and (3) cross-task evaluation (through normalization, with calibrated $D_{task}$). Theoretically, it links agent-system performance to closed-loop feedback dynamics rather than open-loop expenditure, suggesting that future advances should focus less on unconditional scaling and more on improving agent-environment signal transduction and retention.

Future research should extend EFC estimation to broader, open-ended environments and explore its use as an optimization target for online harness adaptation. Refining the EFC estimate to better reflect real-world ambiguity and incorporating more sophisticated task demand predictors may further improve generalization and interpretability.

Conclusion

This work identifies Effective Feedback Compute as the principal scaling coordinate for agent harnesses and separates the contributions of feedback efficiency and task demand. The results challenge the sufficiency of raw compute and token-based predictors, providing a formal basis for analyzing and improving agentic systems via the mechanism of effective feedback accumulation and exploitation. The EFC framing shifts the focus of agent-system scaling research from expenditure-centric to feedback-centric test-time scaling laws.

Reference:

"Scaling Laws for Agent Harnesses via Effective Feedback Compute" (2605.29682)
