Scaled Test-Time Compute Optimization

Updated 1 December 2025
  • Scaled Test-Time Compute is a paradigm that dynamically allocates compute during inference to optimize LLM performance under cost and latency constraints.
  • It formalizes inference as a constrained graph optimization problem that balances parallel width and sequential depth through multi-LLM collaboration.
  • Empirical studies show that LLM-augmented probabilistic search, like Agent-REINFORCE, significantly improves accuracy and reduces latency compared to fixed deployment.

Scaled Test-Time Compute refers to the paradigm of dynamically allocating and optimizing computational resources at inference, particularly in LLMs, to maximize downstream performance under explicit cost or latency budgets. Unlike fixed, single-model deployment patterns, modern approaches formalize the search for optimal architectures, model assignments, and collaborative structures as a constrained combinatorial optimization problem. This encompasses parallel, sequential, and hybrid scaling, and allows the automatic discovery of multi-LLM collaboration graphs adapted to the requirements of specific tasks and use cases (Wang et al., 29 Oct 2025).

1. Formal Problem Setup: Multi-LLM Test-Time Scaling as Graph Optimization

The scaled test-time compute problem is formalized as a search over multi-LLM directed acyclic graphs G = (V, E, R, M), where:

  • V = {v_1, ..., v_n} are nodes, each assigned a role r_i ∈ {assistant, fuser} and a model M_i from a pool ℳ.
  • E ⊆ V × V are directed edges encoding information flow (outputs of v_i appended to v_j's input).
  • R and M record the role and model assignment per node.

Inference proceeds topologically: "assistant"-type nodes refine predecessors' outputs; "fuser" nodes aggregate via techniques like voting or LLM fusion. The final output is produced by a unique sink node. Performance is measured by a utility function Accuracy(G) (e.g., accuracy on development queries), subject to a cost function Cost(G) that quantifies resource usage, commonly normalized FLOPs or monetary cost (Wang et al., 29 Oct 2025).

The core optimization is max_G Accuracy(G) subject to Cost(G) ≤ C, with C the compute or budget constraint.
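
As a concrete illustration of this setup, the following sketch encodes a collaboration graph G = (V, E, R, M) with roles, model assignments, topological inference order, and a budget-feasibility check. The node names, the per-call cost model, and the data-structure layout are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    role: str          # "assistant" or "fuser"
    model: str         # model identifier drawn from the pool M

@dataclass
class Graph:
    nodes: list        # V, with role (R) and model (M) assignments attached
    edges: list        # E: (src, dst) name pairs; must form a DAG

    def topological_order(self):
        # Kahn's algorithm: inference visits nodes in this order, so each
        # node sees its predecessors' outputs appended to its input.
        indeg = {n.name: 0 for n in self.nodes}
        for _, dst in self.edges:
            indeg[dst] += 1
        ready = [n for n in self.nodes if indeg[n.name] == 0]
        order = []
        while ready:
            n = ready.pop()
            order.append(n)
            for src, dst in self.edges:
                if src == n.name:
                    indeg[dst] -= 1
                    if indeg[dst] == 0:
                        ready.append(next(m for m in self.nodes if m.name == dst))
        return order

def cost(graph, flops_per_model):
    # Normalized cost: each node pays one call to its assigned model.
    return sum(flops_per_model[n.model] for n in graph.nodes)

def feasible(graph, flops_per_model, budget):
    # The constraint Cost(G) <= C from the optimization problem.
    return cost(graph, flops_per_model) <= budget
```

A two-assistant/one-fuser graph, for example, always ends its topological order at the unique fuser sink, and is feasible only while the summed per-model cost stays within C.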

2. Empirical Insights Governing Optimal TTS Graphs

Pilot empirical studies systematically reveal three guiding insights:

  • Model family dominance: Under fixed compute, tasks often benefit from replicating instances of the strongest single model rather than mixing models. For complex reasoning (e.g., MATH), numerous small models outperform single large ones; for knowledge-intensive tasks (e.g., MMLU), larger models dominate.
  • Non-monotonic width–depth scaling: Both increasing parallel width (independent samples) and sequential depth (refinement steps) show a plateau: marginal gains vanish or degrade beyond a task-specific optimum. Excessive width strains context limits; excessive depth accumulates compounding errors.
  • Interdependence of width and depth: Under fixed resources (w · d = const), increasing parallelism optimally reduces sequential refinements, and vice versa. Search must coordinate both axes rather than optimize them independently.

These patterns inform search initialization, guiding practitioners toward size/family selection and balanced architectural exploration (Wang et al., 29 Oct 2025).
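
The non-monotone width-depth tradeoff can be made concrete with a toy model (entirely illustrative, not from the paper): saturating gains on each axis plus penalties for context strain and compounding refinement errors make the budget-constrained optimum an interior point rather than a pure-width or pure-depth extreme.

```python
def toy_accuracy(width, depth):
    # Assumed saturating gains from parallel width and sequential depth,
    # with penalties modeling context-limit strain (width) and compounding
    # refinement errors (depth). Constants are purely illustrative.
    gain = (1 - 0.5 ** width) * (1 - 0.6 ** depth)
    penalty = 0.02 * max(0, depth - 4) + 0.01 * max(0, width - 8)
    return gain - penalty

def best_width_depth(budget):
    # Coordinate both axes jointly under w * d <= budget, rather than
    # maximizing either axis alone.
    candidates = [(w, d) for w in range(1, budget + 1)
                  for d in range(1, budget + 1) if w * d <= budget]
    return max(candidates, key=lambda wd: toy_accuracy(*wd))
```

Under this toy model, a budget of 16 calls is best spent on a balanced interior configuration; allocating all 16 calls to width alone or depth alone scores strictly worse.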

3. LLM-Augmented Probabilistic Search (Agent-REINFORCE)

The search for compute-optimal collaboration graphs is cast as stochastic policy optimization over the discretized space of graphs:

  • Parameters (θ, π, ψ) encode edge logits p(e_ij), node role probabilities softmax(π_i), and model selection probabilities softmax(ψ_i).
  • The REINFORCE algorithm samples graphs, evaluates their accuracy, and performs policy-gradient-based updates:

∇_θ ≈ (1/N) Σ_{i=1..N} u_i ∇_θ log p_θ(G^(i))

  • Critical innovation: A dedicated LLM agent processes sampled graphs and performance feedback, generating natural-language "gradient-like" instructions ("nudges") that modify (θ, π, ψ) in accordance with empirical insights, accelerating convergence and adapting to semantic structures in the space (Wang et al., 29 Oct 2025).

The search iterates sampling → feedback → numeric/text-guided updates, returning the maximum a posteriori (MAP) graph after convergence.
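
The numeric half of this loop can be sketched as plain REINFORCE over independent Bernoulli edge logits θ; the role/model parameters (π, ψ) and the LLM agent's text "nudges" are omitted here, and the utility callback stands in for dev-set accuracy. All names and constants are assumptions for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_edges(theta, rng):
    # Each candidate edge e_ij is kept independently with probability
    # sigma(theta_ij), i.e., a graph G is sampled from p_theta.
    return [rng.random() < sigmoid(t) for t in theta]

def reinforce_step(theta, utility, rng, n_samples=32, lr=0.5):
    # One update: sample N graphs, score each with the utility u_i, and
    # apply the policy gradient (1/N) sum_i u_i * grad log p_theta(G_i).
    grads = [0.0] * len(theta)
    for _ in range(n_samples):
        edges = sample_edges(theta, rng)
        u = utility(edges)
        for j, on in enumerate(edges):
            p = sigmoid(theta[j])
            # d/d theta_j of log Bernoulli(edge_j | p): (1-p) if on, -p if off
            grads[j] += u * ((1 - p) if on else -p)
    return [t + lr * g / n_samples for t, g in zip(theta, grads)]
```

Repeated steps concentrate probability mass on high-utility structures: if the utility rewards keeping edge 0 and dropping edge 1, the corresponding logits drift apart until the MAP graph matches that pattern.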

4. Joint Optimization of Accuracy and Latency

To capture the practical tradeoff between accuracy and user-facing latency, the objective is scalarized: u(G) = α · Accuracy(G) − β · Latency(G), where α, β are positive weights. The agent-based updates include latency-aware instructions (e.g., "Reduce width/depth to lower end-to-end latency"), resulting in Pareto-optimal graphs. Empirical results demonstrate up to 3x per-query speedup with only a minor drop in accuracy under such joint optimization (Wang et al., 29 Oct 2025).
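
A minimal sketch of the scalarized objective and the accuracy-latency dominance test, with illustrative weights and candidate values (the specific α, β, and tuples are assumptions, not the paper's numbers):

```python
def utility(acc, latency_s, alpha=1.0, beta=0.01):
    # Scalarized objective u(G) = alpha * Accuracy(G) - beta * Latency(G).
    return alpha * acc - beta * latency_s

def pareto_front(candidates):
    # Keep (accuracy, latency) points not dominated by another point that
    # is at least as accurate AND at least as fast.
    front = []
    for acc, lat in candidates:
        dominated = any(a >= acc and l <= lat and (a, l) != (acc, lat)
                        for a, l in candidates)
        if not dominated:
            front.append((acc, lat))
    return front
```

With candidates like (0.61 accuracy, 10 s) versus (0.43, 22 s), the first point both dominates the Pareto comparison and scores higher under any positive (α, β) weighting.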

5. Experimental Evaluation and Performance Frontier

Experiments benchmark Agent-REINFORCE and strong baselines (random search, Bayesian optimization, gradient-only REINFORCE, hybrid LLM search) on tasks including MATH, MMLU, and HumanEval, using a pool of LLaMA-3 and Gemma models across normalized FLOPs/dollar budgets.

Key findings:

  • Agent-REINFORCE achieves 61% accuracy at budget 80 (avg. latency 10s/query), outpacing the best baseline (43% accuracy, 22s latency).
  • Search time is reduced (532s vs. 2935s for the next-best), with Agent-REINFORCE dominating on the accuracy–latency Pareto frontier across resource regimes.
  • Architectural choices identified by the framework generalize stably to new queries, and practitioner-driven initialization (model family → size → instance count) further accelerates search for unseen tasks (Wang et al., 29 Oct 2025).
| Method          | Accuracy (%) | Latency (s/query) | Search Time (s) |
| --------------- | ------------ | ----------------- | --------------- |
| Agent-REINFORCE | 61           | 10                | 532             |
| Best Baseline   | 43           | 22                | 2935            |

6. Practical Guidelines for Scaled TTS Deployment

For effective deployment of scaled test-time compute:

  • Conduct single-model pre-tests to rank model families; focus initial search on the highest-performing family and model size.
  • Budgeting: Use normalized FLOPs or explicit monetary costs, but recognize that maximizing budget use can overshoot the optimal width/depth due to diminishing returns.
  • Interpret LLM agent feedback as actionable graph adjustments—prune or strengthen edges, adjust width/depth, etc.—and combine with numeric gradients.
  • For new tasks, reapply the three-stage initialization (family → size → instance count) to warm-start the search phase.
  • Task-specific architectures are recommended: deeper, smaller-model chains for reasoning, wider, large-model ensembles for knowledge (Wang et al., 29 Oct 2025).
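
The three-stage warm start above (family → size → instance count) can be sketched as one selection heuristic; the pre-test scores, cost table, and budget arithmetic below are assumptions for illustration, not the paper's procedure.

```python
def warm_start(pretest_scores, flops_by_size, budget):
    # Stages 1-2: rank single-model pre-tests over (family, size) pairs and
    # keep the best-scoring pair whose single instance fits the budget.
    fits = {fs: score for fs, score in pretest_scores.items()
            if flops_by_size[fs[1]] <= budget}
    family, size = max(fits, key=fits.get)
    # Stage 3: replicate instances of that model up to the budget
    # (initial parallel width for the search to refine).
    count = budget // flops_by_size[size]
    return family, size, count
```

For example, if the 70B model's single call already exceeds the budget, the warm start falls back to the best-scoring smaller model and fills the budget with replicas of it, matching the guideline that many small-model instances can be the right starting point.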

By formalizing test-time compute scaling as a constrained, LLM-augmented graph optimization guided by empirical regularities in width and depth, scaled TTS achieves substantially higher accuracy and lower latency than previously possible. The Agent-REINFORCE framework provides a principled methodology adaptable to diverse inference, latency, and budget constraints.
