Evolutionary Test-Time Scaling

Updated 4 July 2026

Evolutionary test-time scaling is a method that enhances model inference by iteratively refining decisions using operators like mutation, selection, and memory updates.
It reallocates extra compute during deployment to search and improve candidate reasoning without modifying base model weights.
The approach spans diverse applications—from text and code to image generation and autonomous systems—offering practical improvements under real-world constraints.

Evolutionary test-time scaling denotes a family of inference-time methods that improve model behavior by allocating additional computation during deployment to iterative search, refinement, selection, memory update, or policy adaptation, rather than by enlarging training compute or changing the base model weights in the conventional way. In the recent literature, this idea appears in several closely related forms: population-based evolution over reasoning traces, episode-to-episode evolution of agent configurations, latent-space self-evolution, revision-and-verification loops, and trajectory search over generative or physical world-model states (He et al., 15 Oct 2025, Zhang et al., 22 Dec 2025, Zhang et al., 29 Sep 2025, He et al., 23 May 2025, Hashash et al., 22 Jun 2026). The unifying premise is that capability can be unlocked at inference time by treating reasoning or decision making as a structured dynamical process rather than a single forward pass.

1. Conceptual scope and relation to test-time scaling

Test-time scaling (TTS) is defined as improving model performance at inference time by spending additional test-time resources on reasoning, in contrast to training-time scaling laws over model size, dataset size, and training compute (Zhao et al., 23 Sep 2025). A structured survey organizes TTS into sampling-based, search-based, and trajectory optimization strategies, while a large comparative study further separates parallel scaling, sequential scaling, hybrid or meta scaling, and internal scaling (Chung et al., 5 Jun 2025, Agarwal et al., 1 Dec 2025). Within that broader field, evolutionary TTS is the subset in which extra inference compute is not merely spent on longer output or more samples, but on iterative improvement of a state that persists across steps, candidates, or episodes.

This persistence can take different forms. In some methods, the evolving object is a population of candidate solutions; in others, it is an agentic configuration, a latent control vector, a verifier-guided search frontier, or a posterior policy over actions (Zhang et al., 22 Dec 2025, He et al., 15 Oct 2025, Zhang et al., 29 Sep 2025, Tran et al., 24 May 2026, Hashash et al., 22 Jun 2026). This suggests that “evolutionary” in current usage is broader than classical genetic algorithms: it includes any test-time procedure that repeatedly generates alternatives, evaluates them, preserves useful structure, and reuses it to guide later computation.

A second conceptual distinction concerns what is being optimized. A system-oriented analysis argues that the dominant framing of TTS as a compute-optimal Pareto frontier is too narrow, because compute-optimal does not necessarily imply system-optimal; real deployments are constrained by latency, memory footprint, interconnect overhead, and cost-per-token (Zhao et al., 23 Sep 2025). Evolutionary TTS therefore sits at the intersection of reasoning methodology and inference systems design: it is about how additional computation is organized, not just how much is consumed.

2. Canonical evolutionary formulations

Several papers make the evolutionary structure explicit by specifying the state that evolves and the operator that updates it.

Setting	Evolving object	Characteristic operator
Population-Evolve (Zhang et al., 22 Dec 2025)	Population of reasoning traces $G^{(i)}$	Evolve prompt plus majority voting
EvoTest (He et al., 15 Oct 2025)	Agent configuration $\chi=(p,M,h,u)$	Transcript-conditioned mutation plus UCB
LatentEvolve (Zhang et al., 29 Sep 2025)	Latent sequence and memory buffer $\mathcal{M}$	Daytime retrieval and nighttime consolidation
EvoScale (Zeng et al., 29 May 2025)	Population of code patches $\mathcal{Y}^t$	Selection and conditional regeneration
EvoSearch (He et al., 23 May 2025)	Population of denoising states	Tournament selection, elitism, mutation

Population-Evolve provides one of the clearest abstractions. It defines a general TTS system as

$\mathcal{M}=\langle P,T,\mathcal{F}_{\phi},\mathcal{S}\rangle,$

with population size $P$, evolution iterations $T$, evolutionary operators $\mathcal{F}_{\phi}$, and a final selection operator $\mathcal{S}$ (Zhang et al., 22 Dec 2025). For a query $q$, the method samples an initial population $G^{(0)}$, iteratively updates it by conditioning on the current population through an evolve prompt,

$G^{(i+1)}=\mathcal{F}_{\phi}\big(q,\,G^{(i)}\big),\qquad i=0,\dots,T-1,$

and returns a final answer by majority voting over the last generation. The paper interprets this as a unification framework in which GenSelect is parallel but non-iterative, DSER is serial evolution with $P=1$, and Population-Evolve combines parallel sampling with iterative evolution (Zhang et al., 22 Dec 2025).
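The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `sample` and `evolve` are hypothetical stand-ins for model calls, a reasoning trace is reduced to its final answer string, and the toy functions at the bottom only imitate an evolve prompt.

```python
import random
from collections import Counter
from typing import Callable, List

# Assumption: a "trace" is reduced here to its final answer string.
Trace = str


def population_evolve(
    query: str,
    sample: Callable[[str], Trace],
    evolve: Callable[[str, List[Trace]], Trace],
    P: int,
    T: int,
) -> Trace:
    """Run T evolution rounds over a population of size P, then majority-vote."""
    population = [sample(query) for _ in range(P)]
    for _ in range(T):
        # Each new member is generated conditioned on the whole current population.
        population = [evolve(query, population) for _ in range(P)]
    # Final selection operator S: majority vote over the last generation.
    return Counter(population).most_common(1)[0][0]


# Toy stand-ins for the model calls (hypothetical, not a real LLM API).
def toy_sample(query: str) -> Trace:
    return random.choice(["4", "5", "4", "3"])


def toy_evolve(query: str, population: List[Trace]) -> Trace:
    # Imitate an evolve prompt by drifting toward the current majority answer.
    majority = Counter(population).most_common(1)[0][0]
    return majority if random.random() < 0.8 else random.choice(population)


random.seed(0)
answer = population_evolve("2 + 2 = ?", toy_sample, toy_evolve, P=8, T=3)
```

Setting `T=0` in this sketch reduces it to parallel sampling followed by voting, and setting `P=1` reduces it to a serial refinement chain, which mirrors the unification described in the text.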

EvoTest extends the same logic to agentic systems. Instead of evolving model weights, it evolves a holistic configuration

$\chi=(p,M,h,u),$

where $p$ is the policy prompt, $M$ is deployment-time memory, $h$ denotes hyperparameters, and $u$ denotes tool-use routines (He et al., 15 Oct 2025). The update rule is

$\chi=(p,M,h,u)$ 8

instantiated as transcript-conditioned evolution over $\chi$. An Actor Agent executes one episode; an Evolver Agent then analyzes the transcript, proposes mutated child configurations, updates memory, and selects the next configuration by an upper-confidence-bound (UCB) rule. The crucial point is that prompts, memory, exploration parameters, and tool-use policy all become mutable inference-time objects (He et al., 15 Oct 2025).
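To make the selection step concrete, the following sketch applies the textbook UCB1 rule to a small set of candidate configurations. It is an assumption-laden illustration: the `Arm` container, the exploration constant `c`, and the reward values are hypothetical, and the exact bonus term used by EvoTest may differ from UCB1.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Arm:
    """Running statistics for one candidate configuration (hypothetical container)."""
    name: str
    pulls: int = 0
    total_reward: float = 0.0

    @property
    def mean(self) -> float:
        return self.total_reward / self.pulls if self.pulls else 0.0


def ucb_select(arms: List[Arm], c: float = 1.0) -> Arm:
    """Textbook UCB1: try untested arms first, then maximize mean plus bonus."""
    for arm in arms:
        if arm.pulls == 0:
            return arm
    total = sum(a.pulls for a in arms)
    return max(
        arms,
        key=lambda a: a.mean + c * math.sqrt(2.0 * math.log(total) / a.pulls),
    )


def record(arm: Arm, episode_reward: float) -> None:
    """Update an arm's statistics after one episode."""
    arm.pulls += 1
    arm.total_reward += episode_reward


arms = [Arm("parent"), Arm("child_a"), Arm("child_b")]
record(arms[0], 10.0)
record(arms[1], 12.0)
first = ucb_select(arms)   # child_b has never been tried, so it is explored
record(arms[2], 3.0)
second = ucb_select(arms)  # equal pull counts, so the highest mean wins
```

The bonus term shrinks as a configuration is tried more often, so a mutated child that scored poorly once can still be revisited before it is abandoned.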

LatentEvolve moves the evolving object into latent space. It stores successful experiences as triplets in a memory buffer $\mathcal{M}$, retrieves the top-$k$ neighbors for a new query, transfers historical optimization “momentum,” and refines the latent state by self-supervised optimization (Zhang et al., 29 Sep 2025). Its initialization is

$\mathcal{M}$ 3

followed by iterative updates of the latent state. The framework alternates between “daytime scaling,” which retrieves and refines quickly, and “nighttime scaling,” which consolidates experience into a latent weaver trained to approximate refined latent trajectories (Zhang et al., 29 Sep 2025).

3. Search, revision, and backtracking as evolutionary operators

A broader class of methods does not always use explicit evolutionary vocabulary, but operationally performs the same functions of mutation, selection, and reuse.

Step-level verifier-guided hybrid TTS is an example of fine-grained evolutionary search. It combines parallel scaling via Best-of-$N$, sequential scaling via conditional step-level self-refinement, and tree-search or MCTS-style selection, all at the level of individual reasoning steps rather than full solutions (Chang et al., 21 Jul 2025). The process reward model (PRM) determines whether to refine, whether to accept a rewrite, and when to stop. The method uses a PUCT-style selection rule

$\mathcal{M}$ 6

with $\mathcal{M}$ 7 and $\mathcal{M}$ 8 (Chang et al., 21 Jul 2025). This is evolutionary TTS in the sense that candidate prefixes are repeatedly sampled, scored, locally edited, and propagated forward only if they improve.
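For orientation, the sketch below computes the widely used AlphaZero-style PUCT score, $Q + c_{\text{puct}}\,P\,\sqrt{\sum_b N_b}/(1+N)$, over a handful of candidate steps. It is only an illustration of the general rule: the paper's exact variant and constants may differ, and the values, priors, visit counts, and `c_puct` below are hypothetical.

```python
import math
from typing import List


def puct_scores(
    q: List[float],       # value estimate per candidate step (e.g. a PRM score)
    prior: List[float],   # prior probability per candidate step
    visits: List[int],    # visit count per candidate step
    c_puct: float = 1.0,  # exploration constant (assumed value)
) -> List[float]:
    """AlphaZero-style PUCT: exploitation term plus prior-weighted exploration."""
    total_visits = sum(visits)
    return [
        qi + c_puct * pi * math.sqrt(total_visits) / (1 + ni)
        for qi, pi, ni in zip(q, prior, visits)
    ]


def puct_select(q: List[float], prior: List[float], visits: List[int],
                c_puct: float = 1.0) -> int:
    """Return the index of the candidate with the highest PUCT score."""
    scores = puct_scores(q, prior, visits, c_puct)
    return max(range(len(scores)), key=lambda i: scores[i])


# Three candidate next steps: one well explored, one lightly explored, one new.
chosen = puct_select(
    q=[0.6, 0.5, 0.0],
    prior=[0.5, 0.3, 0.2],
    visits=[10, 2, 0],
    c_puct=1.5,
)
```

In this toy setting the unvisited candidate is chosen despite its zero value estimate, because the exploration term dominates until a candidate has accumulated visits.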

“Beyond the Frontier” generalizes this idea by criticizing frontier-only PRM-guided search. It argues that beam-search-like and frontier-only SMC methods make PRM mis-rankings irreversible, causing premature commitment, diversity collapse, and loss of promising prefixes (Tran et al., 24 May 2026). The proposed remedy is stochastic backtracking over a persistent pool of historical prefixes. Subpool Selection performs Top-$k$ selection inside random subpools to let older prefixes bypass over-scored frontier candidates, and Power Backtrack Sequential Monte Carlo resamples from the whole historical pool using powered PRM scores and mixture-corrected weights (Tran et al., 24 May 2026). This makes evolutionary search memoryful: previously discarded states remain eligible for future compute.
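The subpool idea can be illustrated with a short sketch. It is a simplified reading of the description above, not the paper's algorithm: the `Prefix` tuple, the number and size of subpools, and the example scores are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

# Assumption: a prefix is represented as (text, PRM score).
Prefix = Tuple[str, float]


def subpool_select(
    pool: List[Prefix],
    num_subpools: int,
    subpool_size: int,
    k: int,
    rng: random.Random,
) -> List[Prefix]:
    """Pick the top-k prefixes inside each random subpool of the historical pool."""
    selected: List[Prefix] = []
    seen = set()
    for _ in range(num_subpools):
        subpool = rng.sample(pool, min(subpool_size, len(pool)))
        top = sorted(subpool, key=lambda p: p[1], reverse=True)[:k]
        for prefix in top:
            if prefix[0] not in seen:  # avoid expanding the same prefix twice
                seen.add(prefix[0])
                selected.append(prefix)
    return selected


# Historical pool: older prefixes plus the current frontier (hypothetical scores).
pool = [
    ("old_1", 0.55), ("old_2", 0.62), ("old_3", 0.40),
    ("frontier_1", 0.95), ("frontier_2", 0.90), ("frontier_3", 0.30),
]
rng = random.Random(0)
to_expand = subpool_select(pool, num_subpools=3, subpool_size=3, k=1, rng=rng)
```

Because each subpool sees only part of the pool, a prefix with a moderate score can win its subpool whenever the highest-scored frontier candidates are not drawn, which is how older states stay eligible for further compute.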

REVES pushes the same logic into training. It treats revision-based TTS as a multi-step objective and argues that one-shot RLHF- or GRPO-style training is misaligned with deployment under sequential revision (Liu et al., 17 Jun 2026). Its core decomposition expresses sequential-revision performance as a weighted sum of local one-step recovery probabilities over visited states,

$\mathcal{Y}^t$ 0

and uses successful recovery trajectories to construct revision and verification prompts from “near-miss” intermediate errors (Liu et al., 17 Jun 2026). This suggests that evolutionary TTS need not remain purely an inference-time procedure; it can also become a training target that compresses iterative search behavior into the policy.

4. Domain expansion beyond text reasoning

Evolutionary TTS is not confined to mathematical chain-of-thought.

In software engineering, EvoScale formulates patch generation as iterative selection and mutation over code candidates (Zeng et al., 29 May 2025). Starting from

$\mathcal{Y}^t$ 1

it repeatedly selects an elite set $\mathcal{Y}^t$ 2 and regenerates the next population conditioned on those elites,

$\mathcal{Y}^t$ 3

The reinforcement-learning objective uses potential-based shaping with

$\mathcal{Y}^t$ 4

so that the model learns to improve over prior patches rather than merely sample independently (Zeng et al., 29 May 2025). On SWE-Bench-Verified, Satori-SWE-32B reaches 41.6 at Best@50, while Llama3-SWE-RL-70B reaches 41.0 at Best@500, a tenfold larger sample budget. The paper also reports runtimes of 92.8s for unit tests, 18.1s for reward-model selection, and 16.6s for self-evolution (Zeng et al., 29 May 2025).

In image and video generation, EvoSearch recasts test-time scaling for diffusion and flow models as evolutionary search over denoising trajectories (He et al., 23 May 2025). It maintains a population of latent states, evaluates them by the reward of fully denoised outputs, preserves elites, performs tournament selection, and mutates either the initial Gaussian noise,

$\mathcal{Y}^t$ 5

or an intermediate denoising state,

$\mathcal{Y}^t$ 6

The paper reports monotonic gains with increasing NFEs, improved diversity, and specific video-generation improvements of 32.8% over Best-of-$N$ on Wan 1.3B and 23.6% over Best-of-$N$ on HunyuanVideo 13B (He et al., 23 May 2025).
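The combination of elitism, tournament selection, and mutation named above can be sketched on plain vectors. This is a generic evolutionary loop under stated assumptions, not EvoSearch itself: the vectors stand in for latent states, `toy_reward` stands in for the reward of a fully denoised output, and the additive Gaussian mutation is a simplification that does not reproduce the paper's mutation operators.

```python
import random
from typing import Callable, List

Vector = List[float]


def tournament(pop: List[Vector], fitness: List[float], k: int,
               rng: random.Random) -> Vector:
    """Draw k random individuals and return the fittest of them."""
    contenders = rng.sample(range(len(pop)), k)
    best = max(contenders, key=lambda i: fitness[i])
    return pop[best]


def mutate(x: Vector, sigma: float, rng: random.Random) -> Vector:
    """Additive Gaussian perturbation (a simplifying assumption)."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]


def next_generation(pop: List[Vector], reward: Callable[[Vector], float],
                    n_elite: int, k: int, sigma: float,
                    rng: random.Random) -> List[Vector]:
    """Keep the n_elite best unchanged; fill the rest with mutated tournament winners."""
    fitness = [reward(x) for x in pop]
    ranked = sorted(range(len(pop)), key=lambda i: fitness[i], reverse=True)
    elites = [pop[i] for i in ranked[:n_elite]]
    children = [
        mutate(tournament(pop, fitness, k, rng), sigma, rng)
        for _ in range(len(pop) - n_elite)
    ]
    return elites + children


def toy_reward(x: Vector) -> float:
    # Hypothetical reward: closeness to the all-ones vector.
    return -sum((xi - 1.0) ** 2 for xi in x)


rng = random.Random(0)
population = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]
best_per_generation = [max(toy_reward(x) for x in population)]
for _ in range(10):
    population = next_generation(population, toy_reward,
                                 n_elite=2, k=3, sigma=0.3, rng=rng)
    best_per_generation.append(max(toy_reward(x) for x in population))
```

Since elites are carried over unchanged, the best reward in this sketch can never decrease from one generation to the next, which is the mechanism behind monotone improvement under a growing evaluation budget.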

In recommendation, prediction merging provides a parallel, consensus-based analogue. Multiple independently trained models or seeds produce predictions $\mathcal{Y}^t$ 9, and the final result is

$\mathcal{M}=\langle P,T,\mathcal{F}_{\phi},\mathcal{S}\rangle,$ 0

The paper explicitly notes that this is not evolutionary in the sense of explicit mutation and selection at test time, but it draws an “evolutionary search analogy” in which random initialization creates a population of models and the ensemble acts as population-level consensus (Lyu et al., 8 Dec 2025). Under the same inference budget, it reports that test-time scaling can outperform parameter scaling on Avazu, Criteo, and KDD12 (Lyu et al., 8 Dec 2025).
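As a rough illustration of prediction merging, the sketch below averages per-example predictions from several independently trained models. The unweighted mean is an assumption chosen as the simplest merging rule; the paper's rule may differ, and the prediction values are hypothetical.

```python
from statistics import fmean
from typing import List


def merge_predictions(predictions: List[List[float]]) -> List[float]:
    """Average the predictions of several models, example by example.

    predictions[m][i] is model m's predicted probability for example i.
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all models must score the same examples")
    return [fmean(per_example) for per_example in zip(*predictions)]


# Three models (e.g. different random seeds) scoring the same two examples.
merged = merge_predictions([
    [0.2, 0.8],
    [0.4, 0.6],
    [0.3, 0.7],
])
```

Each model is run once per request, so the inference cost grows linearly with the number of merged models, which is what makes a comparison against a single larger model at equal budget meaningful.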

In physical AI, active inference is proposed as a test-time scaling law in which surprise triggers deliberative policy updating (Hashash et al., 22 Jun 2026). The surprise-adapted posterior policy is

$\mathcal{M}=\langle P,T,\mathcal{F}_{\phi},\mathcal{S}\rangle,$ 1

so the amount of inference-time reasoning scales with the mismatch between prediction and observation (Hashash et al., 22 Jun 2026). On an autonomous driving task with a jaywalking pedestrian at a green light, the method reports 22.9 test-time reward, versus -21.8 for Q-learning and -18.2 for Bayesian RL, with 100% success in OOD scenarios and >36% improvement in inference efficiency relative to always-on Bayesian planning (Hashash et al., 22 Jun 2026).
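The gating idea, spending deliberation only when an observation is surprising, can be sketched as follows. This is a schematic under strong assumptions and not the paper's posterior: surprise is taken to be the negative log-likelihood of a scalar observation under a Gaussian prediction, and the threshold, the `habitual` and `deliberate` callables, and the numbers are hypothetical.

```python
import math
from typing import Callable


def gaussian_surprise(obs: float, pred_mean: float, pred_std: float) -> float:
    """Negative log-likelihood of obs under N(pred_mean, pred_std^2)."""
    z = (obs - pred_mean) / pred_std
    return 0.5 * z * z + math.log(pred_std * math.sqrt(2.0 * math.pi))


def act(obs: float, pred_mean: float, pred_std: float, threshold: float,
        habitual: Callable[[float], str],
        deliberate: Callable[[float], str]) -> str:
    """Use the cheap policy unless the observation is surprising enough."""
    if gaussian_surprise(obs, pred_mean, pred_std) > threshold:
        return deliberate(obs)  # expensive test-time reasoning
    return habitual(obs)        # cached, low-cost behavior


def cheap_policy(obs: float) -> str:
    return "habitual"


def costly_planner(obs: float) -> str:
    return "deliberate"


expected = act(0.1, pred_mean=0.0, pred_std=1.0, threshold=3.0,
               habitual=cheap_policy, deliberate=costly_planner)
unexpected = act(5.0, pred_mean=0.0, pred_std=1.0, threshold=3.0,
                 habitual=cheap_policy, deliberate=costly_planner)
```

Under this gating, compute is spent in proportion to how often predictions fail, which is one way to read the efficiency gain over always-on planning reported above.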

5. Evaluation, efficiency, and system-aware constraints

The evaluation of evolutionary TTS has expanded from raw accuracy-versus-compute plots to richer notions of efficiency. A system perspective argues that reasoning quality must be considered jointly with average end-to-end latency per request and cost-per-token, where

$\mathcal{M}=\langle P,T,\mathcal{F}_{\phi},\mathcal{S}\rangle,$ 2

On DeepSeek-R1-Distilled-Qwen and S1.1 at 1.5B, 7B, and 14B, evaluated on MATH500 with outputs from 1K to 16K tokens using vLLM, FlashAttention, and 4 NVIDIA GH200-96GB GPUs, speculative decoding consistently reduces latency, while tensor parallelism scales poorly for long-sequence reasoning and can even worsen latency for the 1.5B model (Zhao et al., 23 Sep 2025). The paper’s thesis is that compute-optimal is not necessarily system-optimal.

ARISE addresses the evaluation problem directly. It scores sample-level transitions across scaling steps using

$\mathcal{M}=\langle P,T,\mathcal{F}_{\phi},\mathcal{S}\rangle,$ 3

rewarding wrong $\to$ correct transitions and penalizing correct $\to$ wrong regressions in a token-aware way (Yin et al., 7 Oct 2025). Because degradation under more compute is treated as especially harmful, the metric can become negative, and the paper reports strongly negative ARISE values for gpt-oss-20B and gpt-oss-120B on some tasks (Yin et al., 7 Oct 2025). Dynamic sampling with $\mathcal{M}=\langle P,T,\mathcal{F}_{\phi},\mathcal{S}\rangle,$ 6, $\mathcal{M}=\langle P,T,\mathcal{F}_{\phi},\mathcal{S}\rangle,$ 7, and $\mathcal{M}=\langle P,T,\mathcal{F}_{\phi},\mathcal{S}\rangle,$ 8 reduces variance more efficiently than uniform resampling; the paper reports about 57.5% variance reduction for adaptive sampling versus about 31.4% for naive multiple sampling (Yin et al., 7 Oct 2025).

A large comparative study over eight open-source LLMs and over thirty billion tokens generated reports three broad trends: no single TTS strategy universally dominates; reasoning models split into short-horizon and long-horizon categories; and, for a fixed model type, optimal TTS performance scales monotonically with compute budget (Agarwal et al., 1 Dec 2025). The practical recipe is asymmetric: low-compute settings tend to favor cheap, short-trace strategies such as FFS-k@N or simple decoding, whereas high-compute settings favor majority voting over many samples (Agarwal et al., 1 Dec 2025). This result is directly relevant to evolutionary methods, because it implies that “more elaborate evolution” is not uniformly better; the optimal schedule is model-dependent.

Timely Machine further argues that in agentic settings the correct budget variable is wall-clock time, not generation length, because tool latency decouples tokens from actual elapsed time (Ma et al., 23 Jan 2026). It defines

$\mathcal{M}=\langle P,T,\mathcal{F}_{\phi},\mathcal{S}\rangle,$ 9

and shows that smaller models can outperform larger ones in low-latency regimes by taking more interaction rounds, while larger models dominate when tool latency is high and per-round interaction quality matters more (Ma et al., 23 Jan 2026). In evolutionary TTS for agents, this means the “generation” to be optimized may be a timed interaction policy rather than a token sequence.
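A back-of-the-envelope calculation shows why wall-clock budgets change the comparison. The sketch assumes each interaction round costs one generation plus one tool call, executed sequentially; the budget, per-round generation times, and tool latencies are hypothetical numbers, not measurements from the paper.

```python
def rounds_within_budget(budget_s: float, gen_s: float, tool_s: float) -> int:
    """Number of complete (generate, call tool) rounds that fit in the budget."""
    return int(budget_s // (gen_s + tool_s))


BUDGET_S = 60.0
SMALL_GEN_S = 2.0   # assumed per-round generation time of a small model
LARGE_GEN_S = 8.0   # assumed per-round generation time of a large model

# Fast tools: generation time dominates, so the small model gets many more rounds.
small_fast = rounds_within_budget(BUDGET_S, SMALL_GEN_S, tool_s=1.0)
large_fast = rounds_within_budget(BUDGET_S, LARGE_GEN_S, tool_s=1.0)

# Slow tools: tool latency dominates, so both models get the same number of rounds.
small_slow = rounds_within_budget(BUDGET_S, SMALL_GEN_S, tool_s=30.0)
large_slow = rounds_within_budget(BUDGET_S, LARGE_GEN_S, tool_s=30.0)
```

With these illustrative numbers the small model gets 20 rounds against 6 when tools are fast, but both models get a single round when each tool call takes 30 seconds, at which point per-round quality is the only remaining lever.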

6. Controversies, failure modes, and open directions

A major controversy concerns what should count as genuine test-time scaling. An analysis of “simple test-time scaling” argues that the apparent scaling curve in s1-style methods is largely produced by scaling down through maximum-length truncation, not by a learned ability to scale computation upward (Wu, 19 Jul 2025). The paper finds that fine-tuning on long chain-of-thought data has no significant impact on the scaling behavior, and that appending "Wait" produces oscillation and repetition rather than monotonic improvement. Under temperature 0.7, for example, r1-distill-Qwen-32B shows 96.7% answer repetition after the first and second "Wait" insertions, with 46.7% response repetition after the second (Wu, 19 Jul 2025). The broader implication is that a monotone-looking accuracy curve is not sufficient evidence of real evolutionary scaling.

A second recurrent issue is diversity. A survey argues that reasoning-optimized models often exhibit reduced output variance, which weakens the effectiveness of sampling, search, and evolutionary selection (Chung et al., 5 Jun 2025). Its proposed ADAPT method, a Diversity Aware Prefix fine-Tuning approach, reports 80% accuracy with 32 samples, whereas a distilled baseline requires 256 samples to reach the same threshold, an 8-fold reduction in sampling compute (Chung et al., 5 Jun 2025). This supports a general principle: evolutionary TTS needs a sufficiently varied candidate population for selection pressure to matter.

A third failure mode is overthinking. For LVLMs, a comprehensive study reports that small instruction-following models often benefit the most from TTS, with gains of up to around 30%, but also that LVLMs “lose focus when given more compute than necessary” (Sammani et al., 27 Jun 2026). Attention analysis shows that visual evidence is consumed early, after which the chain is dominated by text-only reasoning; late image-token KV dropping has little effect after about 200 generated tokens (Sammani et al., 27 Jun 2026). Related work on small VLMs therefore emphasizes efficient token-level aggregation and episodic test-time adaptation rather than expensive answer-level self-consistency (Kaya et al., 3 Oct 2025). This suggests that evolutionary TTS must remain modality-aware: in some regimes, longer evolution enhances reasoning; in others, it amplifies drift.

Current directions converge on three themes. First, system co-design is becoming integral: the relevant frontier is quality under realistic latency, token cost, memory, and hardware constraints (Zhao et al., 23 Sep 2025). Second, training is being aligned to iterative inference dynamics through revision, verification, or latent consolidation rather than static pass@1 optimization (Liu et al., 17 Jun 2026, Zhang et al., 29 Sep 2025). Third, evaluation is moving toward metrics and benchmarks that explicitly measure sequential improvement, negative scaling, and temporal adaptation, including J-TTL, ARISE, and Timely-Eval (He et al., 15 Oct 2025, Yin et al., 7 Oct 2025, Ma et al., 23 Jan 2026). Taken together, these developments indicate that evolutionary test-time scaling is no longer a narrow synonym for “sample more.” It is increasingly understood as the design of inference procedures that preserve and exploit structure across candidates, steps, episodes, or interactions under deployment-realistic constraints.