
EvoTest: Adaptive Framework for Test-Time Learning

Updated 29 December 2025
  • EvoTest frameworks are gradient-free, evolutionary systems that adapt agent configurations at test time without traditional fine-tuning.
  • They employ mutation of prompts, memory, hyperparameters, and tool routines, with UCB bandit selection optimizing performance across episodes.
  • EvoTest demonstrates superior empirical results on benchmarks like Jericho-TTL and HumanEval, outperforming standard RL and fine-tuning methods.

EvoTest frameworks are evolutionary, gradient-free paradigms designed for test-time adaptation of agentic or generative AI systems. They instantiate self-improving capability without parameter fine-tuning, instead leveraging evolutionary mechanisms, configuration mutation, and bandit-based selection across episodes or generations. EvoTest has been pioneered for both LLM agentic configuration learning in sequential-decision domains (He et al., 15 Oct 2025) and for co-evolutionary code generation/test-suite optimization (Duan et al., 2024), with a lineage rooted in earlier code evolution control frameworks (Insa et al., 2017). The EvoTest methodology represents a class of systems that operationalize rapid, evaluation-driven improvement by evolving component configurations, policies, and/or populations via evolutionary operators, semantic analysis, and fitness-based selection.

1. Core Principles and Motivations

EvoTest frameworks are motivated by the limitations of existing test-time learning or adaptation techniques—particularly their struggles with generalization to novel environments or tasks. Classic adaptation schemes such as agent-internal reflection, episodic memory, or RL fine-tuning are insufficient for complex distribution shift or multi-episode improvement targets, as empirically demonstrated on the Jericho Test-Time Learning (J-TTL) benchmark (He et al., 15 Oct 2025). EvoTest reframes the adaptation challenge as an evolutionary process: rather than updating model weights, it iteratively mutates the full agent configuration (prompts, memories, hyperparameters, and tool-use logic) and selects the most successful variants using sample-efficient fitness heuristics.

In program synthesis and code solution selection, EvoTest generalizes to co-evolving two interdependent populations (candidates and tests), leveraging evolutionary search to dynamically drive both correctness and diversity of solutions, as well as test coverage (Duan et al., 2024).

2. Agentic Evolutionary Test-Time Learning: Architecture and Loop

In agentic systems (e.g., LLM playing text games), EvoTest decomposes the adaptation process into two roles:

  • Actor Agent: Executes one full episode using a fixed configuration $\chi = (p, M, h, u)$, where $p$ is the system prompt, $M$ is structured memory, $h$ are inference hyperparameters, and $u$ encodes tool-use routines. The actor outputs an episode trajectory $\tau$ and a scalar return $R$.
  • Evolver Agent: Consumes $(\chi, \tau)$, conducts transcript-level semantic analysis (e.g., with a fixed LLM), and produces a population of mutated "child" configurations via evolutionary operators (prompt mutation, memory update, hyperparameter tuning, tool-routine refinement). The next configuration $\chi'$ is selected using an Upper Confidence Bound (UCB) bandit rule that balances exploitation and exploration. A minimal data-structure sketch of $\chi$ follows below.
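
As a concrete illustration of this decomposition, the configuration $\chi$ can be represented as a simple record. This is a hypothetical sketch; the field names and defaults are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    """One agent configuration chi = (p, M, h, u). Field names are illustrative."""
    prompt: str                                                # p: system prompt
    memory: dict = field(default_factory=dict)                 # M: structured memory
    hyperparams: dict = field(default_factory=lambda: {"temperature": 0.7})  # h
    tool_rules: list = field(default_factory=list)             # u: tool-use routines
```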

Formally, the episode-level update is

$$\chi^{(e+1)} = U\big(\chi^{(e)}, \tau^{(e)}\big)$$

where $U$ is a discrete, non-gradient evolutionary operator.

The loop pseudocode is as follows:

\begin{algorithm}[h]
\caption{EvoTest: Evolutionary Test-Time Learning}
\begin{algorithmic}[1]
\Require number of episodes $E$, initial config $\chi^{(1)}$, UCB constant $\beta$, number of children $m$
\For{$e = 1$ \textbf{to} $E$}
  \State Run the Actor with $\chi^{(e)}$; observe trajectory $\tau^{(e)}$ and return $R^{(e)}$
  \State Update statistics $n(\chi^{(e)})$ and $\hat\mu(\chi^{(e)})$ with $R^{(e)}$
  \For{$i = 1$ \textbf{to} $m$}
    \State Evolver analyzes $(\chi^{(e)}, \tau^{(e)})$ at the transcript level
    \State Sample child $\chi_i \sim \mathcal{M}(\cdot \mid \chi^{(e)}, \tau^{(e)})$
    \State Add $\chi_i$ to the candidate pool $\mathcal{P}$
    \State Initialize $n(\chi_i) \gets 0$
  \EndFor
  \State $\chi^{(e+1)} \gets \arg\max_{\chi \in \mathcal{P}} \mathrm{UCB}(\chi)$
\EndFor
\end{algorithmic}
\end{algorithm}
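
Translating the loop into a minimal, runnable Python sketch: the `run_actor` and `mutate` callables are assumed to be supplied by the caller, all names are illustrative, and the `+1` inside the logarithm is a small stability tweak that deviates slightly from the formula given in Section 3.

```python
import math

def ucb_score(stats, cid, beta=1.0):
    """UCB(chi) = mu_hat(chi) + beta * sqrt(ln(total visits) / (1 + n(chi)));
    log(total + 1) guards the very first selection step."""
    total = sum(s["n"] for s in stats.values())
    s = stats[cid]
    mean = s["sum"] / s["n"] if s["n"] else 0.0
    return mean + beta * math.sqrt(math.log(total + 1) / (1 + s["n"]))

def evotest_loop(initial_config, run_actor, mutate, episodes=20, children=4, beta=1.0):
    """Gradient-free test-time learning: act, analyze, mutate, select by UCB."""
    pool = {0: initial_config}               # candidate configurations
    stats = {0: {"n": 0, "sum": 0.0}}        # visit counts and return sums
    current, next_id = 0, 1
    for _ in range(episodes):
        trajectory, ret = run_actor(pool[current])   # Actor: one full episode
        stats[current]["n"] += 1
        stats[current]["sum"] += ret
        for _ in range(children):                    # Evolver: propose m children
            pool[next_id] = mutate(pool[current], trajectory)
            stats[next_id] = {"n": 0, "sum": 0.0}
            next_id += 1
        current = max(pool, key=lambda cid: ucb_score(stats, cid, beta))
    return pool[current]
```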

3. Configuration Mutation, Evolution Operators, and Selection

EvoTest applies evolutionary variation to the agentic configuration space $\mathcal{X}$, where each $\chi$ comprises $(p, M, h, u)$. Mutation is accomplished as follows:

  • Prompt Mutation: The LLM rewrites the prompt to encode successful strategies ("Walkthrough") and to warn against unsuccessful ones ("Guardrails").
  • Memory Update: The agent parses the trajectory, logging state-action pairs that precede positive returns and negative examples for repeated failures.
  • Hyperparameter Tuning: Temperature and related decoding parameters are tuned based on signs of looping or excessive randomness.
  • Tool-Use Refinement: State extractor and memory-query rules are edited to strengthen recall of successful behaviors or avoid dead-ends.
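
A hedged sketch of these four operators as a single mutation step, using a plain-dict configuration and a stubbed LLM call for brevity. The names, the trajectory format of (state, action, reward) triples, and the specific nudges are illustrative assumptions, not details from the paper:

```python
import random

def mutate(config: dict, trajectory, llm=None) -> dict:
    """Apply one randomly chosen evolutionary operator to a copy of the config."""
    child = {"prompt": config["prompt"],
             "memory": dict(config["memory"]),
             "hyperparams": dict(config["hyperparams"]),
             "tool_rules": list(config["tool_rules"])}
    op = random.choice(["prompt", "memory", "hyperparams", "tools"])
    if op == "prompt" and llm is not None:
        # Prompt mutation: rewrite to encode wins ("Walkthrough") and failures ("Guardrails").
        child["prompt"] = llm(
            f"Rewrite this system prompt given the episode transcript:\n"
            f"{config['prompt']}\n{trajectory}")
    elif op == "memory":
        # Memory update: log state-action pairs that preceded positive returns.
        for state, action, reward in trajectory:
            if reward > 0:
                child["memory"][state] = action
    elif op == "hyperparams":
        # Hyperparameter tuning: nudge temperature to break loops or curb randomness.
        t = child["hyperparams"].get("temperature", 0.7)
        child["hyperparams"]["temperature"] = min(max(t + random.uniform(-0.2, 0.2), 0.0), 1.5)
    else:
        # Tool-use refinement: placeholder edit of state-extractor / memory-query rules.
        child["tool_rules"].append("prefer recall of previously successful branches")
    return child
```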

Children are sampled from a mutation distribution $\mathcal{M}(\cdot \mid \chi^{(e)}, \tau^{(e)})$. The selection step for the next configuration uses the UCB score:

$$\mathrm{UCB}(\chi) = \hat\mu(\chi) + \beta \sqrt{\frac{\ln\!\left(\sum_{\chi'} n(\chi')\right)}{1 + n(\chi)}}$$

where $\hat\mu(\chi)$ is the empirical mean return of $\chi$, $n(\chi)$ its visit count, and $\beta$ the exploration coefficient. This implements a $(1\!+\!m)$-EA with bandit-based selection.
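
For intuition, an illustrative calculation (numbers invented for this example, not from the paper): with $\beta = 1$ and $\sum_{\chi'} n(\chi') = 20$ total episodes, a configuration visited $n = 4$ times with mean return $\hat\mu = 0.6$ scores $0.6 + \sqrt{\ln 20 / 5} \approx 1.37$, while an unvisited child ($n = 0$, $\hat\mu$ taken as 0) scores $\sqrt{\ln 20 / 1} \approx 1.73$ and is selected next; the exploration bonus dominates until a candidate accumulates visits.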

4. Co-Evolutionary Population-Based EvoTest for Code and Test Selection

In the context of automated code solution selection with test co-evolution (Duan et al., 2024), EvoTest instantiates interacting populations:

  • Code Individuals $C = \{c_1, \ldots, c_N\}$, represented as token sequences or ASTs.
  • Test-Case Individuals $T = \{t_1, \ldots, t_M\}$, as input/output pairs.

Fitness for code and test individuals balances correctness with diversity:

$$F_c(c_i) = \alpha_1 \frac{1}{M}\sum_{j=1}^{M} r_{ij} + \beta_1 \frac{1}{N-1} \sum_{k\neq i} \frac{\mathrm{EditDist}(c_i, c_k)}{L_{\max}}$$

$$F_t(t_j) = \alpha_2 \frac{1}{N}\sum_{i=1}^{N} (1 - r_{ij}) + \beta_2 \frac{\mathrm{VarInput}(t_j)}{V_{\max}}$$

where $r_{ij}$ indicates whether code $c_i$ passes test $t_j$, $L_{\max}$ is the maximum code length, and $V_{\max}$ normalizes test-input diversity.
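
A minimal Python sketch of these two fitness functions, assuming a precomputed pass matrix `r` (`r[i][j] = 1` if code $c_i$ passes test $t_j$) and string-typed programs; the Levenshtein helper and the input-variance diversity proxy are illustrative choices, not prescribed by the paper:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def var_input(test_input) -> float:
    """Illustrative diversity proxy: variance of numeric fields in the input."""
    nums = [x for x in test_input if isinstance(x, (int, float))]
    if len(nums) < 2:
        return 0.0
    mean = sum(nums) / len(nums)
    return sum((x - mean) ** 2 for x in nums) / len(nums)

def code_fitness(i, codes, r, alpha1=0.5, beta1=0.5):
    """F_c: mean pass rate plus mean normalized edit distance to peers."""
    M, N = len(r[i]), len(codes)
    l_max = max(len(c) for c in codes)
    correctness = sum(r[i]) / M
    diversity = sum(edit_distance(codes[i], codes[k])
                    for k in range(N) if k != i) / ((N - 1) * l_max)
    return alpha1 * correctness + beta1 * diversity

def test_fitness(j, test_inputs, r, v_max, alpha2=0.5, beta2=0.5):
    """F_t: mean failure rate induced plus normalized input diversity."""
    N = len(r)
    difficulty = sum(1 - r[i][j] for i in range(N)) / N
    return alpha2 * difficulty + beta2 * var_input(test_inputs[j]) / v_max
```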

Selection uses $k$-tournament for both populations. Variation involves:

  • Code Crossover: Subtree or token-index crossovers.
  • Code Mutation: Random operator/constant/variable changes per-token.
  • Test Mutation: Input perturbations or blend crossovers for structured inputs.
  • Dynamic Test Generation: LLMs generate challenging cases targeting low coverage.

Termination occurs when a solution passes all tests or the fitness plateaus.
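
Selection at each generation can then be implemented as a generic $k$-tournament, sketched below (a standard implementation, not code from the paper):

```python
import random

def k_tournament(population, fitnesses, k=3):
    """Return the fittest of k individuals sampled uniformly with replacement."""
    contestants = random.choices(range(len(population)), k=k)
    best = max(contestants, key=lambda idx: fitnesses[idx])
    return population[best]
```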

5. Empirical Performance and Benchmarks

Agentic EvoTest on Jericho-TTL

On the J-TTL benchmark (6 text-adventure games), EvoTest achieves:

  • Consistent AUC improvement: +0.47 for EvoTest (Gemini-2.5) vs. +0.34 for EvoPrompt and +0.30 for GRPO (online RL).
  • Game wins: EvoTest is the only method to achieve full game wins in Detective and Library.
  • Learning curves: Steep, stable increases in episode return; baseline methods stagnate or diverge.

This performance demonstrates the superiority of full-configuration evolution and UCB selection for episodic test-time adaptation in complex agentic domains (He et al., 15 Oct 2025).

Co-evolutionary EvoTest for Code on HumanEval

The evolutionary co-selection approach yields approximately 10% absolute pass@1 improvement across codegen-16B, code-davinci-002, and incoder-6B, systematically outperforming non-evolutionary baselines (AlphaCode, standard pass@k) (Duan et al., 2024).

Representative pass@k results:

| Method    | Model            | pass@1 | pass@2 | pass@10 |
|-----------|------------------|--------|--------|---------|
| Baseline  | codegen-16B      | 29.7   | 50.3   | 73.7    |
| AlphaCode | code-davinci-002 | 55.1   | 64.1   | 84.4    |
| AutoTest  | code-davinci-002 | 64.5   | 74.5   | 85.0    |

The framework also highlights the benefits of dynamic test renewal, diversity-aware fitness, and robust tournament selection.
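
Because these comparisons rest on pass@k, the standard unbiased estimator used in HumanEval-style evaluations is worth recalling; the sketch below implements that well-known formula (it is not code from either paper):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    with n total samples and c samples that pass all tests."""
    if n - c < k:
        return 1.0  # every size-k draw necessarily contains a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```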

6. Lineage: Code Evolution Control Frameworks

Earlier evolutionary control frameworks, such as the Erlang Code Evolution Control system (Insa et al., 2017), focus on behavioral regression via automatic test-suite construction and POI (point of interest) trace comparison. While not labeled EvoTest, such frameworks share the test-driven, mutation-based improvement paradigm:

  • Type/behavior analysis seeds representative input coverage using TypEr, CutEr, PropEr.
  • Instrumentation tracks POI output traces; greedy mutation-based test generation seeks new behavioral coverage.
  • Regression checking reduces to POI trace comparison between versions, minimizing oracle construction and human-labeling requirements.

Key insights shared across these frameworks:

  • Configuration or input diversification prevents premature convergence to suboptimal behaviors or solutions.
  • Test-time learning benefits from direct narrative/trajectory analysis, not only reflection or memory replay.
  • Co-evolutionary interplay (as in AutoTest) systematically exposes corner cases by dynamically evolving both candidate sets and their test or evaluation regimes.

7. Significance, Limitations, and Generalization

EvoTest frameworks establish a class of sample-efficient, non-gradient test-time learning or adaptation mechanisms applicable in agentic, generative, and program synthesis contexts. By evolving agentic or solution configurations under selective pressure from episodic returns or test outcomes, they circumvent the challenges of direct gradient-based adaptation and fragile hand-designed update heuristics.

Limitations include:

  • Evolutionary search may be less efficient than gradient descent if informative mutations are infrequent in the configuration space.
  • Appropriate construction of fitness functions and mutation operators is critical for convergence and solution quality.
  • Applicability may be domain-dependent; behaviors readily expressible as prompt/memory/hyperparameter changes are most amenable.

A plausible implication is that EvoTest-style frameworks are promising for test-time adaptation where gradient-based tuning is infeasible or unsafe (e.g., LLMs without finetuning access, systems with non-differentiable modules), and where rapid behavioral improvement must be demonstrated under sample constraints.
