Jericho Test-Time Learning Benchmark
- The benchmark systematically evaluates AI agents in interactive fiction games modeled as POMDPs with combinatorial natural-language actions and delayed rewards.
- It measures an agent’s self-improvement by updating internal states between episodes using methods like memory accumulation, reflection, and evolutionary adaptation.
- Quantitative results show that holistic, multi-channel adaptation (e.g., EvoTest) significantly outperforms simpler strategies in achieving robust performance gains.
The Jericho Test-Time Learning (J-TTL) benchmark is an evaluation framework that systematically measures the ability of AI agents to perform self-improvement and adaptation across multiple consecutive episodes in interactive fiction (IF) environments. Each game in the benchmark is formulated as a partially observable Markov decision process (POMDP) featuring combinatorial natural-language action spaces, multi-step reasoning, and delayed sparse rewards. J-TTL explicitly quantifies an agent's capacity to update its internal components between episodes, leveraging only in-session feedback, without environment alteration or additional pre-training (He et al., 15 Oct 2025).
1. Formal Framework
J-TTL instantiates each IF game in the Jericho suite as a POMDP defined by:
- S: Latent state space, not directly observed by the agent.
- A: Unbounded, combinatorial natural-language command space.
- T(s′|s,a): State transition dynamics.
- R(s,a) ∈ ℝ: Scalar reward function, determined by score changes after each action.
- Ω(o|s): Emission distribution yielding the (lowercased) textual description received by the agent.
- H: Maximum step horizon per episode (default 110).
Running episode k produces a trajectory τ_k (the sequence of observations, actions, and rewards) and a total return R_k equal to the sum of per-step rewards. A session comprises K episodes (default K = 50), each beginning from the game's fixed initial state s_0. After episode k, the agent updates its internal state θ_k according to a test-time update rule θ_{k+1} = U(θ_k, τ_k), using only feedback collected within the session.
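As a concrete illustration, here is a minimal sketch of this session protocol under the formulation above; `agent_state.act` and `update_rule` are hypothetical stand-ins for the agent's policy and the update rule U, not the benchmark's actual API:

```python
def run_session(env, agent_state, update_rule, num_episodes=50, horizon=110):
    """Illustrative J-TTL session loop: fixed restarts, between-episode updates."""
    returns = []
    for _ in range(num_episodes):
        obs, info = env.reset()                    # every episode restarts from s_0
        trajectory, episode_return = [], 0
        for _ in range(horizon):
            action = agent_state.act(obs)          # natural-language command
            obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
            episode_return += reward
            if done:
                break
        returns.append(episode_return)
        # Test-time update: theta_{k+1} = U(theta_k, tau_k), in-session feedback only
        agent_state = update_rule(agent_state, trajectory)
    return returns, agent_state
```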
Performance is evaluated using:
- Per-episode return R_k, plotted over the session as a "learning curve"
- Improvement metric Δ = R_K − R_1, the gain from the first to the final episode
- Area Under Curve (AUC), normalized by the game's maximum score R_max: AUC = (1/(K·R_max)) Σ_{k=1}^{K} R_k
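These metrics follow directly from the recorded per-episode returns; a short sketch (the exact normalization used in the paper may differ in detail):

```python
def jttl_metrics(returns, max_score):
    """Learning curve, improvement Δ = R_K − R_1, and score-normalized AUC."""
    K = len(returns)
    improvement = returns[-1] - returns[0]
    auc = sum(returns) / (K * max_score)           # lies in [0, 1] for bounded scores
    return returns, improvement, auc

# Example: returns rising from 30 to 90 on a game whose maximum score is 100
curve, delta, auc = jttl_metrics([30, 50, 70, 90], max_score=100)
print(delta, auc)                                  # 60 0.6
```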
2. Environment Suite
Six publicly available Jericho IF games comprise the J-TTL suite, capturing a spectrum of puzzle complexity and search difficulty:
- Detective
- Library
- Zork1
- Zork3
- Balances
- Temple
Diversity in narrative logic, action ambiguity, and state observability ensures broad assessment of agent generalization. All games feature open-ended command input and reward structures reflecting canonical IF challenges.
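For orientation, a basic interaction with one of these games via the open-source jericho package might look as follows (the story-file path is hypothetical, and the benchmark's own wrapper may differ):

```python
from jericho import FrotzEnv

env = FrotzEnv("roms/zork1.z5")                     # hypothetical path to the story file
obs, info = env.reset()                             # initial textual observation
print(env.get_max_score())                          # e.g., 350 for Zork1

obs, reward, done, info = env.step("open mailbox")  # free-form natural-language command
print(reward, info["score"], done)

candidates = env.get_valid_actions()                # parser-accepted commands (optional helper)
```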
3. Evaluation Protocol and Baseline Methods
The agent receives lowercased textual observations and interacts with the environment by issuing natural-language actions, receiving reward feedback corresponding to score increments. Each episode is limited to H = 110 steps. After every episode, adaptation occurs via one of several families of methods:
- Memory-based: Agents append raw episode transcripts (Memory) or retrieve salient context snippets (RAG).
- Reflection-based: Agents summarize prior runs (Summary) or generate structured critiques (Reflexion).
- Prompt optimization: Agents utilize textual gradient-inspired edits (TextGrad), prompt evolution with Promptbreeder, or API-driven evolution with EvoPrompt.
- Parameter fine-tuning: Online supervised fine-tuning (SFT) and RL (GRPO) methods operate on learnable weights.
- Evolutionary adaptation (EvoTest): Jointly optimizes high-level prompting, memory structures, hyperparameters, and tool routines after each episode.
All methods are evaluated with identical episode budgets and tested using major LLM backbones: google/gemini-2.5-flash, anthropic/claude-4-sonnet (for non-weight-update methods); qwen/qwen3-32b (for SFT/GRPO).
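To make the contrast between these families concrete, the sketch below treats each adaptation method as a function from (agent state, episode transcript) to an updated state; `llm` is a placeholder for a backbone call, and all names are illustrative assumptions rather than the benchmark's interfaces:

```python
def memory_update(state, transcript):
    """Memory-based: append the raw episode transcript to the agent's context."""
    state["memory"].append(transcript)
    return state

def reflection_update(state, transcript, llm):
    """Reflection-based: distill the transcript into a critique added to the prompt."""
    critique = llm("Critique this playthrough and list mistakes:\n" + transcript)
    state["prompt"] += "\n# Lessons from last episode\n" + critique
    return state

def evolutionary_update(state, transcript, llm):
    """Holistic (EvoTest-style): jointly revise prompt, memory, and hyperparameters."""
    proposal = llm(
        "Given this transcript, propose a revised strategy prompt, "
        "updated memory entries, and an exploration temperature:\n" + transcript
    )
    # A full system would parse and score the proposal before adopting it;
    # here it is simply recorded to keep the sketch short.
    state["pending_proposal"] = proposal
    return state
```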
4. Quantitative Results
Mean AUC scores over 50-episode sessions across the six games are summarized below for key adaptation approaches; the paired values in each cell correspond to the two LLM backbones (gemini-2.5-flash / claude-4-sonnet):
| Method | Detective | Library | Zork1 | Zork3 | Balances | Temple | Average |
|---|---|---|---|---|---|---|---|
| Static | 0.21/0.23 | 0.15/0.16 | 0.03/0.04 | 0.05/0.06 | 0.11/0.12 | 0.08/0.09 | 0.11/0.12 |
| Reflexion | 0.58/0.60 | 0.41/0.44 | 0.09/0.11 | 0.25/0.27 | 0.30/0.32 | 0.29/0.31 | 0.32/0.34 |
| EvoPrompt | 0.65/0.67 | 0.48/0.50 | 0.10/0.12 | 0.30/0.32 | 0.24/0.26 | 0.27/0.29 | 0.34/0.36 |
| EvoTest | 0.94/0.95 | 0.77/0.80 | 0.14/0.16 | 0.35/0.38 | 0.32/0.35 | 0.31/0.34 | 0.47/0.50 |
- No baseline reaches a game's win condition; EvoTest alone completes Detective and Library.
- EvoTest yields an average improvement of approximately 38% in AUC over the strongest prompt-evolution baseline (0.47 vs. 0.34 average AUC) and 57% over online RL.
- Memory-based and reflection-based updates yield only incremental gains, confirming that simple transcript accumulation or prompt edits are insufficient for substantial adaptation (He et al., 15 Oct 2025).
5. Key Challenges in Test-Time Learning for Interactive Fiction
Several characteristics make test-time learning in IF fundamentally difficult:
- Sparse and delayed rewards: Reinforcement learning and supervised fine-tuning methods encounter noisy gradients, compromising data efficiency.
- Partial observability: State estimation is challenged by incomplete and ambiguous textual observations.
- Combinatorial action space: Direct action selection and credit assignment are nontrivial due to open-ended command possibilities.
- Context window overflow: Naïve memory-based retrieval results in excessive input lengths, while unstructured summarization causes surface-level changes only.
- Holistic adaptation: Substantial self-improvement requires coordinated updating across prompts, structured episodic memory, exploration hyperparameters, and tool routines.
A plausible implication is that methods addressing only a single adaptation channel (e.g., memory or prompting) lack sufficient coverage for robust IF problem solving. Holistic, multi-channel adaptation emerges as essential.
6. Algorithmic Paradigms and Design Recommendations
The J-TTL benchmark results highlight design principles for effective test-time adaptation:
- Holistically adapt high-level policy prompts, structured memory (success/failure tables), exploration parameters, and tool abstractions in tandem.
- Leverage the narrative episode transcript as semantically dense feedback for credit assignment, ideally mediated by a dedicated "Evolver" agent.
- Use structured memory partitioned into direct retrieval and guardrail-enforcing segments.
- Apply principled exploration/exploitation (e.g., Upper Confidence Bound) over agent configuration space, not just primitive actions.
- Employ lightweight, evolvable state extractor functions that condense trajectory history into salient milestones, limiting context window overload.
- Favor gradient-free, API-driven updates to maximize test-time data efficiency and reduce hardware requirements.
Algorithmic instantiations such as EvoTest employ a two-tier loop: first, the actor agent executes an episode under the current configuration; then, the Evolver mutates that configuration (prompts, memory, hyperparameters, and tool routines), generating candidate configurations that are evaluated and selected through UCB-based credit assignment.
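A minimal sketch of that selection step, using standard UCB1 over candidate configurations (the exploration constant and bookkeeping here are illustrative, not the paper's exact procedure):

```python
import math

def ucb_select(stats, total_plays, c=1.4):
    """Pick the configuration with the highest UCB1 score.

    stats maps config_id -> (mean_return, times_played).
    """
    def ucb(entry):
        mean, n = entry
        if n == 0:
            return float("inf")                    # try untested configurations first
        return mean + c * math.sqrt(math.log(total_plays) / n)
    return max(stats, key=lambda cfg: ucb(stats[cfg]))

# Example: three mutated configurations after a few episodes
stats = {"cfg_a": (40.0, 3), "cfg_b": (55.0, 2), "cfg_c": (0.0, 0)}
next_cfg = ucb_select(stats, total_plays=5)        # selects the untested cfg_c
```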
7. Implications and Future Directions
J-TTL operationalizes the concept of agentic self-improvement in text-based, partially observable, dynamic environments, rigorously testing beyond static policy generalization or pre-trained memory. The empirical findings demonstrate that reflection, naïve memory, and online fine-tuning alone are inadequate for robust test-time learning. Instead, whole-system, gradient-free, multi-channel adaptation (as in EvoTest) is necessary to make substantive progress on challenging IF domains (He et al., 15 Oct 2025).
J-TTL establishes a clear target for developing agents with persistent, session-level adaptability—including credit assignment, memory structuring, and configuration evolution—in alignment with practical requirements for generalized interactive systems.