Jericho Test-Time Learning Benchmark
- The benchmark systematically evaluates AI agents in interactive fiction games modeled as POMDPs with combinatorial natural-language actions and delayed rewards.
- It measures an agent’s self-improvement by updating internal states between episodes using methods like memory accumulation, reflection, and evolutionary adaptation.
- Quantitative results show that holistic, multi-channel adaptation (e.g., EvoTest) significantly outperforms simpler strategies in achieving robust performance gains.
The Jericho Test-Time Learning (J-TTL) benchmark is an evaluation framework that systematically measures the ability of AI agents to perform self-improvement and adaptation across multiple consecutive episodes in interactive fiction (IF) environments. Each game in the benchmark is formulated as a partially observable Markov decision process (POMDP) featuring combinatorial natural-language action spaces, multi-step reasoning, and delayed sparse rewards. J-TTL explicitly quantifies an agent's capacity to update its internal components between episodes, leveraging only in-session feedback, without environment alteration or additional pre-training (He et al., 15 Oct 2025).
1. Formal Framework
J-TTL instantiates each IF game in the Jericho suite as a POMDP defined by:
- S: Latent state space, not directly observed by the agent.
- A: Unbounded, combinatorial natural-language command space.
- T(s′|s,a): State transition dynamics.
- R(s,a) ∈ ℝ: Scalar reward function, determined by score changes after each action.
- Ω(o|s): Emission distribution yielding the (lowercased) textual description received by the agent.
- H: Maximum step horizon per episode (default 110).
Running episode k produces a trajectory τ_k (the sequence of observations, actions, and rewards) and a total return R_k equal to the sum of per-step rewards. A session comprises K episodes (default K = 50), each beginning from the game's fixed initial state s_0. After episode k, the agent updates its internal state θ_k according to a test-time update rule θ_{k+1} = U(θ_k, τ_k), using only feedback collected within the session.
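As a concrete illustration, here is a minimal sketch of this session protocol under the formulation above; `agent_state.act` and `update_rule` are hypothetical stand-ins for the agent's policy and the update rule U, not the benchmark's actual API:

```python
def run_session(env, agent_state, update_rule, num_episodes=50, horizon=110):
    """Illustrative J-TTL session loop: fixed restarts, between-episode updates."""
    returns = []
    for _ in range(num_episodes):
        obs, info = env.reset()                    # every episode restarts from s_0
        trajectory, episode_return = [], 0
        for _ in range(horizon):
            action = agent_state.act(obs)          # natural-language command
            obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
            episode_return += reward
            if done:
                break
        returns.append(episode_return)
        # Test-time update: theta_{k+1} = U(theta_k, tau_k), in-session feedback only
        agent_state = update_rule(agent_state, trajectory)
    return returns, agent_state
```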
Performance is evaluated using:
- Per-episode return R_k, plotted over the session as a "learning curve"
- Improvement metric Δ = R_K − R_1, the gain from the first to the final episode
- Area Under Curve (AUC), normalized by the game's maximum score R_max: AUC = (1/(K·R_max)) Σ_{k=1}^{K} R_k
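These metrics follow directly from the recorded per-episode returns; a short sketch (the exact normalization used in the paper may differ in detail):

```python
def jttl_metrics(returns, max_score):
    """Learning curve, improvement Δ = R_K − R_1, and score-normalized AUC."""
    K = len(returns)
    improvement = returns[-1] - returns[0]
    auc = sum(returns) / (K * max_score)           # lies in [0, 1] for bounded scores
    return returns, improvement, auc

# Example: returns rising from 30 to 90 on a game whose maximum score is 100
curve, delta, auc = jttl_metrics([30, 50, 70, 90], max_score=100)
print(delta, auc)                                  # 60 0.6
```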
2. Environment Suite
Six publicly available Jericho IF games comprise the J-TTL suite, capturing a spectrum of puzzle complexity and search difficulty:
- Detective
- Library
- Zork1
- Zork3
- Balances
- Temple
Diversity in narrative logic, action ambiguity, and state observability ensures broad assessment of agent generalization. All games feature open-ended command input and reward structures reflecting canonical IF challenges.
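For orientation, a basic interaction with one of these games via the open-source jericho package might look as follows (the story-file path is hypothetical, and the benchmark's own wrapper may differ):

```python
from jericho import FrotzEnv

env = FrotzEnv("roms/zork1.z5")                     # hypothetical path to the story file
obs, info = env.reset()                             # initial textual observation
print(env.get_max_score())                          # e.g., 350 for Zork1

obs, reward, done, info = env.step("open mailbox")  # free-form natural-language command
print(reward, info["score"], done)

candidates = env.get_valid_actions()                # parser-accepted commands (optional helper)
```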
3. Evaluation Protocol and Baseline Methods
The agent receives lowercased textual observations and interacts with the environment by issuing natural-language actions, receiving reward feedback corresponding to score increments. Each episode is limited to H = 110 steps. After every episode, adaptation occurs via one of several families of methods:
- Memory-based: Agents append raw episode transcripts (Memory) or retrieve salient context snippets (RAG).
- Reflection-based: Agents summarize prior runs (Summary) or generate structured critiques (Reflexion).
- Prompt optimization: Agents utilize textual gradient-inspired edits (TextGrad), prompt evolution with Promptbreeder, or API-driven evolution with EvoPrompt.
- Parameter fine-tuning: Online supervised fine-tuning (SFT) and RL (GRPO) methods operate on learnable weights.
- Evolutionary adaptation (EvoTest): Jointly optimizes high-level prompting, memory structures, hyperparameters, and tool routines after each episode.
All methods are evaluated with identical episode budgets and tested using major LLM backbones: google/gemini-2.5-flash, anthropic/claude-4-sonnet (for non-weight-update methods); qwen/qwen3-32b (for SFT/GRPO).
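To make the contrast between these families concrete, the sketch below treats each adaptation method as a function from (agent state, episode transcript) to an updated state; `llm` is a placeholder for a backbone call, and all names are illustrative assumptions rather than the benchmark's interfaces:

```python
def memory_update(state, transcript):
    """Memory-based: append the raw episode transcript to the agent's context."""
    state["memory"].append(transcript)
    return state

def reflection_update(state, transcript, llm):
    """Reflection-based: distill the transcript into a critique added to the prompt."""
    critique = llm("Critique this playthrough and list mistakes:\n" + transcript)
    state["prompt"] += "\n# Lessons from last episode\n" + critique
    return state

def evolutionary_update(state, transcript, llm):
    """Holistic (EvoTest-style): jointly revise prompt, memory, and hyperparameters."""
    proposal = llm(
        "Given this transcript, propose a revised strategy prompt, "
        "updated memory entries, and an exploration temperature:\n" + transcript
    )
    # A full system would parse and score the proposal before adopting it;
    # here it is simply recorded to keep the sketch short.
    state["pending_proposal"] = proposal
    return state
```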
4. Quantitative Results
Mean AUC scores over 50-episode sessions across the six games are summarized below for key adaptation approaches; the paired values in each cell correspond to the two LLM backbones (gemini-2.5-flash / claude-4-sonnet):
| Method | Detective | Library | Zork1 | Zork3 | Balances | Temple | Average |
|---|---|---|---|---|---|---|---|
| Static | 0.21/0.23 | 0.15/0.16 | 0.03/0.04 | 0.05/0.06 | 0.11/0.12 | 0.08/0.09 | 0.11/0.12 |
| Reflexion | 0.58/0.60 | 0.41/0.44 | 0.09/0.11 | 0.25/0.27 | 0.30/0.32 | 0.29/0.31 | 0.32/0.34 |
| EvoPrompt | 0.65/0.67 | 0.48/0.50 | 0.10/0.12 | 0.30/0.32 | 0.24/0.26 | 0.27/0.29 | 0.34/0.36 |
| EvoTest | 0.94/0.95 | 0.77/0.80 | 0.14/0.16 | 0.35/0.38 | 0.32/0.35 | 0.31/0.34 | 0.47/0.50 |
- No baseline reaches a game's win condition; EvoTest alone completes Detective and Library.
- EvoTest yields an average improvement of approximately 38% in AUC over the strongest prompt-evolution baseline (0.47 vs. 0.34 average AUC) and 57% over online RL.
- Memory-based and reflection-based updates yield only incremental gains, confirming that simple transcript accumulation or prompt edits are insufficient for substantial adaptation (He et al., 15 Oct 2025).
5. Key Challenges in Test-Time Learning for Interactive Fiction
Several characteristics make test-time learning in IF fundamentally difficult:
- Sparse and delayed rewards: Reinforcement learning and supervised fine-tuning methods encounter noisy gradients, compromising data efficiency.
- Partial observability: State estimation is challenged by incomplete and ambiguous textual observations.
- Combinatorial action space: Direct action selection and credit assignment are nontrivial due to open-ended command possibilities.
- Context window overflow: Naïve memory-based retrieval results in excessive input lengths, while unstructured summarization causes surface-level changes only.
- Holistic adaptation: Substantial self-improvement requires coordinated updating across prompts, structured episodic memory, exploration hyperparameters, and tool routines.
A plausible implication is that methods addressing only a single adaptation channel (e.g., memory or prompting) lack sufficient coverage for robust IF problem solving. Holistic, multi-channel adaptation emerges as essential.
6. Algorithmic Paradigms and Design Recommendations
The J-TTL benchmark results highlight design principles for effective test-time adaptation:
- Holistically adapt high-level policy prompts, structured memory (success/failure tables), exploration parameters, and tool abstractions in tandem.
- Leverage the narrative episode transcript as semantically dense feedback for credit assignment, ideally mediated by a dedicated "Evolver" agent.
- Use structured memory partitioned into direct retrieval and guardrail-enforcing segments.
- Apply principled exploration/exploitation (e.g., Upper Confidence Bound) over agent configuration space, not just primitive actions.
- Employ lightweight, evolvable state extractor functions that condense trajectory history into salient milestones, limiting context window overload.
- Favor gradient-free, API-driven updates to maximize test-time data efficiency and reduce hardware requirements.
Algorithmic instantiations such as EvoTest employ a two-tier loop: first, the actor agent executes an episode under the current configuration; then, the Evolver mutates that configuration (prompts, memory, hyperparameters, and tool routines), generating candidate configurations that are evaluated and selected through UCB-based credit assignment.
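A minimal sketch of that selection step, using standard UCB1 over candidate configurations (the exploration constant and bookkeeping here are illustrative, not the paper's exact procedure):

```python
import math

def ucb_select(stats, total_plays, c=1.4):
    """Pick the configuration with the highest UCB1 score.

    stats maps config_id -> (mean_return, times_played).
    """
    def ucb(entry):
        mean, n = entry
        if n == 0:
            return float("inf")                    # try untested configurations first
        return mean + c * math.sqrt(math.log(total_plays) / n)
    return max(stats, key=lambda cfg: ucb(stats[cfg]))

# Example: three mutated configurations after a few episodes
stats = {"cfg_a": (40.0, 3), "cfg_b": (55.0, 2), "cfg_c": (0.0, 0)}
next_cfg = ucb_select(stats, total_plays=5)        # selects the untested cfg_c
```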
7. Implications and Future Directions
J-TTL operationalizes the concept of agentic self-improvement in text-based, partially observable, dynamic environments, rigorously testing beyond static policy generalization or pre-trained memory. The empirical findings demonstrate that reflection, naïve memory, and online fine-tuning alone are inadequate for robust test-time learning. Instead, whole-system, gradient-free, multi-channel adaptation (as in EvoTest) is necessary to make substantive progress on challenging IF domains (He et al., 15 Oct 2025).
J-TTL establishes a clear target for developing agents with persistent, session-level adaptability—including credit assignment, memory structuring, and configuration evolution—in alignment with practical requirements for generalized interactive systems.