Jericho Test-Time Learning Benchmark

Updated 29 December 2025
  • The benchmark systematically evaluates AI agents in interactive fiction games modeled as POMDPs with combinatorial natural-language actions and delayed rewards.
  • It measures an agent’s self-improvement by updating internal states between episodes using methods like memory accumulation, reflection, and evolutionary adaptation.
  • Quantitative results show that holistic, multi-channel adaptation (e.g., EvoTest) significantly outperforms simpler strategies in achieving robust performance gains.

The Jericho Test-Time Learning (J-TTL) benchmark is an evaluation framework that systematically measures the ability of AI agents to perform self-improvement and adaptation across multiple consecutive episodes in interactive fiction (IF) environments. Each game in the benchmark is formulated as a partially observable Markov decision process (POMDP) featuring combinatorial natural-language action spaces, multi-step reasoning, and delayed sparse rewards. J-TTL explicitly quantifies an agent's capacity to update its internal components between episodes, leveraging only in-session feedback, without environment alteration or additional pre-training (He et al., 15 Oct 2025).

1. Formal Framework

J-TTL instantiates each IF game in the Jericho suite as a POMDP defined by:

  • S: Latent state space, not directly observed by the agent.
  • A: Unbounded, combinatorial natural-language command space.
  • T(s′|s,a): State transition dynamics.
  • R(s,a) ∈ ℝ: Scalar reward function, determined by score changes after each action.
  • Ω(o|s): Emission distribution yielding the (lowercased) textual description o ∈ Ω received by the agent.
  • T: Maximum step horizon per episode (default 110).

An agent-run episode $e$ results in a trajectory $\tau^{(e)} = (o_1^{(e)}, a_1^{(e)}, r_1^{(e)}, \ldots, o_T^{(e)}, a_T^{(e)}, r_T^{(e)})$ and total return $R(e) = \sum_{t=1}^{T} r_t^{(e)}$. A session comprises $K$ episodes (default $K = 50$), each beginning from a fixed initial state $s_\mathrm{init}$. After episode $e$, the agent updates its internal state $\theta^{(e)}$ according to a test-time update rule $\theta^{(e+1)} = U(\theta^{(e)}, \tau^{(e)})$.

Performance is evaluated using:

  • Per-episode return $R(1), R(2), \ldots, R(K)$ (the "learning curve")
  • Improvement metric $\Delta R(e) = R(e+1) - R(e)$
  • Area Under Curve (AUC), normalized by the game’s maximum score $R_\mathrm{max}$:

$$\mathrm{AUC} = \frac{1}{K\, R_\mathrm{max}} \sum_{e=1}^{K} R(e), \qquad 0 \leq \mathrm{AUC} \leq 1$$
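
For concreteness, the sketch below mirrors this protocol: a session of $K$ episodes with a fixed step budget, an opaque update rule $U$ applied between episodes, and the normalized AUC metric. The environment and agent interfaces here are hypothetical placeholders, not the benchmark's actual API.

```python
from typing import Callable, List

def run_session(
    make_env: Callable[[], object],        # returns a fresh environment starting from s_init
    agent,                                 # exposes act(obs) -> str and a .theta attribute
    update_rule: Callable,                 # U(theta, trajectory) -> new theta
    num_episodes: int = 50,                # K
    max_steps: int = 110,                  # T
) -> List[float]:
    """Run one J-TTL-style session and return the learning curve R(1), ..., R(K)."""
    returns = []
    for _ in range(num_episodes):
        env = make_env()                   # every episode restarts from the same initial state
        obs = env.reset()
        trajectory, episode_return = [], 0.0
        for _ in range(max_steps):
            action = agent.act(obs)
            obs, reward, done = env.step(action)
            trajectory.append((obs, action, reward))
            episode_return += reward
            if done:
                break
        returns.append(episode_return)     # R(e)
        agent.theta = update_rule(agent.theta, trajectory)  # theta^(e+1) = U(theta^(e), tau^(e))
    return returns

def auc(returns: List[float], r_max: float) -> float:
    """Normalized area under the learning curve: sum_e R(e) / (K * R_max)."""
    return sum(returns) / (len(returns) * r_max)
```

With the learning curve in hand, `auc(returns, r_max)` reproduces the normalized metric defined above.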

2. Environment Suite

Six publicly available Jericho IF games comprise the J-TTL suite, capturing a spectrum of puzzle complexity and search difficulty:

  • Detective
  • Library
  • Zork1
  • Zork3
  • Balances
  • Temple

Diversity in narrative logic, action ambiguity, and state observability ensures broad assessment of agent generalization. All games feature open-ended command input and reward structures reflecting canonical IF challenges.
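
As a rough illustration (not part of the benchmark specification), a single suite game can be driven through the open-source `jericho` package's FrotzEnv interface; the ROM path below is a placeholder for a locally available game file.

```python
# Sketch only: assumes the open-source `jericho` package and a local game ROM.
from jericho import FrotzEnv

env = FrotzEnv("roms/detective.z5")           # placeholder path to a suite game file
obs, info = env.reset()                       # textual observation (J-TTL lowercases it)
max_score = env.get_max_score()               # R_max used to normalize AUC
obs, reward, done, info = env.step("look")    # free-form natural-language command
print(reward, done, max_score)
```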

3. Evaluation Protocol and Baseline Methods

The agent receives lowercased textual observations and interacts with the environment by issuing natural-language actions, receiving reward feedback corresponding to score increments. Each episode is limited to $T = 110$ steps. After every episode, adaptation occurs via one of several families of methods:

  • Memory-based: Agents append raw episode transcripts (Memory) or retrieve salient context snippets (RAG).
  • Reflection-based: Agents summarize prior runs (Summary) or generate structured critiques (Reflexion).
  • Prompt optimization: Agents utilize textual gradient-inspired edits (TextGrad), prompt evolution with Promptbreeder, or API-driven evolution with EvoPrompt.
  • Parameter fine-tuning: Online supervised fine-tuning (SFT) and RL (GRPO) methods operate on learnable weights.
  • Evolutionary adaptation (EvoTest): Jointly optimizes high-level prompting, memory structures, hyperparameters, and tool routines after each episode.

All methods are evaluated with identical episode budgets and tested using major LLM backbones: google/gemini-2.5-flash, anthropic/claude-4-sonnet (for non-weight-update methods); qwen/qwen3-32b (for SFT/GRPO).
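
To make the update-rule families above concrete, the fragment below sketches hypothetical memory-based and reflection-based instantiations of $U(\theta, \tau)$; the data structures and the stand-in summarizer are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Transition = Tuple[str, str, float]            # (observation, action, reward)

@dataclass
class AgentState:                              # hypothetical theta: prompt plus episodic memory
    system_prompt: str
    memory: List[str] = field(default_factory=list)

def memory_update(theta: AgentState, trajectory: List[Transition]) -> AgentState:
    """Memory-based U: append the raw episode transcript to the agent's context."""
    transcript = "\n".join(f"> {a} (reward {r})\n{o}" for o, a, r in trajectory)
    theta.memory.append(transcript)
    return theta

def reflection_update(theta: AgentState, trajectory: List[Transition],
                      summarize: Callable[[str], str] = lambda t: t[:500]) -> AgentState:
    """Reflection-based U: store a short critique instead of the raw transcript.
    `summarize` stands in for an LLM call; simple truncation is used here for illustration."""
    transcript = "\n".join(f"{a} -> reward {r}" for _, a, r in trajectory)
    theta.memory.append(summarize("Critique of last episode:\n" + transcript))
    return theta
```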

4. Quantitative Results

Mean AUC scores over 50-episode sessions across the six games are summarized below for key adaptation approaches:

| Method    | Detective | Library   | Zork1     | Zork3     | Balances  | Temple    | Average   |
|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| Static    | 0.21/0.23 | 0.15/0.16 | 0.03/0.04 | 0.05/0.06 | 0.11/0.12 | 0.08/0.09 | 0.11/0.12 |
| Reflexion | 0.58/0.60 | 0.41/0.44 | 0.09/0.11 | 0.25/0.27 | 0.30/0.32 | 0.29/0.31 | 0.32/0.34 |
| EvoPrompt | 0.65/0.67 | 0.48/0.50 | 0.10/0.12 | 0.30/0.32 | 0.24/0.26 | 0.27/0.29 | 0.34/0.36 |
| EvoTest   | 0.94/0.95 | 0.77/0.80 | 0.14/0.16 | 0.35/0.38 | 0.32/0.35 | 0.31/0.34 | 0.47/0.50 |

  • No baseline reaches the win condition in any game; EvoTest alone achieves game completion, doing so in Detective and Library.
  • EvoTest yields an average improvement of approximately 38% AUC over the strongest prompt-evolution baselines and 57% over online RL.
  • Memory-based and reflection-based updates yield only incremental gains, confirming that simple transcript accumulation or prompt edits are insufficient for substantial adaptation (He et al., 15 Oct 2025).

5. Key Challenges in Test-Time Learning for Interactive Fiction

Several characteristics make test-time learning in IF fundamentally difficult:

  • Sparse and delayed rewards: Reinforcement learning and supervised fine-tuning methods encounter noisy gradients, compromising data efficiency.
  • Partial observability: State estimation is challenged by incomplete and ambiguous textual observations.
  • Combinatorial action space: Direct action selection and credit assignment are nontrivial due to open-ended command possibilities.
  • Context window overflow: Naïve memory-based retrieval results in excessive input lengths, while unstructured summarization causes surface-level changes only.
  • Holistic adaptation: Substantial self-improvement requires coordinated updating across prompts, structured episodic memory, exploration hyperparameters, and tool routines.

A plausible implication is that methods addressing only a single adaptation channel (e.g., memory or prompting) lack sufficient coverage for robust IF problem solving. Holistic, multi-channel adaptation emerges as essential.

6. Algorithmic Paradigms and Design Recommendations

The J-TTL benchmark results highlight design principles for effective test-time adaptation:

  • Holistically adapt high-level policy prompts, structured memory (success/failure tables), exploration parameters, and tool abstractions in tandem.
  • Leverage the narrative episode transcript as semantically dense feedback for credit assignment, ideally mediated by a dedicated "Evolver" agent.
  • Use structured memory partitioned into direct retrieval and guardrail-enforcing segments.
  • Apply principled exploration/exploitation (e.g., Upper Confidence Bound) over agent configuration space, not just primitive actions.
  • Employ lightweight, evolvable state extractor functions that condense trajectory history into salient milestones, limiting context window overload.
  • Favor gradient-free, API-driven updates to maximize test-time data efficiency and reduce hardware requirements.

Algorithmic instantiations such as EvoTest employ a two-tier loop: first, the actor agent executes an episode given a configuration $\chi^{(e)}$; then, the Evolver mutates configurations (including prompts, memory, hyperparameters, and tool routines), generating candidates evaluated and selected through UCB-based credit assignment.
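
A minimal sketch of the UCB-style selection over candidate configurations that such an Evolver loop relies on is given below; the bookkeeping and the exploration constant `c` are illustrative assumptions, not EvoTest's published procedure.

```python
import math

def ucb_select(stats: dict, c: float = 1.0):
    """Pick the configuration with the highest upper confidence bound.
    `stats` maps a configuration id to (num_episodes_tried, mean_return)."""
    total = sum(n for n, _ in stats.values()) or 1
    def bound(item):
        n, mean = item[1]
        if n == 0:
            return float("inf")            # try untested configurations first
        return mean + c * math.sqrt(math.log(total) / n)
    return max(stats.items(), key=bound)[0]

def record_return(stats: dict, config_id, episode_return: float) -> None:
    """Update the running mean return of the configuration just played."""
    n, mean = stats[config_id]
    stats[config_id] = (n + 1, mean + (episode_return - mean) / (n + 1))
```

After each episode, `record_return` updates the chosen configuration's statistics and `ucb_select` determines which candidate the actor plays next, balancing exploration of newly mutated configurations against exploitation of high-return ones.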

7. Implications and Future Directions

J-TTL operationalizes the concept of agentic self-improvement in text-based, partially observable, dynamic environments, rigorously testing beyond static policy generalization or pre-trained memory. The empirical findings demonstrate that reflection, naïve memory, and online fine-tuning alone are inadequate for robust test-time learning. Instead, whole-system, gradient-free, multi-channel adaptation (as in EvoTest) is necessary to make substantive progress on challenging IF domains (He et al., 15 Oct 2025).

J-TTL establishes a clear target for developing agents with persistent, session-level adaptability—including credit assignment, memory structuring, and configuration evolution—in alignment with practical requirements for generalized interactive systems.
