Jericho Benchmark for Interactive Fiction Agents
- Jericho Benchmark is an open-source testbed designed to evaluate autonomous agents in text-driven interactive fiction games.
- It reduces combinatorial action spaces through template extraction and provides APIs for reinforcement learning and world modeling.
- The platform advances research in natural language understanding, commonsense reasoning, and long-horizon planning within complex game scenarios.
The Jericho benchmark is a standardized testbed for the evaluation of autonomous agents in human-authored Interactive Fiction (IF) games, integrating challenges in combinatorial action spaces, natural language understanding, commonsense reasoning, and long-horizon planning. Jericho provides a curated suite of text-based adventure games, environment APIs, and mechanisms for action generation and world modeling, advancing reproducibility and rigor in language-based agent research.
1. Definition and Scope
Jericho is an open-source, Python-based environment for IF games, designed to facilitate reinforcement learning and natural-language agent research. These games are fully text-driven environments in which agents issue natural language commands to effect change and advance the narrative. Jericho’s API follows the OpenAI Gym paradigm, offering step-wise interaction, load/save functionality for episodic algorithms, random seed control, and lower-level state inspection. Two game categories are distinguished: supported games (with score, move, and world-change detection) and unsupported games (playable but lacking integrated metrics).
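The interaction loop below is a minimal sketch of this Gym-style interface, assuming the open-source `jericho` Python package's `FrotzEnv` class and an illustrative ROM path; exact method names may differ slightly across package versions.

```python
from jericho import FrotzEnv

# Attach the Z-machine emulator to a game ROM (path is illustrative).
env = FrotzEnv("roms/zork1.z5")

obs, info = env.reset()                  # initial room description + metadata
print(obs)
print("max score:", env.get_max_score())

# Step-wise interaction: issue a natural language command.
obs, reward, done, info = env.step("open mailbox")
print(obs, reward, info["score"], info["moves"])

# Load/save support for episodic algorithms: snapshot and restore full state.
snapshot = env.get_state()
obs, reward, done, info = env.step("take leaflet")
env.set_state(snapshot)                  # roll back to the saved snapshot

# Valid-action detection via parser feedback and world-change analysis.
valid_actions = env.get_valid_actions()
print(valid_actions[:10])
```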
2. Game Suite and Action Space
Jericho supports 32 classic and modern IF games, spanning genres such as dungeon crawl (Zork), sci-fi, comedy, mystery, and horror, including both Infocom and community-authored titles. Each game employs a point-based reward system for reinforcement learning compatibility.
Combinatorial Action Space:
- Actions consist of natural language commands; the per-step action space scales as $|V|^n$ for vocabulary size $|V|$ and command length $n$ (e.g., $700^4 \approx 2.4 \times 10^{11}$ candidate 4-word commands over a roughly 700-word vocabulary).
- Only a small subset forms grammatical, contextually valid instructions.
Jericho introduces per-game action template extraction (e.g., "put _ in _"), reducing action-space complexity to $O(T \cdot |V|^2)$, where $T$ is the number of templates and each template contains at most two open object slots. For Zork1, for instance, a few hundred templates over a roughly 700-word vocabulary yield on the order of $10^8$ potential actions. Valid action detection leverages parser feedback and world-change analysis, making action feasibility computable.
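As a back-of-the-envelope illustration of this reduction, the sketch below compares the raw word-level count $|V|^n$ with the template-based count $T \cdot |V|^2$; the vocabulary size and template list are placeholders, not the exact values Jericho extracts for any particular game.

```python
# Back-of-the-envelope comparison of action-space sizes.
vocab_size = 700           # illustrative vocabulary size (order of Zork1's)
max_command_len = 4

# Raw word-level action space: |V|^n candidate commands per step.
raw_actions = vocab_size ** max_command_len
print(f"word-level candidates: {raw_actions:.2e}")        # ~2.4e11

# Template-based space: each template exposes at most two open slots,
# so the count is bounded by T * |V|^2 for T templates.
templates = ["take _", "open _", "put _ in _", "unlock _ with _"]  # placeholders
template_actions = sum(vocab_size ** t.count("_") for t in templates)
print(f"template-based candidates (T={len(templates)}): {template_actions:.2e}")

# Scaling to the few hundred templates a real game exposes gives on the
# order of 1e8 actions, far below the raw word-level space.
```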
3. Research Challenges
Three principal challenges are highlighted:
- Combinatorial Action Space: Agents must generate plausible NL commands from an astronomical candidate pool.
- Language Understanding & Commonsense: Success requires inferring objectives, context, and plausible affordances, such as reasoning that a chest is opened with a key.
- Knowledge Representation (Textual SLAM): Agents must construct and update world and spatial models from only text descriptions, without canonical state access.
Jericho additionally exposes objects and their locations via a latent world object tree, enabling explicit state inspection and aiding research on object-centric RL or planning.
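A short sketch of such state inspection is given below; it assumes the `jericho` package's `get_player_location()` and `get_inventory()` accessors and an object `.name` field, which may differ by version.

```python
from jericho import FrotzEnv

env = FrotzEnv("roms/zork1.z5")   # ROM path is illustrative
obs, info = env.reset()

# The latent world object tree exposes objects and their containment
# relations, independently of the text the agent actually observes.
location = env.get_player_location()
print("player location:", location.name)

inventory = env.get_inventory()
print("carried objects:", [obj.name for obj in inventory])

# This ground-truth state is intended as a handicap/diagnostic for
# object-centric RL or planning research, not as default agent input.
```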
4. Benchmark Protocols and Agent Architectures
Evaluation Protocol:
- Agents are assessed on the 32 supported games via five training runs per game, reporting the score attained as a percentage of each game's maximum achievable score over the final 100 episodes (a minimal normalization sketch follows this list).
- Handicaps such as determinism (fixed seeds), template/vocab access, and interaction with the world object tree are disclosed for fair comparison.
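The scoring convention can be made concrete with a small helper; the following is a hypothetical illustration of the normalization (score as a percentage of the game's maximum, aggregated over runs), not the benchmark's released evaluation code.

```python
from statistics import mean

def completion_percentage(episode_scores, max_score, window=100):
    """Score as a percentage of the game's maximum achievable score,
    averaged over the final `window` training episodes of one run."""
    tail = episode_scores[-window:]
    return 100.0 * mean(tail) / max_score

def benchmark_score(per_run_scores, max_score):
    """Aggregate over independent training runs (the benchmark uses five)."""
    return mean(completion_percentage(run, max_score) for run in per_run_scores)

# Example: two (shortened) runs on a game with a 350-point maximum.
runs = [[0, 10, 25, 35, 45], [0, 5, 20, 30, 40]]
print(benchmark_score(runs, max_score=350))
```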
Reported baselines:
- RANDOM agent: 1.8% completion
- NAIL (Heuristic, non-trained): 4.9%
- TDQN (Template-DQN): 6.1%
- DRRN (Choice-based RL): 10.7%
Architectures evaluated:
- DRRN: GRU encoders for both observation and action, with Q-value evaluation and valid-action softmax exploration.
- TDQN: Extends LSTM-DQN to template-based action selection, Q-values decomposed over templates and slot fills.
- NAIL: General IF agent relying on hand-coded heuristics and incremental world-model construction rather than learned policies.
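To make the DRRN scoring mechanism concrete, the following is a heavily simplified PyTorch sketch (hypothetical layer sizes and tokenization): separate GRU encoders embed the observation and each candidate action, a small MLP head produces a Q-value per pair, and exploration samples from a softmax over the valid-action Q-values.

```python
import torch
import torch.nn as nn

class DRRNScorer(nn.Module):
    """Simplified DRRN-style scorer: Q(o, a) from separate text encoders."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.obs_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.act_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs_ids, act_ids):
        # obs_ids: (1, obs_len) token ids; act_ids: (num_actions, act_len)
        _, obs_h = self.obs_gru(self.embed(obs_ids))            # (1, 1, H)
        _, act_h = self.act_gru(self.embed(act_ids))            # (1, A, H)
        obs_h = obs_h.squeeze(0).expand(act_ids.size(0), -1)    # (A, H)
        act_h = act_h.squeeze(0)                                 # (A, H)
        return self.q_head(torch.cat([obs_h, act_h], dim=-1)).squeeze(-1)  # (A,)

def sample_action(q_values):
    """Exploration: sample among valid actions via a softmax over Q-values."""
    probs = torch.softmax(q_values, dim=0)
    return torch.multinomial(probs, 1).item()

# Example usage with dummy token ids (hypothetical tokenizer).
scorer = DRRNScorer(vocab_size=1000)
obs = torch.randint(0, 1000, (1, 12))
acts = torch.randint(0, 1000, (5, 4))   # 5 valid actions, 4 tokens each
q = scorer(obs, acts)
print(q, sample_action(q))
```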
Table: Difficulty Categories (from the benchmark)

| Tier      | Size of Action Space | Walkthrough Length | Average Steps per Reward | Tropes Present      |
|-----------|----------------------|--------------------|--------------------------|---------------------|
| Possible  |                      |                    |                          | Low                 |
| Difficult |                      |                    |                          | Inventory/Limited   |
| Extreme   |                      |                    |                          | Dialogue/Stochastic |
5. Experimental Findings and Insights
Key findings include:
- Template-based action generation sharply reduces the search space, but valid, context-sensitive action generation remains a bottleneck.
- The choice-based DRRN benefits from being supplied with the valid actions, yielding superior performance when acting reduces to selecting among candidates.
- TDQN displays notable Q-value over-estimation due to its large output dimensionality.
- NAIL, a general heuristic agent, achieves moderate performance across the suite but does not match single-game agents optimized for specific titles.
- Sparse rewards, necessity for valid action inference, and large state/action spaces result in agents plateauing far below full completion; numerous games remain unsolved.
- Agent weaknesses include handling long-term dependencies, dialogue, inventory management, and robust planning.
6. Extensions: Planning Agents and Test-Time Learning
Recent research on the Jericho benchmark extends beyond RL:
- MC-DML (Shi et al., 23 Apr 2025): Integrates Monte Carlo Tree Search (MCTS) with LLMs and dynamic memory, using in-trial and cross-trial reflection to guide exploration. MC-DML achieves state-of-the-art first-iteration scores in games like Zork1 (48.66 vs. RL SOTA 44.3), and planning efficiency far surpasses prior MCTS agents.
- GLoW (Kim et al., 28 Sep 2025): Employs dual-scale world models (a global frontier archive and local multi-path advantage reflection) to improve sample efficiency in hard-exploration games. GLoW matches RL SOTA in 7/10 games while requiring substantially fewer environment steps.
- J-TTL/EvoTest (He et al., 15 Oct 2025): Establishes Jericho Test-Time Learning, focusing on holistic, episode-over-episode agentic adaptation. EvoTest evolves the agent configuration—prompt, memory, hyperparameters, tool-routines—via evolutionary selection and UCB, consistently winning games where all baseline adaptation methods failed.
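The configuration-selection step in this style of test-time evolution can be illustrated with a standard UCB1 bandit over candidate agent configurations; the sketch below is a generic textbook computation under assumed data structures, not EvoTest's actual implementation.

```python
import math

def ucb1_select(candidates, c=1.4):
    """Pick the configuration maximizing mean score plus an exploration bonus.
    `candidates` maps a config id to (total_score, num_trials)."""
    total_trials = sum(n for _, n in candidates.values())
    best, best_value = None, float("-inf")
    for config_id, (total_score, n) in candidates.items():
        if n == 0:
            return config_id                     # try untested configs first
        value = total_score / n + c * math.sqrt(math.log(total_trials) / n)
        if value > best_value:
            best, best_value = config_id, value
    return best

# Example: three candidate prompt/memory/hyperparameter bundles.
stats = {"cfg_a": (120.0, 4), "cfg_b": (90.0, 3), "cfg_c": (0.0, 0)}
print(ucb1_select(stats))   # "cfg_c" is chosen first because it is untested
```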
7. Impact and Future Directions
Jericho has advanced reproducibility and comparative research in language-based autonomous agents by unifying RL, planning, and LLM-based paradigms around real, complex text-based games.
Proposed directions include:
- Unsupervised rewardless learning for unsupported games.
- Conditional action generation modeling.
- Application and benchmarking of transformer-based or modular architectures.
- Modular self-evolving agents that leverage experience-driven adaptation across all agentic modules.
- Enhanced credit assignment and semantic feedback via narrative transcript analysis, as exemplified by EvoTest.
A plausible implication is that Jericho, through its extensible interface and comprehensive game suite, will remain a central platform for benchmarking semantic reasoning, memory, planning, and rapid agent adaptation in large action spaces, fostering research into robust, human-like autonomous systems in language-grounded environments.