XMiniGrid: Egocentric Gridworld Benchmark

Updated 14 October 2025
  • XMiniGrid is an egocentric, partially observable 2D gridworld environment designed for text-based navigation and object retrieval tasks, presenting challenges through sparse rewards and incomplete information.
  • It is used to benchmark LM agents' sample efficiency, in particular for frameworks that employ counterfactual trajectory rewriting and compressed memory updates to transform failures into actionable learning experiences.
  • The integrated ECHO framework demonstrates up to 80% higher average rewards over baseline agents, highlighting its robust capacity for efficient online learning in sparse reward settings.

XMiniGrid is an egocentric, partially observable 2D gridworld environment designed for text-based navigation and object retrieval tasks. It serves as a benchmark for evaluating the sample efficiency and adaptive capabilities of LLM agents operating in sequential interaction domains with sparse, delayed rewards. The primary challenge XMiniGrid presents is navigating complex layouts, discovering and retrieving objects, and succeeding despite incomplete information and infrequent feedback signals. Notably, XMiniGrid has been used to evaluate advanced LM agent frameworks such as ECHO (Experience Consolidation via Hindsight Optimization), which leverage counterfactual trajectory rewriting to enhance online learning.

1. Environment Specifications and Task Structure

XMiniGrid presents agents with a stateful, text-based simulation of a 2D grid comprising multiple rooms populated with objects of varying attributes (e.g., color, type). The agent receives partial, egocentric observations and must perform navigation and manipulation actions—such as moving between rooms and picking up objects—with the overarching goal of collecting specified items. Rewards are assigned upon successful task completion (object retrieval), with most interactions yielding little to no immediate feedback. The stateful nature of XMiniGrid means that the agent's decisions, world knowledge, and memory dynamics significantly affect long-term success rates.
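A minimal sketch of the interaction loop this implies is shown below. The reset/step interface, observation strings, and return signature are illustrative assumptions, not the benchmark's documented API:

def run_episode(env, agent, max_steps=50):
    # Roll out one episode; rewards are sparse, typically 0 until retrieval.
    obs = env.reset()  # e.g., "You are in a red room. You see a grey key."
    trajectory, total_reward = [], 0.0
    for _ in range(max_steps):
        action = agent.act(obs)  # e.g., "go to the blue door", "pick up the grey key"
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        total_reward += reward
        if done:  # episode ends on retrieval or timeout
            break
    return trajectory, total_reward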

2. Sample-Efficient Online Learning in XMiniGrid

Because interaction within XMiniGrid is costly and rewards are both sparse and delayed, the environment poses a stringent test of sample efficiency for LM agents. Standard LM-based agents typically struggle in such domains due to limited utilization of failure information and inefficient experience replay. ECHO, as introduced in "Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting" (Hu et al., 11 Oct 2025), adapts hindsight experience replay from RL for use in LM agents by synthesizing alternative positive trajectories from failures. When an agent fails to achieve its initial goal (e.g., retrieving a grey key), ECHO rewrites the experience to provide an optimized route and goal that are achievable in retrospect (e.g., picking up a grey star encountered during the failed attempt), thus yielding synthetic positive supervision for future queries.
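The rewriting step can be pictured with the grey-key example above. This is a hedged sketch: the (observation, action, reward) trajectory encoding is an assumption, and the objects are taken from the example rather than actual benchmark data:

# Failed episode: the agent sought a grey key, found none, but passed a grey star.
failed_trajectory = [
    ("You see a grey star in the purple room.", "go to the purple room", 0.0),
    ("You are next to the grey star.",          "go to the yellow room", 0.0),
    ("The yellow room is empty.",               "look around",           0.0),
]

# Hindsight rewriting: relabel to a goal the trajectory could have achieved,
# yielding a synthetic positive example (shorter route, terminal reward).
hindsight_goal = "pick up the grey star"
rewritten_trajectory = [
    ("You see a grey star in the purple room.", "go to the purple room", 0.0),
    ("You are next to the grey star.",          "pick up the grey star", 1.0),
]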

3. The ECHO Framework: Trajectory Generation and Memory Compression

The ECHO framework comprises two principal components: the hindsight rule and the update rule.

  • Hindsight Rule: The LM agent analyzes the trajectory of its actions using summarization (LM.summarize), then identifies alternative achievable goals (LM.identify_goals) present in the trajectory. For each candidate goal, LM.infer_traj generates a revised, optimized trajectory reaching that goal via trajectory editing.
  • Update Rule: For each goal, ECHO maintains a compressed trajectory representation in experience memory—retaining only the shortest, most efficient workflow to minimize description length, inspired by Kolmogorov complexity. If a more efficient synthesized trajectory is discovered, it replaces the previous entry.

The process is illustrated by the following pseudocode:

def ECHO(LM, trajectory, replay_buf=None):
    # Hindsight rule + update rule applied to one episode's trajectory.
    if replay_buf is None:  # avoid Python's shared mutable default argument
        replay_buf = {}
    summary = LM.summarize(trajectory)      # condense the episode
    goals = LM.identify_goals(summary)      # alternative goals achievable in hindsight
    for goal in goals:
        new_traj = LM.infer_traj(goal, trajectory)  # rewrite an optimized path to this goal
        old_traj = replay_buf.get(goal)
        # Update rule: retain only the shortest known trajectory per goal.
        if old_traj is None or len(new_traj) < len(old_traj):
            replay_buf[goal] = new_traj
    return replay_buf
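A brief usage sketch follows; collect_trajectory, env, agent, and LM are assumed stand-ins for the rollout routine and language model, not names from the paper:

num_episodes = 20  # illustrative interaction budget
replay_buf = {}
for episode in range(num_episodes):
    # The agent conditions on the stored workflows when acting.
    trajectory = collect_trajectory(env, agent, replay_buf)
    replay_buf = ECHO(LM, trajectory, replay_buf)
# replay_buf now maps each discovered goal to its shortest known trajectory.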

This mechanism keeps the experience memory compact while retaining the most efficient known workflow for each goal, providing concise guidance for subsequent actions.

4. Unique Challenges in XMiniGrid

XMiniGrid introduces several unique difficulties:

  • Partial Observability: Agents do not have access to the full environment state, complicating the inference of object locations and feasible paths.
  • Sparse Rewards: Failures are prevalent during initial exploration; feedback is only occasionally provided, necessitating efficient learning from rare successes.
  • Goal Distraction: Focus on a single intended goal can lead to missed opportunities for collecting observable objects discovered during otherwise unsuccessful trajectories.

The ECHO framework directly addresses these challenges via counterfactual trajectory rewriting and memory compression. By synthesizing alternate successful workflows from failed attempts and actively maintaining an efficient experience buffer, LM agents can recover more swiftly and adapt to novel layouts.

5. Comparative Performance and Evaluation Metrics

Performance evaluation in XMiniGrid focuses on both the mean reward (success rate over fixed interaction horizons) and cumulative average reward. The cumulative metric is formalized as:

$$\text{Cumulative Average Reward at } \tau = \frac{1}{\tau+1} \sum_{t=0}^{\tau} R_t$$

where $R_t$ denotes the reward at episode $t$. In comparative tests, ECHO demonstrates accelerated increases in cumulative reward, outperforming baseline agents (e.g., ReAct) and sophisticated architectures such as Reflexion and AWM.
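As a quick check of the metric, the running mean can be computed directly from a per-episode reward log (illustrative values, not data from the paper):

# Cumulative average reward after each episode tau: mean of R_0..R_tau.
rewards = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]  # illustrative per-episode rewards R_t
cumulative_avg = []
running_sum = 0.0
for tau, r in enumerate(rewards):
    running_sum += r
    cumulative_avg.append(running_sum / (tau + 1))
print(cumulative_avg)  # [0.0, 0.0, 0.333..., 0.25, 0.4, 0.5]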

  • ECHO yields up to 80% higher average reward than vanilla ReAct agents.
  • Synthesized trajectories are executable, achieving approximately 85% success relative to ground-truth execution.
  • ECHO’s memory update rule ensures retention of concise yet highly effective workflows, mitigating memory overhead while preserving informative supervision.
  • ECHO begins to outperform baseline agents within three interactions in XMiniGrid.

This evidence supports ECHO's capacity for robust and sample-efficient learning, particularly in environments such as XMiniGrid where constructing a complete world model is infeasible.

Compared to prominent architectures:

  • Reflexion: Generates reflective commentary on failures, but does not synthesize or optimize actionable paths.
  • AWM: Extracts workflows only from successful episodes, lacking the capacity to rewrite or edit failed trajectories.
  • ECHO: Uniquely synthesizes actionable, optimized counterfactual trajectories and applies memory compression to maintain only the most efficient experiential templates.

The editor's term “counterfactual rewriting” encapsulates ECHO's approach of transforming failures into synthetic successes, an operation performed by neither Reflexion nor AWM.

6. Implications and Applications

ECHO’s performance in XMiniGrid provides evidence for the effectiveness of LLM agents in domains with sparse rewards and incomplete observability when equipped with mechanisms for hindsight trajectory editing and compressed experiential memory. This paradigm is applicable to other stateful text-based environments—such as enterprise simulations and collaborative information gathering—and suggests that adaptable trajectory management strategies can substantially improve online learning efficiency in LM agents faced with costly interactions and limited feedback.

References

  • Hu et al., "Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting," 11 October 2025.