- The paper presents the XTX algorithm, which separates exploitation and exploration phases to overcome sparse rewards in text games.
- It leverages self-imitation learning for exploitation and a curiosity-driven value function for exploration, efficiently navigating large action spaces.
- Empirical results on the Jericho benchmark show a 27% improvement in average normalized score over prior methods, including an average score of 103 on Zork1, twice the best previously reported result.
Multi-Stage Episodic Control for Strategic Exploration in Text Games
This paper addresses the inherent challenges reinforcement learning (RL) faces in text adventure games: combinatorially large action spaces and sparse rewards. These challenges are compounded by the nuanced language understanding such games demand, mirroring real-world complexity. To address them, the authors introduce a novel algorithm, eXploit-Then-eXplore (XTX), which strategically disentangles the exploitation and exploration phases within each episode of gameplay.
Core Contributions and Methodology
The primary contribution of the paper is the XTX framework, which addresses the exploration-exploitation dilemma by decoupling the two objectives into a distinct two-stage process within each gameplay episode:
- Exploitation Phase: The agent first follows an exploitation policy trained via self-imitation learning on its most rewarding past trajectories. This return policy reliably brings the agent back to the most promising frontier states it has reached so far (a minimal sketch of the episode-level loop follows this list).
- Exploration Phase: From that frontier, the agent switches to an exploration policy driven by a value function that combines a TD loss with a curiosity-driven inverse dynamics loss, steering it toward novel, potentially rewarding states that were previously out of reach (the combined objective is sketched below).
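To make the two-stage structure concrete, here is a minimal Python sketch of one XTX-style episode. The `env`, `exploit_policy`, `explore_policy`, and `trajectory_buffer` interfaces are illustrative assumptions rather than the authors' code, and the fixed switching probability stands in for the paper's more careful switching rule:

```python
import random

def run_episode(env, exploit_policy, explore_policy, trajectory_buffer,
                max_steps=100, switch_prob=0.1):
    """One XTX-style episode: first exploit to return to a promising
    frontier, then explore from there. Illustrative sketch only."""
    obs = env.reset()
    trajectory, score, exploring = [], 0, False

    for _ in range(max_steps):
        valid_actions = env.get_valid_actions()  # assumed helper
        if not exploring:
            # Phase 1: follow the self-imitation policy along paths
            # that scored well in the past.
            action = exploit_policy.act(obs, valid_actions)
            # Hand control to the explorer; the paper's switching rule
            # is more careful, a fixed probability is used for brevity.
            if random.random() < switch_prob:
                exploring = True
        else:
            # Phase 2: curiosity-driven exploration of novel states.
            action = explore_policy.act(obs, valid_actions)

        next_obs, reward, done, _info = env.step(action)
        trajectory.append((obs, action, reward))
        score += reward
        obs = next_obs
        if done:
            break

    # Keep the trajectory so future self-imitation can learn from it.
    trajectory_buffer.add(trajectory, score)
    return score
```

The key point is that self-imitation training of `exploit_policy` against the buffer's top-scoring trajectories happens between episodes, so the return phase becomes more reliable as better trajectories accumulate.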
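The exploration phase's objective can likewise be sketched. Assuming, for simplicity, a fixed discrete action set (Jericho actually exposes variable valid-action lists) and hypothetical networks `q_net` and `inv_head`, a hedged PyTorch reading of the combined TD-plus-inverse-dynamics loss might look like:

```python
import torch
import torch.nn.functional as F

def exploration_loss(q_net, inv_head, batch, gamma=0.9):
    """Combined objective for the exploration value network:
    one-step TD error plus an inverse-dynamics (curiosity) term.
    Illustrative sketch, not the authors' implementation."""
    # obs/next_obs: encoded state tensors; act: action ids (long);
    # rew: float rewards; done: float tensor in {0., 1.}.
    obs, act, rew, next_obs, done = batch

    # TD loss: fit Q(s, a) to r + gamma * max_a' Q(s', a').
    q_sa = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * q_net(next_obs).max(dim=1).values
    td_loss = F.smooth_l1_loss(q_sa, target)

    # Inverse-dynamics loss: predict which action linked (s, s').
    # Transitions the model cannot yet classify serve as a novelty
    # signal, steering exploration toward poorly understood states.
    action_logits = inv_head(obs, next_obs)
    inv_loss = F.cross_entropy(action_logits, act)

    return td_loss + inv_loss
```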
By using distinct strategies for the two phases, XTX outperforms traditional approaches that rely on a single unified policy for both tasks. This division of labor mirrors how humans handle similarly complex decision environments.
Empirical Evaluation
The robustness of XTX was evaluated on the Jericho benchmark of 12 interactive fiction games. In deterministic settings, XTX improved average normalized scores by 27% over prior methods. Notably, on the iconic game Zork1 it achieved an average score of 103, twice the best prior result, navigating past known game bottlenecks that had stalled earlier RL methods.
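For context on the metric, the normalized score commonly reported on Jericho is the raw game score divided by the game's maximum achievable score, averaged across games; a minimal sketch (Zork1's maximum is 350, so 103 corresponds to roughly 0.29):

```python
def avg_normalized_score(raw_scores, max_scores):
    """Average of per-game raw score / maximum achievable score."""
    return sum(r / m for r, m in zip(raw_scores, max_scores)) / len(raw_scores)

avg_normalized_score([103], [350])  # Zork1 alone: ~0.294
```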
In stochastic game settings, XTX remained effective, performing robustly across varied versions of the games and again improving scores on Zork1 beyond prior results obtained even in the easier deterministic setting.
Implications and Future Directions
The success of XTX underscores the value of multi-stage learning strategies in environments characterized by large, dynamic action spaces and sparse rewards. Structuring RL around distinct phases yields a more adaptive learning process that could transfer to other complex domains requiring strategic decision-making, including real-world applications such as automated customer service and conversational AI.
The paper also opens pathways for further research into unsolved challenges in language understanding and complex decision spaces. Future work could integrate models that more deeply exploit linguistic cues and semantic structure, potentially improving agents' ability to understand and leverage context.
In conclusion, the research offers a promising avenue towards improving RL agents' strategic navigation of complex environments. The distinct treatment of exploration and exploitation phases holds significant promise for advancing AI proficiency in contexts far removed from traditionally structured environments, providing a strong foundation for future advancements in this domain.