- The paper presents the XTX algorithm, which separates exploitation and exploration phases to overcome sparse rewards in text games.
- It leverages self-imitation learning for exploitation and a curiosity-driven value function for exploration, efficiently navigating large action spaces.
- Empirical results on the Jericho benchmark show a 27% improvement in average normalized score over prior methods, including an average score of 103 on Zork1, twice the best previously reported result.
Multi-Stage Episodic Control for Strategic Exploration in Text Games
This paper addresses the inherent challenges reinforcement learning (RL) faces in text adventure games: combinatorially large action spaces and sparse rewards. These challenges are compounded by the nuanced language understanding such games demand, mirroring real-world complexity. To address them, the authors introduce a novel algorithm, eXploit-Then-eXplore (XTX), which strategically disentangles the exploitation and exploration phases within each episode of gameplay.
Core Contributions and Methodology
The primary contribution of the paper is the XTX framework, which addresses the exploration-exploitation dilemma by decoupling the two objectives into a distinct two-stage process within each gameplay episode:
- Exploitation Phase: The agent first follows an exploitation policy trained via self-imitation learning on its most rewarding past trajectories. This return policy reliably brings the agent back to the most promising frontier states it has reached so far (a minimal sketch of the episode-level loop follows this list).
- Exploration Phase: From that frontier, the agent switches to an exploration policy driven by a value function that combines a TD loss with a curiosity-driven inverse dynamics loss, steering it toward novel, potentially rewarding states that were previously out of reach (the combined objective is sketched below).
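To make the two-stage structure concrete, here is a minimal Python sketch of one XTX-style episode. The `env`, `exploit_policy`, `explore_policy`, and `trajectory_buffer` interfaces are illustrative assumptions rather than the authors' code, and the fixed switching probability stands in for the paper's more careful switching rule:

```python
import random

def run_episode(env, exploit_policy, explore_policy, trajectory_buffer,
                max_steps=100, switch_prob=0.1):
    """One XTX-style episode: first exploit to return to a promising
    frontier, then explore from there. Illustrative sketch only."""
    obs = env.reset()
    trajectory, score, exploring = [], 0, False

    for _ in range(max_steps):
        valid_actions = env.get_valid_actions()  # assumed helper
        if not exploring:
            # Phase 1: follow the self-imitation policy along paths
            # that scored well in the past.
            action = exploit_policy.act(obs, valid_actions)
            # Hand control to the explorer; the paper's switching rule
            # is more careful, a fixed probability is used for brevity.
            if random.random() < switch_prob:
                exploring = True
        else:
            # Phase 2: curiosity-driven exploration of novel states.
            action = explore_policy.act(obs, valid_actions)

        next_obs, reward, done, _info = env.step(action)
        trajectory.append((obs, action, reward))
        score += reward
        obs = next_obs
        if done:
            break

    # Keep the trajectory so future self-imitation can learn from it.
    trajectory_buffer.add(trajectory, score)
    return score
```

The key point is that self-imitation training of `exploit_policy` against the buffer's top-scoring trajectories happens between episodes, so the return phase becomes more reliable as better trajectories accumulate.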
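The exploration phase's objective can likewise be sketched. Assuming, for simplicity, a fixed discrete action set (Jericho actually exposes variable valid-action lists) and hypothetical networks `q_net` and `inv_head`, a hedged PyTorch reading of the combined TD-plus-inverse-dynamics loss might look like:

```python
import torch
import torch.nn.functional as F

def exploration_loss(q_net, inv_head, batch, gamma=0.9):
    """Combined objective for the exploration value network:
    one-step TD error plus an inverse-dynamics (curiosity) term.
    Illustrative sketch, not the authors' implementation."""
    # obs/next_obs: encoded state tensors; act: action ids (long);
    # rew: float rewards; done: float tensor in {0., 1.}.
    obs, act, rew, next_obs, done = batch

    # TD loss: fit Q(s, a) to r + gamma * max_a' Q(s', a').
    q_sa = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * q_net(next_obs).max(dim=1).values
    td_loss = F.smooth_l1_loss(q_sa, target)

    # Inverse-dynamics loss: predict which action linked (s, s').
    # Transitions the model cannot yet classify serve as a novelty
    # signal, steering exploration toward poorly understood states.
    action_logits = inv_head(obs, next_obs)
    inv_loss = F.cross_entropy(action_logits, act)

    return td_loss + inv_loss
```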
By using distinct strategies for the two phases, XTX outperforms traditional approaches that rely on a single unified policy for both tasks. This division of labor mirrors how humans handle similarly complex decision environments.
Empirical Evaluation
The robustness of XTX was evaluated on the Jericho benchmark of 12 interactive fiction games. In deterministic settings, XTX improved average normalized scores by 27% over prior methods. Notably, on the iconic game Zork1 it achieved an average score of 103, twice the best prior result, navigating past known game bottlenecks that had stalled earlier RL methods.
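For context on the metric, the normalized score commonly reported on Jericho is the raw game score divided by the game's maximum achievable score, averaged across games; a minimal sketch (Zork1's maximum is 350, so 103 corresponds to roughly 0.29):

```python
def avg_normalized_score(raw_scores, max_scores):
    """Average of per-game raw score / maximum achievable score."""
    return sum(r / m for r, m in zip(raw_scores, max_scores)) / len(raw_scores)

avg_normalized_score([103], [350])  # Zork1 alone: ~0.294
```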
In stochastic game settings, XTX remained effective, performing robustly across varied versions of the games and again improving scores on Zork1 beyond prior results obtained even in the easier deterministic setting.
Implications and Future Directions
The success of XTX underscores the value of multi-stage learning strategies in environments characterized by large, dynamic action spaces and sparse rewards. Structuring RL around distinct phases yields a more adaptive learning process that could transfer to other complex domains requiring strategic decision-making, including real-world applications such as automated customer service and conversational AI.
The paper also opens pathways for further research into unsolved challenges in language understanding and complex decision spaces. Future work could integrate models that more deeply exploit linguistic cues and semantic structure, potentially improving agents' ability to understand and leverage context.
In conclusion, the research offers a promising avenue towards improving RL agents' strategic navigation of complex environments. The distinct treatment of exploration and exploitation phases holds significant promise for advancing AI proficiency in contexts far removed from traditionally structured environments, providing a strong foundation for future advancements in this domain.