- The paper demonstrates that integrating context from locations, objects, and characters significantly enhances language grounding in interactive agents.
- It employs a dataset of over 11,000 crowdsourced episodes to train and evaluate both BERT-based ranking and transformer generative models.
- Empirical results show the BERT Bi-Ranker achieving 76.5% R@1 on the seen test set, highlighting both the benefits and the challenges of grounded dialogue.
Analyzing Language Grounding in Fantasy Text Adventure Games
The paper "Learning to Speak and Act in a Fantasy Text Adventure Game" presents an exploration of situated dialogue agents within an interactive environment called LIGHT (Learning in Interactive Games with Humans and Text). The LIGHT platform offers a browser-based interface where users, either human or model-based agents, can engage in complex interactions. These interactions involve verbal dialogue, emotes, and actionable events within a large, diverse virtual fantasy world. This world is meticulously structured with 663 distinct locations, 3462 objects, and 1755 characters, each described in natural language.
Approach and Methodology
The primary research focus is on enhancing the grounding of conversational agents. Grounding refers to an agent's ability to base its dialogue not only on the conversation history but also on contextual knowledge, including the location description, its immediate surroundings, nearby objects, and the other characters present. The authors hypothesize that integrating such grounding leads to more coherent and contextually appropriate interactions.
The paper presents a comprehensive dataset gathered through crowdsourcing, tailored for evaluating language grounding in gameplay scenarios. This dataset includes over 11,000 episodes of human-to-human interactions captured in this synthetic fantasy setting, replete with actions, emotes, and dialogues. The data serves as the foundation for training state-of-the-art generative and retrieval-based neural models that engage in situated conversation and action prediction.
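In practice, grounding of this kind is typically flattened into a single text sequence that precedes the dialogue history before being fed to a model. The sketch below illustrates the idea with made-up delimiter markers; the exact markers and ordering used in the paper's released code may differ.

```python
from typing import List

def build_grounded_input(setting_name: str,
                         setting_desc: str,
                         self_persona: str,
                         object_descs: List[str],
                         dialogue_history: List[str]) -> str:
    """Flatten grounding context plus dialogue history into one model input.

    The delimiter tokens below are illustrative placeholders, not the
    exact feature markers used in the paper's code.
    """
    parts = [
        f"_setting_name_ {setting_name}",
        f"_setting_desc_ {setting_desc}",
        f"_self_persona_ {self_persona}",
    ]
    parts += [f"_object_ {desc}" for desc in object_descs]
    parts += dialogue_history          # most recent utterances last
    return "\n".join(parts)

context = build_grounded_input(
    setting_name="Throne room",
    setting_desc="A vast hall lined with banners.",
    self_persona="I advise the crown and study forbidden tomes.",
    object_descs=["An ornate gilded throne."],
    dialogue_history=["King: Wizard, what news from the borderlands?"],
)
print(context)
```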
Key Findings
For evaluation, the paper compares BERT-based retrieval models (a Bi-Ranker and a Cross-Ranker) with generative models built on the Transformer architecture. These models are tested on predicting contextually appropriate dialogue, emotes, and actions. Results indicate that models given the full contextual grounding outperform those relying solely on dialogue history, highlighting the importance of environmental perception for dialogue coherence and relevance.
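To illustrate the retrieval setup, the sketch below shows how a bi-encoder ranker of this kind scores candidate responses: the grounded context and each candidate are encoded independently and compared by dot product. It uses the off-the-shelf `bert-base-uncased` checkpoint from Hugging Face purely for illustration and is not the authors' trained model.

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Illustrative bi-encoder scoring; not the paper's exact architecture or weights.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased").eval()

def embed(texts):
    """Encode texts and use the [CLS] vector as a fixed-size representation."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]       # (batch, hidden)

context = "_setting_desc_ A vast hall lined with banners.\nKing: What news, wizard?"
candidates = [
    "My liege, the borderlands are quiet for now.",
    "I enjoy long walks on the beach.",
    "The dragon has been sighted near the eastern pass.",
]

ctx_vec = embed([context])                    # (1, hidden)
cand_vecs = embed(candidates)                 # (n_candidates, hidden)
scores = cand_vecs @ ctx_vec.squeeze(0)       # dot-product relevance scores
print(candidates[int(scores.argmax())])
```

A cross-ranker instead concatenates the context and each candidate into a single sequence and scores them jointly, which tends to be more accurate but slower, since candidate representations cannot be precomputed.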
Numerical Results and Impacts
- The BERT Bi-Ranker achieved a dialogue R@1 of 76.5% on the seen test set versus 70.5% on the unseen test set (R@1 as sketched after this list), indicating the difficulty of transferring learned grounding to novel settings.
- Generative models, despite exploiting grounding features, proved less effective by perplexity and F1 measures than the retrieval-based models.
- Human performance surpasses that of all models, particularly on unseen test cases, underscoring how far fully grounded dialogue understanding still has to go.
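For reference, the two headline metrics can be approximated as follows. This is a simplified sketch; the paper's evaluation code may tokenize and normalize differently.

```python
from collections import Counter
from typing import List

def recall_at_1(scores: List[float], gold_index: int) -> float:
    """1.0 if the gold candidate received the highest score, else 0.0."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return 1.0 if best == gold_index else 0.0

def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated response and the reference."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(recall_at_1([0.1, 0.9, 0.3], gold_index=1))          # 1.0
print(round(unigram_f1("the borderlands are quiet",
                       "my liege the borderlands are quiet"), 3))  # 0.8
```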
Implications and Future Work
The research delineates the considerable complexity involved in developing grounded language understanding for AI within interactive narratives. The synthesis of character interplay, object manipulation, and environmental description provides a rich set of linguistic challenges not often encountered in other machine learning settings. The results have significant implications for future AI development, particularly in areas requiring sophisticated human-computer interaction.
Future work can explore:
- Improving models' ability to handle unseen settings, for example through transfer learning.
- Further dissecting the impact of each grounding component (e.g., persona versus environmental description) on dialogue fluency (see the ablation sketch after this list).
- Expanding datasets to include more diverse interaction types that might capture additional cultural elements or reference deeper narrative contexts.
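As a sketch of the ablation idea in the second bullet, one could enumerate subsets of grounding components and retrain or re-evaluate a model on inputs built from each subset. The component names and the `evaluate` placeholder below are assumptions for illustration, not part of the paper's code.

```python
from itertools import combinations

# Hypothetical grounding components to toggle on and off; the names are
# illustrative, not the dataset's actual field names.
COMPONENTS = ["setting_description", "self_persona", "object_descriptions"]

def evaluate(active_components):
    """Placeholder: build inputs from only these components, then
    train/evaluate a model and return its score (not implemented here)."""
    raise NotImplementedError

def ablation_subsets():
    """Yield every subset of grounding components, from none to all."""
    for k in range(len(COMPONENTS) + 1):
        yield from combinations(COMPONENTS, k)

for subset in ablation_subsets():
    print("would evaluate a model grounded on:", subset or ("dialogue history only",))
```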
In conclusion, this research provides a foundational step toward embedding intelligent dialogue in interactive environments, leveraging comprehensive grounding to move closer to human-level communication. In doing so, it points beyond the paradigms traditionally found in conversational AI and opens avenues for richer, more immersive AI experiences.