Text Adventure Learning Environment Suite: Evaluating Reasoning Capabilities of LLMs
Introduction
The paper presents the Text Adventure Learning Environment Suite, a benchmark designed to assess and challenge the reasoning abilities of LLMs. As tasks grow more intricate, they demand sophisticated reasoning and sequential decision-making. The benchmark probes these capabilities through text-adventure games, environments long recognized for their complexity and long-horizon causal dependencies.
Core Reasoning Skills
The authors emphasize four critical reasoning skills for LLM-driven agents: spatial reasoning, deductive reasoning, inductive reasoning, and grounded reasoning. Each skill contributes distinctly to an agent's ability to navigate and complete tasks within text-adventure environments (a minimal agent-loop sketch follows the list):
- Spatial Reasoning: This involves understanding spatial relationships and efficiently navigating through environments, which is crucial in interactive settings where path-finding and object localization are necessary.
- Deductive Reasoning: This skill enables the agent to act based on general principles and is particularly vital when interactions are costly or constrained.
- Inductive Reasoning: Here, the agent draws conclusions from observations and interactions, accommodating unknown environment rules and adapting to context-specific affordances.
- Grounded Reasoning: This involves situational awareness, where the agent must base decisions on relevant information from a growing historical context as tasks lengthen.
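To make the interaction pattern concrete, the following is a minimal sketch of an agent loop for a text-adventure environment. The `env` interface (`reset`/`step`) and the `llm_choose_action` helper are hypothetical placeholders for illustration, not the benchmark's actual API.

```python
# Minimal sketch of an LLM-agent loop for a text-adventure game.
# The environment interface (reset/step) and llm_choose_action are
# hypothetical placeholders, not the benchmark's actual API.

def play_episode(env, llm_choose_action, max_steps=100):
    """Run one episode, feeding the full interaction history to the agent."""
    observation = env.reset()              # initial scene description
    history = [("observation", observation)]
    total_score = 0

    for _ in range(max_steps):
        # The agent must ground its decision in the growing history
        # (grounded reasoning) and infer unstated rules from feedback
        # (inductive reasoning).
        action = llm_choose_action(history)          # e.g. "go north", "take lamp"
        observation, score, done = env.step(action)  # assumed return signature
        history.append(("action", action))
        history.append(("observation", observation))
        total_score = score
        if done:
            break
    return total_score, history
```

The key design point this sketch highlights is that the context the agent must reason over grows with every step, so later decisions depend on retrieving sparse, relevant details from a long transcript.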
The Benchmark: Text Adventure Learning Environment Suite
The benchmark comprises synthetic and human-written text-adventure games designed to evaluate the composite reasoning skills outlined above. Notably, the suite presents games in their canonical forms rather than biasing agents with excessive expert knowledge. A distinctive element of the benchmark is the inclusion of the “Simon Says” game, which serves as a preliminary test of an agent's ability to follow complex instruction sequences over extended contexts. Performance on this task is highly predictive of success in the more comprehensive environments.
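As an illustration of how such an instruction-following probe over a growing context might be scored, here is a minimal Simon-Says-style sketch. The command list, the `ask_agent` callable, and the exact scoring rule are assumptions made for illustration; the benchmark's actual prompts and scoring may differ.

```python
import random

# Illustrative "Simon Says"-style probe for long-context instruction
# following. Commands, the ask_agent callable, and the scoring rule are
# hypothetical; the real benchmark's setup may differ.

COMMANDS = ["jump", "clap", "wave", "spin", "nod"]

def simon_says_probe(ask_agent, num_turns=50, seed=0):
    """Return the fraction of turns on which the agent responds correctly."""
    rng = random.Random(seed)
    transcript = []
    correct = 0

    for _ in range(num_turns):
        command = rng.choice(COMMANDS)
        simon = rng.random() < 0.5
        prompt = f"Simon says {command}" if simon else command
        transcript.append(prompt)

        # The agent sees the entire transcript so far and must decide
        # whether the latest command should actually be executed.
        response = ask_agent("\n".join(transcript))
        expected = command if simon else "do nothing"
        if response.strip().lower() == expected:
            correct += 1
    return correct / num_turns
```

Even though each individual rule is trivial, the probe stresses the same failure mode as the full games: keeping track of which instructions in a long, growing transcript actually matter.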
Evaluation and Results
The paper evaluates 34 LLMs across a suite of 122 games. Initial tests reveal that agents struggle significantly with long-horizon contexts, where important information is sparsely distributed, suggesting strong limitations in current reasoning capabilities. Most notably, agents perform poorly in human-written games, with even the top models failing to achieve notable scores in zero-shot settings.
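For concreteness, per-model results like these might be aggregated as follows; the per-game `(points_earned, max_points)` format and the example numbers below are illustrative assumptions, not the paper's actual data or schema.

```python
from statistics import mean

# Sketch of aggregating per-game results into a model-level score.
# The (points_earned, max_points) format and the example numbers are
# illustrative assumptions, not the benchmark's actual schema or data.

def aggregate_scores(results):
    """results: dict mapping game name -> (points_earned, max_points)."""
    normalized = {
        game: earned / max_points if max_points else 0.0
        for game, (earned, max_points) in results.items()
    }
    return {
        "mean_normalized_score": mean(normalized.values()),
        "games_completed": sum(1 for v in normalized.values() if v >= 1.0),
        "num_games": len(normalized),
    }

# Example with made-up numbers for two synthetic games and one
# human-written game.
print(aggregate_scores({
    "synthetic_nav_1": (8, 10),
    "synthetic_cook_2": (3, 12),
    "human_written_1": (5, 100),
}))
```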
Implications and Future Directions
The findings underscore the notable gaps in the current generation of LLMs regarding composite reasoning capabilities required for complex, environment-grounded tasks. This limitation highlights the need for further research to enhance LLMs' ability to handle long-horizon dependencies and implicit environmental rules, which are critical for achieving human-level performance in text-based interactive tasks.
The evolution of this benchmark presents opportunities for developing more resilient and adaptive LLMs, potentially leading to their application in real-world scenarios requiring sophisticated decision-making. It also sets a research direction for enhancing model architectures and training paradigms to better capture and utilize the reasoning skills necessary for complex tasks.
In conclusion, the Text Adventure Learning Environment Suite adds a valuable dimension to the benchmarking of LLMs. It challenges researchers to push the boundaries of what these models can achieve in reasoning and decision-making, with the ultimate aim of AI systems that can interact with and adapt to the complexities of real-world environments.