- The paper introduces EscapeBench as a novel benchmark to assess creative problem-solving in room escape scenarios for language models.
- The paper presents the EscapeAgent framework with Foresight and Reflection modules, which reduce the steps required and hint dependency by up to 40%.
- The experimental results reveal fundamental limitations in traditional LMs and demonstrate improved long-chain reasoning and coherent action execution.
EscapeBench: Evaluating and Enhancing Creative Reasoning in LLMs
The paper presents EscapeBench, a novel benchmark designed to evaluate the creative reasoning abilities of language model (LM) agents in room escape game environments. Traditional benchmarks for LMs have primarily assessed performance on goal-oriented tasks with clear objectives, often neglecting the kind of creativity that involves adaptive problem-solving in unfamiliar scenarios. EscapeBench seeks to address this gap by challenging models with problems that require unconventional tool usage and iterative reasoning to uncover implicit objectives.
Core Contributions
- Benchmark Introduction: EscapeBench provides a suite of room escape games where current LM agents, even when equipped with mechanisms like working memory and Chain-of-Thought (CoT) reasoning, demonstrate substantial limitations, achieving on average only 15% progress without hints.
- EscapeAgent Framework: To enhance creative reasoning, the paper introduces EscapeAgent, a framework incorporating Foresight and Reflection modules. Foresight emphasizes innovative tool use, while Reflection focuses on task identification and tracking unsolved problems. These components work together to improve the creative capacities of LMs.
- Empirical Validation: Experiments with EscapeAgent demonstrate significant improvements in task performance. The agent successfully executes action chains of over 1,000 steps while maintaining logical coherence, reducing the number of steps taken and reliance on hints by up to 40% compared to baseline models.
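The division of labor between the two modules can be pictured as a simple agent loop: Foresight fires when new items are acquired and pairs them against open tasks, while Reflection keeps the list of unsolved tasks up to date. The following is a minimal illustrative sketch; all class names, method signatures, and data structures are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class EscapeAgentSketch:
    """Hypothetical skeleton of an EscapeAgent-style loop.

    foresight(): on acquiring an item, enumerate candidate (item, task)
    pairings, including unconventional ones, for the LM to evaluate.
    reflect(): track which tasks remain unsolved, so the agent keeps
    long action chains coherent instead of forgetting open problems.
    """
    inventory: set = field(default_factory=set)
    unsolved_tasks: list = field(default_factory=list)

    def foresight(self, new_item: str) -> list:
        # Pair the newly acquired item with every open task; an LM would
        # then score these candidates for plausible (creative) tool use.
        self.inventory.add(new_item)
        return [(new_item, task) for task in self.unsolved_tasks]

    def reflect(self, task: str, solved: bool) -> None:
        # Keep the running list of open tasks consistent with the latest
        # environment feedback.
        if solved and task in self.unsolved_tasks:
            self.unsolved_tasks.remove(task)
        elif not solved and task not in self.unsolved_tasks:
            self.unsolved_tasks.append(task)


agent = EscapeAgentSketch()
agent.reflect("open drawer", solved=False)   # Reflection logs an open task
candidates = agent.foresight("hairpin")      # Foresight pairs the new item
print(candidates)  # [('hairpin', 'open drawer')]
```

In this sketch, the payoff is that Foresight never proposes actions against tasks Reflection has already marked solved, which mirrors the paper's claim that the two modules jointly reduce wasted steps and hint reliance.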
Implications for AI Research
The paper's findings highlight deficiencies in current LLMs regarding creative problem-solving, an area often overshadowed by the focus on analytical intelligence. The successful implementation of EscapeAgent underscores the potential benefits of integrating reflective and foresighted reasoning capabilities into AI systems, particularly in scenarios requiring innovative thinking and adaptation.
From a practical perspective, enhancing a model's creative reasoning has direct implications for the development of AI agents used in dynamic, real-world applications, where adaptability and creativity are crucial. The inclusion of diverse and complex problem environments like those in EscapeBench can greatly contribute to the training of more robust, versatile AI systems.
Future Directions
The results presented suggest multiple avenues for further research. There is room to investigate the integration of multimodal inputs, which would allow agents to interpret visual and auditory cues alongside textual data, thereby creating a more realistic emulation of human-like reasoning. Additionally, exploring reinforcement learning paradigms within these creative contexts may yield better strategies for task completion without the need for extensive hand-designed rules or models.
Another intriguing prospect is the interplay between human and AI creativity. Collaborative problem-solving between humans and AI could combine the intuitive strengths of humans with the methodical reasoning of machines, potentially leading to more innovative solutions.
Conclusion
EscapeBench sets the stage for a more nuanced exploration of creativity in AI, particularly how LMs can overcome their conventional reasoning patterns to achieve higher-level reasoning and adaptability. As the landscape of AI continues to evolve, the pursuit of creative intelligence will remain a significant frontier, driving further advancements not only in the capabilities of AI systems but also in their applications across diverse domains.