Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models (2404.03622v3)

Published 4 Apr 2024 in cs.CL

Abstract: Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit spatial reasoning of LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal LLMs (MLLMs) in these tasks. While VoT works surprisingly well on LLMs, the ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs. Please find the dataset and code at https://microsoft.github.io/visualization-of-thought

Visualization-of-Thought Prompting Enhances LLMs' Spatial Reasoning

Introduction to Visualization-of-Thought (VoT) Prompting

Spatial reasoning is an indispensable aspect of human cognition, allowing us to navigate, interact with, and understand our environment. Despite the proficiency of large language models (LLMs) in various reasoning tasks, their spatial reasoning capabilities remain underexplored. Building on the human ability to construct and manipulate mental images, often called the "mind's eye," the paper proposes Visualization-of-Thought (VoT) prompting as a novel approach to elicit spatial reasoning in LLMs.

VoT prompting encourages LLMs to visualize their reasoning steps explicitly, providing a visuospatial sketchpad that guides subsequent reasoning. Unlike conventional prompting approaches, it relies on generating mental images to solve natural language navigation, visual navigation, and visual tiling problems in synthetic 2D grid environments. The results show that VoT prompting enhances LLMs' spatial reasoning abilities, surpassing existing methods and multimodal LLMs (MLLMs) on these tasks.

Exploration of LLMs' Spatial Reasoning Capabilities

The paper selects three tasks (natural language navigation, visual navigation, and visual tiling) to test LLMs' spatial reasoning skills. Each task requires reasoning about space, direction, and geometric shapes, which makes the set well suited to assessing LLMs' spatial awareness.

Natural-Language Navigation

This task, inspired by human cognitive navigation capabilities, asks LLMs to navigate an underlying spatial structure that is described only in text. Models need to recognize landmarks and move between them according to the instructions, which tests their ability to form and use mental maps from textual descriptions. VoT prompting markedly improved performance on this task.
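
To make the idea of a mental map built from text concrete, here is a minimal Python sketch of the bookkeeping such a task implies. It is an illustration under assumptions rather than the paper's benchmark: the map contents, the direction words, and the function name `follow_directions` are all invented for this example.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, column)

# Hypothetical direction vocabulary, chosen for this sketch only.
STEP: Dict[str, Coord] = {
    "north": (-1, 0),
    "south": (1, 0),
    "west": (0, -1),
    "east": (0, 1),
}


def follow_directions(
    landmarks: Dict[Coord, str], start: Coord, directions: List[str]
) -> str:
    """Walk the map one instruction at a time and return the landmark reached."""
    row, col = start
    for direction in directions:
        d_row, d_col = STEP[direction]
        row, col = row + d_row, col + d_col
        if (row, col) not in landmarks:
            raise ValueError(f"'{direction}' leaves the described map at {(row, col)}")
    return landmarks[(row, col)]


# Invented 2x2 map: each cell holds one landmark named in the text description.
TOWN: Dict[Coord, str] = {
    (0, 0): "bakery",
    (0, 1): "library",
    (1, 0): "park",
    (1, 1): "station",
}
```

Calling `follow_directions(TOWN, (0, 0), ["east", "south"])` walks from the bakery past the library to the station. A model answering the same question has to keep an equivalent map and position in its reasoning trace.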

Visual Navigation and Tiling

The visual navigation and visual tiling tasks require models to interpret a 2D grid world, a more direct form of spatial reasoning. In visual navigation, LLMs generate navigation instructions; in visual tiling, they fit geometric shapes into a given space. Both tasks demand a robust understanding of spatial relationships and constraints. Applying VoT prompting in these settings guided LLMs to visualize their intermediate reasoning and significantly improved their performance.
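
The kind of state tracking that visual navigation calls for can be sketched as a small grid simulator that redraws the map after every move. This is a hedged illustration, not the paper's environment: the symbols (`.` free, `#` obstacle, `@` agent), the move names, and the example grid are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, column)

# Hypothetical move vocabulary and cell symbols, assumed for this sketch.
MOVES: Dict[str, Coord] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}
OBSTACLE = "#"
AGENT = "@"


def render(grid: List[str], pos: Coord) -> str:
    """Return the grid as text with the agent drawn at its current cell."""
    rows = []
    for r, line in enumerate(grid):
        cells = list(line)
        if r == pos[0]:
            cells[pos[1]] = AGENT
        rows.append("".join(cells))
    return "\n".join(rows)


def step(grid: List[str], pos: Coord, move: str) -> Coord:
    """Apply one move, rejecting moves that leave the grid or hit an obstacle."""
    d_row, d_col = MOVES[move]
    row, col = pos[0] + d_row, pos[1] + d_col
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        raise ValueError(f"move '{move}' leaves the grid")
    if grid[row][col] == OBSTACLE:
        raise ValueError(f"move '{move}' hits an obstacle")
    return (row, col)


def trace(grid: List[str], start: Coord, moves: List[str]) -> Tuple[Coord, List[str]]:
    """Return the final position and one rendered frame per state visited."""
    pos = start
    frames = [render(grid, pos)]
    for move in moves:
        pos = step(grid, pos, move)
        frames.append(render(grid, pos))
    return pos, frames


# Invented 3x3 example grid.
GRID = [
    "..#",
    ".#.",
    "...",
]
```

`trace(GRID, (0, 0), ["down", "down", "right", "right", "up"])` yields six frames, one for the start state and one after each move. Interleaving a frame with every step is the textual analogue of the visualized reasoning trace that VoT asks a model to produce.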

Visualization-of-Thought Prompting Mechanism

VoT prompting augments LLMs with a "visuospatial sketchpad" that lets them visualize each reasoning step. The approach is grounded in cognitive science and mirrors the way humans use mental imagery to enhance spatial awareness and inform decisions. It is most effective in tasks that require multi-hop spatial reasoning, where the model draws on the sequence of visualized states to navigate complex spatial environments.
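
At the prompt level, the mechanism amounts to appending a visualization instruction to the task description. The sketch below shows one way to do that in Python. The instruction wording and the helper name `build_prompt` are assumptions for illustration, and no model API is called.

```python
# Assumed wording for the visualization instruction; treat it as illustrative.
VOT_INSTRUCTION = "Visualize the state after each reasoning step."


def build_prompt(task_description: str, use_vot: bool = True) -> str:
    """Compose a zero-shot prompt, optionally ending with the VoT instruction."""
    parts = [task_description.strip()]
    if use_vot:
        parts.append(VOT_INSTRUCTION)
    return "\n\n".join(parts)


# Invented task text, used only to show the resulting prompt shape.
EXAMPLE_TASK = (
    "You are at the top-left cell of a 3x3 grid. "
    "Give the moves needed to reach the bottom-right cell."
)
```

`build_prompt(EXAMPLE_TASK)` returns the task text followed by the visualization instruction, while `build_prompt(EXAMPLE_TASK, use_vot=False)` returns the task text alone, which gives a simple baseline for comparison.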

Empirical Validation and Methodological Strength

Experimental findings showed that VoT performed best across all three tasks and outperformed contemporary multimodal LLMs. This result supports the case for eliciting visuospatial reasoning in LLMs and suggests that generating and manipulating mental images may be a key factor in solving spatial reasoning tasks.

The paper also analyzes LLM outputs in depth, examining the models' ability to track visual state and the variety of visualization formats they generate. Notably, LLMs showed a self-refinement behavior: on detecting inconsistencies in their visualizations, they adjusted their reasoning, which points to an early capacity for self-correction.

Future Directions and Implicit Opportunities

The researchers note that the study of LLMs' spatial reasoning through VoT prompting is still at an early stage. Given the promising results in 2D grid worlds, future work could extend to more complex scenarios, including 3D environments, to better understand LLMs' potential in spatial reasoning. Investigating the factors behind this emergent ability, such as exposure to ASCII art during training, offers a promising route to understanding how LLMs develop cognitive-like capabilities.

Conclusion

The innovative approach of VoT prompting opens new horizons for enhancing LLMs' spatial reasoning abilities. By mirroring human cognitive processes, specifically the manipulation and visualization of mental images, VoT demonstrates remarkable potential in improving spatial awareness and reasoning in LLMs. As we venture further into understanding and enriching the "mind's eye" of LLMs, this research lays a foundational stone, promising profound implications for AI's cognitive capabilities in spatial reasoning and beyond.