GenSpace: Benchmarking Spatially-Aware Image Generation
The paper "GenSpace: Benchmarking Spatially-Aware Image Generation" focuses on evaluating the spatial awareness capabilities of advanced AI image generation models. Spatial awareness is a crucial aspect of photography, where humans naturally organize and compose scenes in 3D space. However, AI models, despite their proficiency in generating visually appealing images, encounter significant challenges when translating text or image prompts into spatially coherent images.
Core Contributions
- Benchmarking Spatial Awareness in Image Generation Models: GenSpace establishes a benchmark and evaluation pipeline designed specifically to assess the spatial awareness of image generation models. The framework is grounded in real-world photographic composition and systematically categorizes spatial awareness along three dimensions: Spatial Pose, Spatial Relation, and Spatial Measurement.
- Evaluation Pipeline and Metric: Recognizing that Vision-Language Models (VLMs) often fail to capture spatial errors accurately, the authors propose an automated evaluation pipeline built on visual foundation models. The pipeline reconstructs the 3D scene geometry from generated images, yielding a human-aligned metric of spatial faithfulness (a sketch of such a geometry-based check follows this list).
- Empirical Evaluation: The benchmark comprises 1,800 samples covering text-to-image generation and image editing tasks and is applied to a range of open- and closed-source models. The results show that current models struggle with spatial understanding, highlighting key limitations and areas for improvement.
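To make the pipeline concrete, below is a minimal sketch of the kind of geometry-based check it describes: lift detected objects into 3D via estimated depth and a pinhole camera model, then verify a prompted measurement. The functions `estimate_depth` and `detect_objects` are hypothetical placeholders standing in for the visual foundation models (replaced here with dummies so the sketch runs end to end), and the intrinsics are made up; this is an illustration under those assumptions, not the paper's exact tooling.

```python
import numpy as np

# Hypothetical stand-ins for the visual foundation models in the pipeline.
# A real pipeline would plug in a monocular depth estimator and an
# open-vocabulary detector; these dummies just make the sketch runnable.
def estimate_depth(image):
    """Return a per-pixel metric depth map, shape (H, W)."""
    return np.full(image.shape[:2], 3.0)  # pretend everything is 3 m away

def detect_objects(image, labels):
    """Return a (u, v) pixel center for each prompted object label."""
    return {label: (100 + 200 * i, 240) for i, label in enumerate(labels)}

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into camera-space 3D (pinhole model)."""
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def check_distance(image, label_a, label_b, target_m, tol=0.25,
                   intrinsics=(500.0, 500.0, 320.0, 240.0)):
    """Score one measurement prompt: is the A-B distance within tol of target_m?"""
    fx, fy, cx, cy = intrinsics
    depth = estimate_depth(image)
    centers = detect_objects(image, [label_a, label_b])
    pts = {lbl: backproject(u, v, float(depth[v, u]), fx, fy, cx, cy)
           for lbl, (u, v) in centers.items()}
    measured = float(np.linalg.norm(pts[label_a] - pts[label_b]))
    return abs(measured - target_m) / target_m <= tol, measured

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in generated image
ok, measured = check_distance(image, "chair", "lamp", target_m=1.0)
print(f"measured {measured:.2f} m -> {'pass' if ok else 'fail'}")
```

The point of grounding the metric in reconstructed geometry rather than in a VLM's judgment is that a distance or size error becomes a number that can be thresholded, which is easier to align with human perception of spatial faithfulness.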
Evaluation Dimensions
- Spatial Pose: This dimension evaluates models' understanding of the 3D position and orientation of objects and the camera. Sub-domains include object pose, camera pose, and complex poses involving multiple objects. Models often misinterpret side views, suggesting a limited grasp of camera placement and orientation transformations.
- Spatial Relation: Models are assessed on egocentric (viewer-centered) and allocentric (object-centered) spatial relations between objects, as well as on objects' intrinsic relationships. Egocentric relations are handled well, but allocentric prompts and intrinsic relationships remain challenging because models default to egocentric reasoning (see the sketch after this list).
- Spatial Measurement: Precise spatial measurements such as object size, distance between objects, and camera distance are evaluated. Models exhibit considerable difficulty in generating scenes with specific quantitative measurements, underscoring a need for improved metric spatial reasoning.
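The egocentric/allocentric split is easy to state geometrically. The sketch below, a simplified illustration rather than the paper's evaluation code, checks "left of" in both frames of reference, assuming object positions and a facing direction expressed in OpenCV-style camera coordinates (x right, y down, z forward); `ref_facing` would come from a hypothetical orientation estimator.

```python
import numpy as np

# Camera frame follows the OpenCV convention: x right, y down, z forward
# (right-handed), so "up" in camera coordinates is (0, -1, 0).
CAM_UP = np.array([0.0, -1.0, 0.0])

def egocentric_left_of(p_a, p_b):
    """Viewer-centered relation: A appears left of B (camera x grows rightward)."""
    return p_a[0] < p_b[0]

def allocentric_left_of(p_a, p_ref, ref_facing, up=CAM_UP):
    """Object-centered relation: A lies on the reference object's own left side.

    ref_facing is the reference object's forward direction in camera
    coordinates; in a right-handed frame the object's left axis is
    up x forward.
    """
    left_axis = np.cross(up, ref_facing)
    return float(np.dot(p_a - p_ref, left_axis)) > 0.0

# A ball 1 m to the camera's right of a chair that faces the camera:
ball = np.array([1.0, 0.0, 3.0])
chair = np.array([0.0, 0.0, 3.0])
chair_facing = np.array([0.0, 0.0, -1.0])              # toward the camera
print(egocentric_left_of(ball, chair))                 # False: right of, for the viewer
print(allocentric_left_of(ball, chair, chair_facing))  # True: on the chair's own left
```

The example shows why the two readings diverge: a ball to the viewer's right of a camera-facing chair sits on the chair's own left, which is exactly the frame-of-reference transformation that models reportedly fail to make.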
Key Findings
The results reveal substantial limitations in spatial reasoning across state-of-the-art image generation models. Specifically, models exhibit weaknesses in:
- Understanding and distinguishing camera poses.
- Transforming egocentric to allocentric perspectives.
- Adhering to precise spatial measurements.
Despite rapid advances in visual content generation, these deficiencies point to the need for more sophisticated spatial reasoning in generative models. Unified generative models such as GPT-4o perform better overall but still struggle with specific spatial tasks.
Implications and Future Work
This paper advances the understanding of spatial cognition in AI models, offering insight into the underlying challenges and the headroom for improvement in spatial intelligence. The proposed benchmark and evaluation framework lay a foundation for future research on stronger spatial control and reasoning in image generation. Future work may integrate newer visual foundation models as they mature and refine the evaluation pipeline for greater robustness and closer alignment with human spatial perception.