
SynCity: Training-Free Generation of 3D Worlds (2503.16420v1)

Published 20 Mar 2025 in cs.CV

Abstract: We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.

Summary

  • The paper introduces SynCity, a training-free framework generating large-scale 3D worlds from text by combining pre-trained 3D generative models with 2D image generators.
  • A key innovation is the tile-based auto-regressive generation process, which discretizes space and uses prompt engineering for localized control and global coherence.
  • Geometric validation, 3D Gaussian refinement, and blending techniques are applied post-generation to ensure spatial consistency and seamless integration of tiles.

Overview

The "SynCity: Training-Free Generation of 3D Worlds" framework presents a novel paradigm for constructing large-scale, navigable 3D environments directly from textual descriptions. The approach strategically exploits the geometric rigor of pre-trained 3D generative models in conjunction with the visual versatility of 2D image generators, circumventing the customary need for fine-tuning and optimization. Rather than training a domain-specific network, the method leverages existing models (e.g., TRELLIS for 3D reconstruction and Flux for 2D image synthesis) through sophisticated prompt engineering and a robust tile-based synthesis pipeline. This architecture facilitates precise, coherent multi-tile generation and seamless integration into comprehensive 3D scenes.

Pre-trained Models and Architectural Integration

SynCity employs two distinct classes of generative models:

  • 3D Generative Models (TRELLIS): These models are utilized beyond typical object-centric reconstruction. TRELLIS provides geometric fidelity and inherent shape regularity, delivering high-quality, localized 3D representations. Here, each generated tile records accurate depth and structure, ensuring consistency with neighboring tiles.
  • 2D Image Generators (Flux): The system exploits the expressive quality of pre-trained 2D generators. Flux is conditioned on textual prompts to produce rich, isometric views. The inherent artistic capability of Flux is integrated to complement the 3D geometric scaffolding provided by TRELLIS.

This dual-model paradigm, implemented without additional training, represents a significant reduction in computational overhead and data dependence relative to traditional 3D scene synthesis methods.
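The dual-model step can be summarized as a two-stage hand-off: the 2D generator turns a tile prompt into an isometric view, and the 3D model lifts that view into geometry. The sketch below illustrates this flow with placeholder callables standing in for the pre-trained Flux and TRELLIS models; the `Tile` container and all interfaces here are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Tile:
    """A single square tile of the world (hypothetical container)."""
    prompt: str
    image: object = None   # isometric 2D render from the image generator
    mesh: object = None    # 3D reconstruction from the 3D generator

def generate_tile(prompt, image_gen, recon_3d):
    """Sketch of the dual-model step: 2D synthesis, then 3D lifting.

    `image_gen` and `recon_3d` stand in for pre-trained Flux and
    TRELLIS models; their call signatures are assumptions.
    """
    tile = Tile(prompt=prompt)
    tile.image = image_gen(prompt)    # rich isometric view from text
    tile.mesh = recon_3d(tile.image)  # geometry with depth and structure
    return tile

# Usage with dummy stand-ins for the pre-trained models:
tile = generate_tile("a cobblestone plaza",
                     image_gen=lambda p: f"image({p})",
                     recon_3d=lambda im: f"mesh({im})")
```

Because neither model is fine-tuned, swapping in a different 2D or 3D backbone only requires changing these two callables.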

Tile-Based Generation and Prompt Engineering

A central innovation in SynCity is its tile-based approach:

  • Grid Decomposition: The 3D space is discretized into a grid of square tiles. Each tile encapsulates a localized section of the larger world, enabling both independent and context-aware generation.
  • Auto-regressive Generation: Tiles are synthesized auto-regressively; an LLM decomposes the textual description into fine-grained, tile-specific prompts. These prompts are generated alongside a high-level, world-wide style prompt, ensuring global coherence.
  • Isometric Framing & Inpainting: For the 2D generator, a novel isometric conditioning strategy is adopted. Prior to generating a tile, the system conditions Flux with a base image template—a grey slab rendered in isometric view—and an inpainting mask. This approach guarantees that each tile, when generated, maintains orientation consistency relative to its neighbors.
  • Contextual Awareness: Each new tile is generated within the context of its adjacent, previously generated tiles. This contextual conditioning is crucial for stitching and spatial alignment, reinforcing global scene coherence.

The tile-specific prompt engineering not only reduces the complexity inherent in high-dimensional scene generation but also allows fine-grained control over local textures while preserving overall thematic and stylistic consistency.
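The auto-regressive loop described above can be sketched as a raster-order traversal of the grid, where each tile's generation is conditioned on its already-generated neighbors and a shared style prompt. The `gen_tile` callable below is a placeholder for the full Flux-plus-TRELLIS tile pipeline, and the raster traversal order is an assumption; the paper's summary does not specify the exact generation order.

```python
def generate_world(grid_size, style_prompt, tile_prompts, gen_tile):
    """Auto-regressive, tile-by-tile world generation (sketch).

    `gen_tile(prompt, neighbors)` stands in for the full 2D+3D tile
    pipeline; `tile_prompts` is a grid of LLM-derived local prompts.
    """
    world = {}
    for row in range(grid_size):
        for col in range(grid_size):
            # Condition on previously generated neighbors (left, above),
            # which is what enables stitching and spatial alignment.
            candidates = {"left": (row, col - 1), "above": (row - 1, col)}
            neighbors = {d: world[p] for d, p in candidates.items()
                         if p in world}
            # Combine the global style prompt with the local tile prompt.
            prompt = f"{style_prompt}; {tile_prompts[row][col]}"
            world[(row, col)] = gen_tile(prompt, neighbors)
    return world
```

Because each call sees only its local prompt plus a bounded neighborhood, the loop keeps per-tile conditioning cheap while the shared style prompt carries the global theme.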

Geometric Validation, 3D Reconstruction, and Blending Techniques

In order to ensure high-quality synthesis and seamless tile integration, SynCity introduces several post-processing and validation steps:

  • 3D Geometric Validation: Each rendered tile undergoes a geometric validation process. Heuristics validate critical properties such as correct dimensionality, proper alignment of geometric primitives, and adherence to the expected scale. This ensures that only tiles meeting stringent geometric criteria contribute to the final scene.
  • Gaussian Representation Refinement: Post-processing involves refining the 3D Gaussian representations. This step entails cropping out any hallucinated base regions, rescaling tile outputs to a canonical unit size, and reorienting them to match the 2D prompt's perspective.
  • 3D Blending and Boundary Inpainting: The final integrated 3D world leverages a 3D blending technique to merge individual tiles. Boundary regions are re-synthesized in the latent space using 2D inpainting strategies that reconcile discrepancies between adjacent tile features, yielding a homogeneous and visually consistent scene.

These techniques, particularly the 3D blending and geometric post-processing, are critical to mitigating artifacts typically associated with tile-based generation, such as seam misalignments or discontinuities across tile boundaries.
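A minimal sketch of the validation and refinement heuristics, operating on a tile's 3D point samples: crop geometry hallucinated below the base plane, reject tiles whose horizontal extent deviates from the expected scale, and rescale survivors to a canonical unit size. The function name, thresholds, and point-cloud representation are illustrative assumptions; the paper's actual heuristics are not specified in this summary.

```python
import numpy as np

def validate_and_normalize(points, expected_extent=1.0, tol=0.1, base_z=0.0):
    """Sketch of the geometric post-processing heuristics.

    `points` is an (N, 3) array of a tile's 3D samples; the thresholds
    are illustrative assumptions.
    """
    # Crop hallucinated geometry below the tile's base plane.
    points = points[points[:, 2] >= base_z]
    # Validate horizontal extent against the expected tile scale.
    extent = points[:, :2].max(axis=0) - points[:, :2].min(axis=0)
    if np.any(np.abs(extent - expected_extent) > tol * expected_extent):
        return None  # reject the tile; the caller regenerates it
    # Rescale to a canonical unit size.
    scale = expected_extent / extent.max()
    return points * scale
```

Rejected tiles (`None`) would simply be regenerated, so only geometry meeting the scale criteria ever reaches the blending stage.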

Quantitative and Qualitative Outcomes

Though the work emphasizes a training-free methodology, SynCity demonstrates compelling results via quantitative and qualitative analyses:

  • Scalability: The tile-based synthesis method is inherently scalable. Because each tile generation is localized, per-tile cost stays roughly constant, so increasing the grid resolution grows the total cost only linearly with the number of tiles, enabling arbitrarily large worlds.
  • Coherence and Immersion: Quantitative metrics such as Intersection-over-Union (IoU) for tile overlaps and perceptual similarity indices illustrate the high degree of spatial coherence and realism achievable by the system. Qualitatively, the framework produces immersive navigable 3D environments that maintain both global thematic consistency and rich local detail.
  • Artifact Reduction: Detailed analyses reveal that the isometric inpainting approach significantly reduces common issues found in previous methods, such as misaligned geometric aberrations and texture discontinuities across tile boundaries.
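For the tile-overlap coherence metric mentioned above, a standard Intersection-over-Union on axis-aligned 2D footprints is the natural building block. The sketch below is a generic IoU computation, not the paper's evaluation code; the exact protocol (what is intersected, and in which space) is not specified in this summary.

```python
def overlap_iou(box_a, box_b):
    """IoU of two axis-aligned 2D boxes (x_min, y_min, x_max, y_max),
    a simple proxy for measuring how well adjacent tile footprints agree.
    """
    # Width and height of the intersection rectangle (clamped at zero).
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```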

Conclusion

SynCity represents a robust framework for training-free 3D world synthesis by effectively integrating pre-trained 3D generative models and 2D image generation techniques. Its tile-based approach, enhanced by prompt engineering and rigorous post-processing, enables the generation of expansive and coherent 3D environments from textual prompts. This methodology significantly reduces the reliance on large-scale training datasets while maintaining high geometric fidelity and visual quality. The architectural design choices, including auto-regressive tile generation and 3D boundary blending, offer a practical solution for applications in immersive simulation, virtual reality, and large-scale 3D content creation.
