- The paper demonstrates that ImmerseGen efficiently creates immersive 3D scenes using VLM-guided agents and alpha-textured proxies.
- It replaces complex high-poly models with lightweight geometries and synthesized high-resolution textures, achieving 79+ FPS on VR devices.
- Experimental results and user studies confirm superior visual quality and realism compared to traditional, computationally expensive methods.
ImmerseGen (arXiv 2506.14315) is a novel framework for automatically generating immersive 3D worlds from text prompts, designed for efficient real-time rendering on platforms such as mobile VR headsets. Unlike traditional methods that rely on complex high-poly 3D models or massive point clouds/Gaussians, which are computationally expensive to render in real-time VR, ImmerseGen represents scenes with lightweight geometric proxies and obtains photorealistic appearance by synthesizing high-resolution RGBA textures onto those proxies.
The core idea is to bypass the traditional workflow of creating detailed geometry first and then simplifying it. Instead, ImmerseGen directly creates simplified meshes (terrain, billboards, low-poly templates) and applies complex, context-aware textures that include alpha channels for detailed shapes. This allows for compact scene representations without sacrificing visual quality, as the detail is encoded in the texture rather than the mesh.
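As a rough mental model of this representation, a scene reduces to a simplified terrain mesh, a sky panorama, and a set of alpha-textured proxies. The sketch below is purely illustrative; the class and field names are hypothetical, not the paper's actual data model.

```python
from dataclasses import dataclass, field
import numpy as np


@dataclass
class AlphaTexture:
    """RGBA texture; the alpha channel carries the silhouette detail (leaves, grass, branches)."""
    rgba: np.ndarray  # (H, W, 4) uint8


@dataclass
class ProxyAsset:
    """A lightweight proxy: a billboard quad (midground) or a few alpha-textured
    cards fitted over a retrieved low-poly template mesh (foreground)."""
    kind: str                 # "billboard" or "template_cards"
    vertices: np.ndarray      # (N, 3) -- only a handful of vertices per proxy
    uvs: np.ndarray           # (N, 2)
    texture: AlphaTexture
    position: np.ndarray      # (3,) placement on the terrain


@dataclass
class ImmersiveScene:
    terrain_vertices: np.ndarray   # simplified terrain mesh
    terrain_faces: np.ndarray
    terrain_texture: AlphaTexture  # panoramic RGBA terrain texture
    sky_panorama: np.ndarray       # (H, W, 3) equirectangular sky image
    assets: list[ProxyAsset] = field(default_factory=list)
```

The key point is that each proxy carries only a handful of vertices; all fine detail lives in the RGBA texture.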
The scene generation process is guided by vision-language model (VLM) agents and structured hierarchically:
- Base World Generation:
- A base terrain mesh is retrieved from a pre-generated library based on the user's text prompt.
- A terrain-conditioned texturing scheme synthesizes panoramic sky and RGBA terrain textures. This uses a panoramic diffusion model fine-tuned on equirectangular projection (ERP) panoramas, extended with a depth-conditioned ControlNet that takes as input a panoramic depth map rendered from the terrain mesh.
- A key technique for robust texture generation is geometric adaptation, which remaps the rendered metric depth so that it better matches the domain of the estimated depth used during training. This is done by retrieving a similar training depth map and fitting a polynomial remapping function (a minimal sketch follows this list).
- For efficient rendering on the terrain mesh, a user-centric panoramic UV mapping is precomputed. It maps mesh vertices to the panoramic texture so that resolution is concentrated near the central viewpoint, with seam handling for seamless wrapping (see the UV-mapping sketch after the list).
- Foreground areas near the user's viewpoint are further enhanced with a bottom map refinement scheme: the texture is repainted from a top-down view using image-to-image diffusion (ControlNet Tile), and geometric detail is added via a displacement map derived from estimated depth.
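The geometric adaptation step can be pictured as fitting a polynomial that maps the rendered metric depth into the value range of the estimated depth the texture model was trained on. The sketch below is a simplification: the polynomial degree and the use of sorted depth values as a crude correspondence are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np


def fit_depth_remapping(rendered_depth: np.ndarray,
                        retrieved_depth: np.ndarray,
                        degree: int = 3) -> np.poly1d:
    """Fit a polynomial d' = p(d) that remaps rendered metric depth into the
    domain of the estimated depth used at training time.

    Correspondence between the two maps is approximated by matching sorted
    depth values (a histogram-matching style assumption).
    """
    r = np.sort(rendered_depth.ravel())
    t = np.sort(retrieved_depth.ravel())
    # Resample both sorted profiles to a common number of samples before fitting.
    n = 2048
    r = np.interp(np.linspace(0, 1, n), np.linspace(0, 1, r.size), r)
    t = np.interp(np.linspace(0, 1, n), np.linspace(0, 1, t.size), t)
    coeffs = np.polyfit(r, t, deg=degree)
    return np.poly1d(coeffs)


# Usage: remap the terrain's rendered depth before feeding it to the ControlNet.
# remap = fit_depth_remapping(rendered_depth, retrieved_depth)
# conditioned_depth = remap(rendered_depth)
```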
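A minimal version of the user-centric panoramic UV mapping might look as follows: each terrain vertex is projected into equirectangular coordinates around the central viewpoint, so nearby geometry naturally occupies more texels. The axis conventions and the seam-handling note are assumptions for illustration.

```python
import numpy as np


def panoramic_uv(vertices: np.ndarray, viewpoint: np.ndarray) -> np.ndarray:
    """Map mesh vertices to equirectangular (ERP) UVs centred on the viewer.

    u follows azimuth, v follows elevation (y is assumed to be up). Because the
    projection is centred on the user, nearby terrain covers more of the texture
    and therefore receives higher effective resolution.
    """
    d = vertices - viewpoint[None, :]
    x, y, z = d[:, 0], d[:, 1], d[:, 2]
    r = np.linalg.norm(d, axis=1) + 1e-8
    azimuth = np.arctan2(x, z)                          # [-pi, pi]
    elevation = np.arcsin(np.clip(y / r, -1.0, 1.0))    # [-pi/2, pi/2]
    u = azimuth / (2.0 * np.pi) + 0.5
    v = 0.5 - elevation / np.pi
    return np.stack([u, v], axis=1)


# Triangles that straddle the u = 0/1 seam must be split or have their vertices
# duplicated so the texture wraps without visible discontinuities.
```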
- Agent-Guided Asset Generation:
- To populate the scene, lightweight proxy assets (like vegetation) are added. These are defined by distance from the user: midground assets use planar billboard textures, while foreground assets use alpha-textured cards placed over retrieved low-poly 3D template meshes.
- VLM-based agents guide this process. An asset selector analyzes the scene and prompt to retrieve suitable asset templates. An asset designer crafts detailed text prompts for texture synthesis. An asset arranger determines placement.
- To improve spatial reasoning for placement, a semantic grid-based analysis is used. The VLM agent is shown the base world image overlaid with a labeled grid, with unsuitable regions (e.g., water) masked out. The agent selects grid cells in a coarse-to-fine manner, and final 3D positions are obtained by raycasting from the selected 2D image points onto the terrain (a minimal raycasting sketch follows this list).
- Context-aware RGBA texture synthesis generates a unique texture for each placed asset. A cascaded diffusion model first generates an alpha mask from a scenery prompt; this mask is used to alpha-blend the base-world background texture onto an empty canvas, which then serves as the context reference for generating the asset's RGBA texture (see the compositing sketch below). A refinement module further sharpens the alpha channel.
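For the placement step, converting a selected grid cell into a 3D position can be done with a simple heightfield ray march from the camera. The camera-ray function, heightfield layout, and step size below are illustrative assumptions, not ImmerseGen's actual implementation.

```python
import numpy as np


def raycast_to_terrain(pixel_uv, cam_pos, cam_rays, heightfield, world_extent,
                       max_dist=500.0, step=0.5):
    """March a camera ray until it dips below the terrain heightfield.

    pixel_uv     : (u, v) of the selected grid-cell centre in image space
    cam_rays     : callable mapping (u, v) -> unit ray direction in world space
    heightfield  : 2D array of terrain heights
    world_extent : (xmin, xmax, zmin, zmax) covered by the heightfield
    """
    direction = cam_rays(*pixel_uv)
    xmin, xmax, zmin, zmax = world_extent
    h, w = heightfield.shape
    t = 0.0
    while t < max_dist:
        p = cam_pos + t * direction
        # Convert world xz to heightfield indices.
        ix = int((p[0] - xmin) / (xmax - xmin) * (w - 1))
        iz = int((p[2] - zmin) / (zmax - zmin) * (h - 1))
        if 0 <= ix < w and 0 <= iz < h and p[1] <= heightfield[iz, ix]:
            return p  # anchor point for the asset on the terrain
        t += step
    return None  # the ray never hit the terrain within max_dist
```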
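The context canvas used for RGBA texture synthesis is plain alpha compositing. One plausible reading of the step (array shapes, normalisation, and the "empty" canvas colour are assumptions) is sketched below.

```python
import numpy as np


def build_context_canvas(background_rgb: np.ndarray,
                         alpha_mask: np.ndarray) -> np.ndarray:
    """Blend the base-world background onto an empty canvas using the generated
    alpha mask, producing the context reference image that conditions RGBA
    texture synthesis for the asset.

    background_rgb : (H, W, 3) float in [0, 1], crop of the base-world texture
    alpha_mask     : (H, W) float in [0, 1], from the cascaded diffusion model
    """
    canvas = np.ones_like(background_rgb)  # "empty" canvas (white, an assumption)
    a = alpha_mask[..., None]
    # Assumed convention: the background fills the region outside the asset's
    # silhouette, leaving the masked region for the model to complete.
    return (1.0 - a) * background_rgb + a * canvas
```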
- Multi-Modal Immersion Enhancement:
- VLM agents analyze the scene to add dynamic shader-based effects (e.g., flowing water, clouds, rain) implemented using customizable parameters and procedural textures.
- Ambient sound is synthesized by analyzing the rendered panorama and retrieving suitable soundtracks (birds, wind, water) from a library. These tracks are mixed with crossfading for seamless looping.
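Making a retrieved ambient track loop seamlessly is a standard crossfade between its tail and head; the numpy sketch below (mono audio and a fixed fade length are assumptions) shows the idea.

```python
import numpy as np


def make_seamless_loop(audio: np.ndarray, sr: int, fade_s: float = 2.0) -> np.ndarray:
    """Return a version of `audio` whose end crossfades into its beginning,
    so the clip can be looped without an audible seam.

    audio  : 1-D float array of mono samples (assumed longer than twice the fade)
    sr     : sample rate in Hz
    fade_s : crossfade duration in seconds
    """
    n = int(fade_s * sr)
    head, body, tail = audio[:n], audio[n:-n], audio[-n:]
    fade_in = np.linspace(0.0, 1.0, n)
    # The tail fades out while the head fades in; the blend replaces both segments.
    blended = tail * (1.0 - fade_in) + head * fade_in
    return np.concatenate([blended, body])
```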
For practical implementation, ImmerseGen uses Blender [blender] as the core framework. The VLM agents are powered by GPT-4o. The diffusion models are based on Stable Diffusion XL [podellsdxl] and ControlNet [zhang2023adding], fine-tuned on collected panorama datasets. High-resolution 8K textures are produced with a tile-based generation approach. Matting (using ViTMatte [yao2024vitmatte] with a tile-based strategy) and sky outpainting (using PowerPaint [powerpaint]) handle texture separation. Rendering is optimized for VR by baking lighting into panoramic shadow maps and exporting scenes with unlit materials to game engines such as Unity. The pipeline runs on a single NVIDIA RTX 4090: base world generation takes around 3 minutes, asset placement about 10 seconds per asset plus roughly 1 minute for layout analysis, immersion enhancements finish within 1 minute, and final export adds another 1-2 minutes.
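Tile-based generation of very large textures typically refines overlapping tiles independently and blends them back with tapered weights. The sketch below shows only that tiling/blending arithmetic, with a placeholder `refine_tile` callable standing in for the actual diffusion call; the tile size and overlap are assumptions, not the paper's reported settings.

```python
import numpy as np


def refine_in_tiles(image: np.ndarray, refine_tile, tile: int = 1024,
                    overlap: int = 128) -> np.ndarray:
    """Run `refine_tile` (e.g. a diffusion img2img call) on overlapping tiles
    and blend the results with linear ramps to hide tile borders.

    image       : (H, W, C) float array
    refine_tile : callable (tile_image) -> refined tile of the same shape
    """
    h, w, _ = image.shape
    out = np.zeros_like(image)
    weight = np.zeros((h, w, 1), dtype=image.dtype)
    step = tile - overlap
    for y in range(0, h, step):
        for x in range(0, w, step):
            y1, x1 = min(y + tile, h), min(x + tile, w)
            patch = refine_tile(image[y:y1, x:x1])
            # Blending weights taper linearly towards the tile edges.
            wy = np.minimum(np.arange(y1 - y) + 1, np.arange(y1 - y)[::-1] + 1)
            wx = np.minimum(np.arange(x1 - x) + 1, np.arange(x1 - x)[::-1] + 1)
            wgt = np.minimum(wy[:, None], wx[None, :])[..., None].astype(image.dtype)
            out[y:y1, x:x1] += patch * wgt
            weight[y:y1, x:x1] += wgt
    return out / np.maximum(weight, 1e-8)
```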
Experimental results show that ImmerseGen outperforms baselines such as Infinigen [infinigen2023infinite], DreamScene360 [dreamscene360], WonderWorld [wonderworld], and LayerPano3D [yang2024layerpano3d] in visual quality, realism, and rendering efficiency on VR devices (Snapdragon XR2 Gen 2). It averages 79+ FPS on VR versus 7-14 FPS for Gaussian-based methods, while using far fewer primitives (223k vs. millions). User studies confirm a preference for ImmerseGen's visual quality and realism. Ablation studies demonstrate the importance of geometric adaptation for terrain texturing and the efficacy of the semantic grid-based analysis for asset placement.
Limitations include a focus mainly on outdoor scenes, a limited exploration range due to fixed generation levels, and a reliance on pre-built templates for foreground geometry, which could be improved by integrating procedural generation techniques.