- The paper introduces a multi-module framework that fuses LLM-driven scene planning with VLM-based spatial reasoning for coherent text-to-3D scene generation and editing.
- It employs a physics-inspired hierarchical optimization mechanism that iteratively refines layouts using force-directed updates to enforce both physical and semantic constraints.
- Quantitative and qualitative evaluations demonstrate state-of-the-art performance on semantic plausibility, collision minimization, and real-time interactive editing.
HOG-Layout: Hierarchical 3D Scene Generation, Optimization, and Editing via Vision-LLMs
Introduction
The HOG-Layout framework presented in "HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-LLMs" (2604.10772) offers a comprehensive solution for text-driven 3D indoor scene generation and real-time editing. The system combines LLMs with vision-LLMs (VLMs) under a retrieval-augmented generation (RAG) paradigm, together with a physics-inspired hierarchical layout optimization mechanism. The major contributions are: (i) a multi-layered scene generation and editing pipeline that leverages LLMs for high-level planning and VLMs for spatial reasoning, (ii) a hierarchical optimization scheme for enforcing both physical and semantic plausibility, and (iii) robust, low-latency editing capabilities guided by natural language.
HOG-Layout System Overview
The HOG-Layout pipeline consists of four principal modules: scene planning, layout generation, hierarchical optimization, and scene editing.
Scene Planning employs an LLM (with RAG) to transform free-form text instructions into structured scene plans, including object lists, room specifications, and functional group segmentation. Relevant design rules are dynamically retrieved from a template rule library (vectorized via Qwen3-Embedding-4B and managed in FAISS), ensuring context-aware layout consistency without overfitting to dataset idiosyncrasies.
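The retrieval step can be sketched as nearest-neighbor search over embedded design rules. This minimal example stands in a plain NumPy inner-product search for the FAISS index and deterministic random unit vectors for Qwen3-Embedding-4B outputs (both hypothetical stand-ins); with real embeddings the top-k rules would be semantically relevant rather than arbitrary:

```python
import numpy as np

def embed(texts, dim=64, seed=0):
    """Stand-in for Qwen3-Embedding-4B: deterministic random unit vectors."""
    rng = np.random.default_rng(seed)
    table = {t: rng.standard_normal(dim) for t in dict.fromkeys(texts)}
    return np.stack([table[t] / np.linalg.norm(table[t]) for t in texts])

def retrieve_rules(query, rule_library, k=2):
    """Return the k rules most similar to the query (inner product on unit vectors)."""
    scores = embed(rule_library) @ embed([query])[0]
    top = np.argsort(-scores)[:k]
    return [rule_library[i] for i in top]

rules = [
    "beds face away from the door",
    "nightstands flank the bed",
    "sofas face the television",
    "desks sit against a wall near a window",
]
top_rules = retrieve_rules("arrange a cozy bedroom", rules, k=2)
```

In the actual system the vectorized rules would live in a FAISS index queried the same way, so the retrieval cost stays sublinear in the rule library size.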
Layout Generation utilizes a VLM that takes the structured plan, relevant constraints, and a top-down visual scene representation with grid and coordinates to predict object placement and relationships. Objects are sequentially placed by functional groups—each group's layout is conditioned on the optimized scene generated so far, supporting complex multi-zone environments.
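The group-sequential placement loop can be sketched as a fold over functional groups, where each group is laid out conditioned on the scene optimized so far. The `place_group` and `optimize` functions here are trivial placeholders for the paper's VLM placement call and hierarchical optimizer:

```python
def place_group(group, scene):
    """Stand-in for the VLM call: place each new object at the next grid slot."""
    start = len(scene)
    return scene + [{"name": obj, "pos": (start + i, 0)} for i, obj in enumerate(group)]

def optimize(scene):
    """Stand-in for hierarchical optimization: return the layout unchanged."""
    return scene

def generate_scene(functional_groups):
    scene = []
    for group in functional_groups:                # e.g. sleeping zone, then work zone
        scene = optimize(place_group(group, scene))  # condition on the scene so far
    return scene

scene = generate_scene([["bed", "nightstand"], ["desk", "chair"]])
```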
Hierarchical Optimization iteratively refines the layout by mapping constraints (physical: collision, boundary, support; semantic: adjacency, orientation, alignment) to continuous force models. By decomposing the scene into hierarchical parent-child and same-level relationships, the optimizer performs multi-resolution force-directed updates, balancing computational efficiency and stability. Physics constraints are abstracted as forces with explicit weights, tuned through hyperparameter search (Optuna, TPE, Hyperband) for convergence rates and robustness.
Scene Editing is entirely text-driven. Free-form instructions are parsed by the LLM into atomic operations (add, move, delete, plan). The editing system supports precise object selection, placement, and attribute modification, integrating visual context and optimized constraints for immediate scene updates. The pipeline runs at interactive speeds, with move and delete operations typically completing in under 20 seconds even in complex scenes.
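Parsing into atomic operations can be sketched as validating and mapping the LLM's structured output onto a small operation type. The JSON schema, field names, and example response below are illustrative, not the paper's actual interface:

```python
import json
from dataclasses import dataclass

@dataclass
class EditOp:
    action: str   # one of: add, move, delete, plan
    target: str   # object or group the operation applies to
    params: dict  # e.g. anchor object, spatial relation, new position

def parse_edit(llm_output: str) -> list[EditOp]:
    """Convert a hypothetical JSON edit plan from the LLM into atomic operations."""
    ops = []
    for item in json.loads(llm_output):
        if item["action"] not in {"add", "move", "delete", "plan"}:
            raise ValueError(f"unknown action: {item['action']}")
        ops.append(EditOp(item["action"], item.get("target", ""), item.get("params", {})))
    return ops

# Hypothetical LLM response to "move the lamp next to the sofa"
response = '[{"action": "move", "target": "lamp", "params": {"anchor": "sofa", "relation": "beside"}}]'
ops = parse_edit(response)
```

Keeping the operation set small and explicitly validated is what lets the editor reject malformed LLM output before it touches the scene state.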
Hierarchical Optimization and Physics-based Reasoning
HOG-Layout's optimizer abstracts high-level semantics and low-level physical interactions as unified force models. Spatial relationships are encoded as planar and vertical forces plus rotational torques, which are accumulated and applied via explicit Euler steps. Notably, the optimizer introduces robust deadlock detection and resolution—if objects are stuck due to force equilibria (local minima), additional orthogonal or scaling perturbations are introduced to enhance convergence.
The force decomposition includes:
- Horizontal plane: collision (same-level), adjacency, against-wall, boundary, support
- Vertical axis: multi-level collision, vertical boundary, support
- Rotation: alignment, pointing, orientation
All constraint types are assigned tuned weights; the scene state converges when the total residual force falls below a threshold or the maximum iteration count is reached. This hierarchical multi-resolution approach outperforms classic gradient-based optimizers in both speed and physical plausibility.
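The force-directed update described above can be sketched as weighted force accumulation with explicit Euler steps, a residual-force stopping criterion, and a small random perturbation when progress stalls. The two force terms, weights, and thresholds below are illustrative 2D stand-ins, not the paper's full constraint set or tuned values:

```python
import numpy as np

WEIGHTS = {"collision": 1.0, "boundary": 0.5}  # illustrative, not tuned values

def collision_force(p, others, min_dist=1.0):
    """Repulsion pushing overlapping same-level objects apart."""
    f = np.zeros(2)
    for q in others:
        d = p - q
        dist = np.linalg.norm(d)
        if 1e-9 < dist < min_dist:
            f += (d / dist) * (min_dist - dist)
    return f

def boundary_force(p, half_extent=4.0):
    """Pull objects that leave a square room back toward its interior."""
    overshoot = np.maximum(np.abs(p) - half_extent, 0.0)
    return -np.sign(p) * overshoot

def optimize_layout(positions, step=0.3, tol=1e-3, max_iters=500, seed=0):
    rng = np.random.default_rng(seed)
    pts = [np.asarray(p, dtype=float) for p in positions]
    prev_residual = np.inf
    for _ in range(max_iters):
        forces = []
        for i, p in enumerate(pts):
            f = (WEIGHTS["collision"] * collision_force(p, pts[:i] + pts[i + 1:])
                 + WEIGHTS["boundary"] * boundary_force(p))
            forces.append(f)
        residual = sum(np.linalg.norm(f) for f in forces)
        if residual < tol:                         # converged: residual below threshold
            break
        if abs(prev_residual - residual) < 1e-6:   # deadlock: perturb to escape equilibrium
            forces = [f + rng.standard_normal(2) * 0.05 for f in forces]
        prev_residual = residual
        pts = [p + step * f for p, f in zip(pts, forces)]  # explicit Euler step
    return pts

# Two overlapping objects near the origin, one object outside the room
final = optimize_layout([(0.0, 0.0), (0.2, 0.0), (5.0, 5.0)])
```

After convergence the overlapping pair is pushed to at least the minimum separation and the out-of-bounds object is pulled back inside the room, mirroring the collision and boundary constraints of the full system.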
Modular Asset Retrieval and Generation
For object instantiation, a multimodal retrieval pipeline ranks candidates by semantic similarity (SBERT), visual similarity (OpenCLIP multi-view scoring), and geometric compatibility. The retrieved asset is aligned to the planned location, with size-normalized measurements ensuring physical plausibility. The framework is agnostic to the source of 3D assets: it supports both retrieval from large curated databases (Objaverse, Holodeck) and on-the-fly generation via text-to-image-to-3D (e.g., DALL-E + Hunyuan 3D). While generative pipelines offer greater novelty, retrieval supports real-time interaction.
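Candidate ranking can be sketched as a weighted combination of the three scores. The precomputed `semantic` and `visual` values here are placeholders for SBERT and multi-view OpenCLIP similarities, and the weights are illustrative, not the system's actual settings:

```python
def geometric_score(target_size, asset_size):
    """Size compatibility: per-axis min/max ratio, averaged over the three axes."""
    return sum(min(t, a) / max(t, a) for t, a in zip(target_size, asset_size)) / 3

def rank_assets(target_size, candidates, weights=(0.4, 0.4, 0.2)):
    """Rank candidate assets by weighted semantic + visual + geometric score."""
    w_sem, w_vis, w_geo = weights
    def score(c):
        return (w_sem * c["semantic"] + w_vis * c["visual"]
                + w_geo * geometric_score(target_size, c["size"]))
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"name": "sofa_a", "semantic": 0.9, "visual": 0.7, "size": (2.0, 0.9, 0.8)},
    {"name": "sofa_b", "semantic": 0.8, "visual": 0.9, "size": (3.5, 1.2, 1.0)},
]
best = rank_assets((2.1, 0.9, 0.8), candidates)[0]  # sofa_a: closer size, higher semantic fit
```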
Quantitative and Qualitative Evaluation
Benchmarks and Metrics
HOG-Layout is evaluated on the SceneEval benchmark, which offers fine-grained fidelity and plausibility metrics, including object count (CNT), attribute (ATR), scene-object and object-object relationship scores (OAR, OOR), collision, support, boundary, navigability, semantic plausibility (SP, via GPT-5), and CLIP-based alignment. The editing module is tested with a variety of complex instructions across simple and compound scenes.
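A physical-plausibility metric like the collision rate can be computed from axis-aligned bounding boxes; this is a minimal sketch of the idea, not SceneEval's actual implementation:

```python
def aabb_overlap(a, b):
    """True if two axis-aligned boxes ((min_xyz), (max_xyz)) intersect with positive volume."""
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))

def collision_rate(boxes):
    """Fraction of objects intersecting at least one other object."""
    n = len(boxes)
    colliding = {i for i in range(n) for j in range(n)
                 if i != j and aabb_overlap(boxes[i], boxes[j])}
    return len(colliding) / n if n else 0.0

boxes = [((0, 0, 0), (1, 1, 1)),
         ((0.5, 0, 0), (1.5, 1, 1)),  # overlaps the first box
         ((3, 0, 0), (4, 1, 1))]
rate = collision_rate(boxes)          # 2 of 3 objects collide
```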
Results
HOG-Layout outperforms LayoutGPT, Holodeck, and LayoutVLM across virtually all semantic and physical metrics. It achieves the highest SP (69.7) and CLIPsim (18.6) scores, minimal collision (5.3%) and OOB rates (2.5%), and offers the fastest editing among physically plausible pipelines. It preserves scene fidelity and diversity while guaranteeing low-level geometric constraints, a bottleneck for prior methods that focused either solely on physical feasibility (Holodeck) or on semantic alignment (LayoutGPT) without integrating the two.
Human evaluation aligns with automated scoring, with HOG-Layout rated highest for both plausibility and instruction fidelity.
Editing operation success is robust: add and delete reach 100% success (zero new collisions/out-of-bounds), while move achieves 80% (with occasional failures attributed to ambiguous linguistic instructions). Integration of semantic retrieval and hierarchical optimization modules is critical to these outcomes, as demonstrated by ablations.
Theoretical and Practical Implications
HOG-Layout demonstrates that explicit hierarchical modeling and constraint decomposition are essential for scalable, generalizable scene reasoning under open-vocabulary, multimodal instructions. Hybrid approaches that fuse LLM-driven planning, VLM spatial reasoning, data-agnostic semantic retrieval, and physics-inspired optimization provide strong physical guarantees without sacrificing generative diversity. The plug-and-play asset retrieval/generation interface opens avenues for vast open-domain scene synthesis, including stylized, rare, or user-specific content not present in training datasets.
On the practical side, HOG-Layout is a viable backend for embodied AI simulation, interactive interior design, VR/AR content generation, and robotics, where semantically aligned, physically plausible environments are required under naturalistic, possibly ambiguous language control.
Future Directions
Open technical challenges include extending the system to outdoor or non-architectural environments, supporting dynamic or deformable objects, integrating user-feedback-driven optimization, and leveraging end-to-end differentiable modules for constraint learning and scene synthesis. Furthermore, coupling scene reasoning with downstream embodied tasks (navigation, manipulation, human interaction) will stress-test the generalization and reasoning limits of current VLM+LLM architectures.
Conclusion
HOG-Layout introduces a robust, modular framework for text-to-3D scene synthesis and editing, leveraging vision-language reasoning, retrieval-augmented LLM planning, and hierarchical physical optimization (2604.10772). Results show state-of-the-art semantic and physical fidelity, interactive editing, and extensibility to generative workflows. HOG-Layout sets a high bar for actionable, intuitive, and physically guaranteed scene design modules, providing a blueprint for future embodied intelligence research.