Visual-Guided 3D Scene Mesh Generation
- Visual-guided 3D scene mesh generation is a process that creates detailed, textured 3D meshes from limited visual inputs using neural and geometric priors.
- It employs techniques like depth map estimation, masked autoencoder completion, and implicit neural radiance fields to ensure consistency across views.
- Integrating adversarial refinement and mesh extraction, the method achieves high-fidelity, globally consistent outputs suitable for graphics and robotics applications.
Visual-guided 3D scene mesh generation refers to the automatic creation of detailed, textured 3D scene meshes from visual signals—including single or multi-view images, video, depth maps, or text prompts that are rendered to images—using neural and/or geometric priors. Recent advances focus on producing globally consistent, high-fidelity meshes from limited visual input by integrating explicit geometric reasoning, implicit neural fields, and visual consistency enhancement mechanisms. This article presents a technical overview of the state-of-the-art in this area, describing representative frameworks, underlying mathematical principles, pipeline architectures, and performance characteristics, with a particular focus on the Scene123 system (Yang et al., 10 Aug 2024) and its context within the evolving research landscape.
1. Foundational Principles of Visual-guided 3D Scene Mesh Generation
Visual-guided 3D scene mesh generation solves the ill-posed inverse problem of inferring spatially complete, consistent, and photorealistic 3D geometry and appearance from incomplete visual signals. The principal methodological challenge is guaranteeing inter-view and global consistency when synthesizing novel, unseen regions, especially when only a single view or prompt is available. This problem is characterized by:
- View blindness: Many scene regions are occluded or never visible in the input.
- Consistency: Synthesized views must align photometrically, semantically, and geometrically.
- High dimensionality: The output mesh must capture not only object shape, but complex spatial layouts, topologies, and textures on a scene scale.
Solutions exploit visual priors from pretrained diffusion/video models, structured neural representations (e.g., Neural Radiance Fields/NeRFs), explicit geometric estimation, and inter-view consistency constraints.
2. Video-Assisted Multi-View Generation and Masked Autoencoder Completion
Frameworks such as Scene123 (Yang et al., 10 Aug 2024) employ a video-assisted generation stage to bootstrap multi-view support from a single image or text-derived image. The core pipeline comprises:
- Depth Map Estimation: Given an input image $I_0$, a depth map $D_0$ is predicted using a monocular estimator (e.g., LeReS).
- Depth-Image-Based Rendering (DIBR) for view synthesis (see the sketch after this list):
  $$p_k \sim K \, T_k \, D_0(p_0) \, K^{-1} \, p_0,$$
  where $K$ denotes the camera intrinsics and $T_k$ represents the camera pose of the $k$-th novel view.
- Missing Region Inpainting: DIBR-induced holes are filled by a masked autoencoder (MAE) leveraging a global VQ-codebook for information sharing between all views.
- For each masked view $\tilde{I}_k$, the MAE:
  - Encodes the masked image into latent tokens.
  - Cross-attends to the shared codebook $\mathcal{C}$.
  - Decodes to yield the hole-filled view $\hat{I}_k$.
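The DIBR warp referenced in the list above can be summarized by a short routine. The following is a minimal sketch under simplifying assumptions (pinhole intrinsics shared across views, nearest-neighbor splatting, no z-buffering for occlusions); function and variable names are illustrative, not Scene123's implementation.

```python
import numpy as np

def dibr_warp(image, depth, K, R, t):
    """Warp a source image into a novel view via depth-image-based rendering.

    image : (H, W, 3) source RGB image
    depth : (H, W) metric depth of the source view
    K     : (3, 3) pinhole intrinsics (assumed shared by both views)
    R, t  : rotation (3, 3) and translation (3,) from source to target camera
    Returns the warped image and a hole mask (True where no source pixel landed).
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Back-project: X = D(p) * K^-1 * p, then move points into the target frame.
    cam_pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    tgt_pts = R @ cam_pts + t.reshape(3, 1)

    # Project into the target view: p' ~ K (R X + t).
    proj = K @ tgt_pts
    z = proj[2]
    u2 = np.round(proj[0] / z).astype(int)
    v2 = np.round(proj[1] / z).astype(int)

    warped = np.zeros_like(image)
    mask = np.ones((H, W), dtype=bool)  # True = hole to be inpainted by the MAE
    valid = (z > 0) & (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    src = image.reshape(-1, 3)
    # Nearest-neighbor splatting; a full implementation would resolve occlusions
    # with a z-buffer instead of letting later pixels overwrite earlier ones.
    warped[v2[valid], u2[valid]] = src[valid]
    mask[v2[valid], u2[valid]] = False
    return warped, mask
```

The returned hole mask marks exactly the regions that the masked autoencoder is asked to inpaint.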
Consistency across the inpainted regions is further improved by adding a codebook-shared cross-view regularization term and, optionally, a view-overlap consistency loss.
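As an illustration only (the exact form of Scene123's loss is not reproduced here), a view-overlap consistency term of this kind can be written as an L1 penalty on disagreement between a completed view and its warp into an overlapping neighbor:
$$\mathcal{L}_{\text{overlap}} = \sum_{(i,j)} \big\| M_{ij} \odot \big( \hat{I}_i - \mathcal{W}_{j \to i}(\hat{I}_j) \big) \big\|_1,$$
where $M_{ij}$ masks the region in which views $i$ and $j$ overlap and $\mathcal{W}_{j \to i}$ is the DIBR warp defined above.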
3. Implicit Neural Field Optimization and Consistency Losses
The synthesized images and estimated depths are then used as support for an implicit neural radiance field (NeRF), producing dense, view-consistent volumetric reconstructions. The NeRF learns a mapping
$$F_\Theta : (\mathbf{x}, \mathbf{d}) \mapsto (\sigma, \mathbf{c}),$$
yielding volume density $\sigma$ and color $\mathbf{c}$ at each 3D location $\mathbf{x}$ and view direction $\mathbf{d}$.
Classical volume rendering is performed along each camera ray:
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \big(1 - \exp(-\sigma_i \delta_i)\big)\, \mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big),$$
where $\delta_i$ is the spacing between adjacent samples. A photometric consistency loss over the support views ties the NeRF output to the masked autoencoder's completions:
$$\mathcal{L}_{\text{photo}} = \sum_k \big\| \hat{I}_k^{\text{NeRF}} - \hat{I}_k \big\|_2^2,$$
where $\hat{I}_k^{\text{NeRF}}$ is the rendering from the $k$-th support pose and $\hat{I}_k$ the corresponding MAE-completed view. Depth and transmittance losses, along with GAN-based texture refinement (described below), further enhance geometry and appearance.
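For concreteness, the volume-rendering quadrature and photometric term above can be sketched as follows. This is a generic NeRF-style implementation in PyTorch, with `field` standing in for the radiance network; it is not taken from the Scene123 code.

```python
import torch

def render_rays(field, origins, dirs, near=0.1, far=10.0, n_samples=64):
    """Composite colors along rays with the standard NeRF quadrature.

    field   : callable (pts, dirs) -> (sigma, rgb), a placeholder radiance field
    origins : (R, 3) ray origins; dirs : (R, 3) unit ray directions
    Returns per-ray RGB estimates of shape (R, 3).
    """
    t = torch.linspace(near, far, n_samples)                          # (N,)
    pts = origins[:, None, :] + dirs[:, None, :] * t[None, :, None]   # (R, N, 3)
    view = dirs[:, None, :].expand_as(pts)

    sigma, rgb = field(pts, view)                                     # (R, N), (R, N, 3)
    delta = torch.diff(t, append=t[-1:] + 1e10)                       # sample spacing (N,)

    alpha = 1.0 - torch.exp(-sigma * delta)                           # (R, N)
    # Accumulated transmittance T_i = prod_{j<i} (1 - alpha_j).
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = trans * alpha                                           # (R, N)
    return (weights[..., None] * rgb).sum(dim=-2)                     # (R, 3)

def photometric_loss(rendered, mae_completed):
    """L2 photometric consistency against the MAE-completed support view."""
    return ((rendered - mae_completed) ** 2).mean()
```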
4. Adversarial Refinement and Texture Fidelity
Scene123 employs a GAN-based adversarial objective to close the gap between NeRF-synthesized renderings and the high-frequency textures of multi-view video diffusion outputs:
$$\min_{\Theta} \max_{\phi} \; \mathbb{E}_{x \sim p_{\text{vid}}}\big[\log D_\phi(x)\big] + \mathbb{E}_{\hat{x} \sim p_{\text{NeRF}}}\big[\log\big(1 - D_\phi(\hat{x})\big)\big].$$
Here, $p_{\text{vid}}$ denotes the distribution of video-generated images, and $p_{\text{NeRF}}$ that of the rendered NeRF images. The adversarial signal compels the NeRF to match the statistics of real RGB video sequences, enhancing albedo, texture sharpness, and visual plausibility in the resulting mesh.
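A minimal sketch of this adversarial refinement step is given below, using the binary cross-entropy form of the objective above; the discriminator `disc` and the weight `lambda_adv` are placeholders rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, video_frames, nerf_renders):
    """Train the discriminator to separate video-diffusion frames ("real")
    from NeRF renderings ("fake")."""
    real_logits = disc(video_frames)
    fake_logits = disc(nerf_renders.detach())   # stop gradients into the NeRF
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_loss(disc, nerf_renders, lambda_adv=0.1):
    """Adversarial term added to the NeRF objective so that renderings match
    the texture statistics of the video-generated views."""
    fake_logits = disc(nerf_renders)
    return lambda_adv * F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
```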
5. Mesh Extraction and Texturing
After the radiance field is optimized, explicit mesh extraction proceeds by thresholding the learned density field and applying the Marching Cubes algorithm:
$$\mathcal{M} = \operatorname{MarchingCubes}\big(\{\mathbf{x} \in \mathbb{R}^3 : \sigma(\mathbf{x}) > \tau\}\big),$$
where $\tau$ is a density threshold. Standard geometric post-processing (e.g., Laplacian smoothing) and parametric UV-unwrapping (e.g., xatlas) then enable surface texture baking by reprojecting per-vertex colors from the NeRF or the support images.
Resulting outputs are watertight, textured triangle meshes suitable for downstream graphics or robotics engines.
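The extraction and post-processing steps can be sketched as below; the use of scikit-image, trimesh, and the xatlas Python bindings, as well as the density threshold, are illustrative assumptions rather than Scene123's exact tooling.

```python
import trimesh
import xatlas
from skimage.measure import marching_cubes

def extract_mesh(density_grid, threshold=10.0, smooth_iters=10):
    """Turn a sampled NeRF density grid (D, H, W) into a smoothed, UV-unwrapped mesh."""
    # Marching Cubes on the density volume at the chosen iso-level.
    verts, faces, normals, _ = marching_cubes(density_grid, level=threshold)
    mesh = trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)

    # Laplacian smoothing to remove high-frequency extraction noise.
    trimesh.smoothing.filter_laplacian(mesh, iterations=smooth_iters)

    # UV-unwrap with xatlas so per-vertex colors can later be baked into a texture.
    vmapping, indices, uvs = xatlas.parametrize(mesh.vertices, mesh.faces)
    unwrapped = trimesh.Trimesh(vertices=mesh.vertices[vmapping], faces=indices)
    unwrapped.visual = trimesh.visual.TextureVisuals(uv=uvs)
    return unwrapped
```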
6. Broader Context: Variants, Applications, and Comparative Evaluation
Visual-guided 3D scene mesh generation has evolved rapidly to encompass single-view, multi-view, panoramic, and prompt-based scene reconstruction:
- Matrix-3D (Yang et al., 11 Aug 2025): Integrates panoramic video diffusion and feed-forward panoramic 3D reconstruction, with an optional optimization-based fidelity-refinement pipeline, achieving high PSNR/SSIM/LPIPS for explorable world meshes.
- Layout2Scene (Chen et al., 5 Jan 2025): Decouples scene into hybrid Gaussian/mesh representations, combines explicit 3D semantic layouts as prompts, and applies dual geometry/appearance diffusion with ControlNet guidance.
- MeshFormer (Liu et al., 19 Aug 2024): Leverages explicit sparse voxels, projection-aware attention, surface normal guidance, and SDF supervision for robust mesh learning from limited visual observations.
- PBR3DGen (Wei et al., 14 Mar 2025): Extends mesh generation to include per-vertex physically-based rendering (PBR) material maps, using vision-LLM priors in the multi-view diffusion stage for material/lighting disentanglement and NeuS for geometry.
- EvoScene (Zheng et al., 9 Dec 2025): Introduces an iterative, self-evolving loop alternating spatial prior point cloud estimation, visual-guided mesh refinement, and spatially-conditioned video diffusion to progressively complete and refine scene geometry and appearance.
Quantitative benchmarks (e.g., novel-view PSNR/SSIM/LPIPS, FID, CLIP similarity, mesh Chamfer distance/F-score, user preference rates) show that these methods substantially narrow the realism, diversity, and consistency gaps left by earlier pipelines. For example, Scene123 surpasses the prior state of the art in both qualitative and quantitative evaluations (Yang et al., 10 Aug 2024), and EvoScene achieves >80% win rates against strong baselines (Zheng et al., 9 Dec 2025).
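For reference, the mesh-level metrics mentioned above (Chamfer distance and F-score) are commonly computed from sampled surface points as in the following sketch; the distance threshold `tau` is an illustrative value.

```python
from scipy.spatial import cKDTree

def chamfer_and_fscore(pred_pts, gt_pts, tau=0.05):
    """Symmetric Chamfer distance and F-score between two (N, 3) point clouds.

    tau is the distance threshold used for precision/recall (illustrative value).
    """
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # nearest GT point per prediction
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # nearest prediction per GT point

    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore
```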
7. Future Directions and Limitations
Despite significant advances, challenges remain:
- Incomplete or unreliable depth cues can propagate errors into geometry completion and texture synthesis (Yang et al., 10 Aug 2024, Zheng et al., 9 Dec 2025).
- Multimodal generation (handling highly specular, transparent, or dynamic content) is limited by priors of current diffusion or radiance field models (Wei et al., 14 Mar 2025, Zheng et al., 9 Dec 2025).
- Scaling to real-time or large-scale unbounded scenes requires efficient spatial partitioning, crack-free mesh extraction (e.g., OcMesher (Ma et al., 2023)), and hybrid mesh-Gaussian models for optimizing speed vs. fidelity trade-offs (Huang et al., 8 Jun 2025).
- Editable scene representation, global physical plausibility (contact, collision, support), and explicit semantic/interaction reasoning remain open research problems (Weng et al., 2020, Chen et al., 5 Jan 2025).
Ongoing efforts target the integration of semantic layout guidance, more powerful vision-language interaction, and improved fusion of 2D/3D priors for holistic, editable 3D scene mesh generation at scale.
References:
- Scene123 (Yang et al., 10 Aug 2024)
- Matrix-3D (Yang et al., 11 Aug 2025)
- EvoScene (Zheng et al., 9 Dec 2025)
- MeshFormer (Liu et al., 19 Aug 2024)
- PBR3DGen (Wei et al., 14 Mar 2025)
- Layout2Scene (Chen et al., 5 Jan 2025)
- Hybrid Mesh–Gauss (Huang et al., 8 Jun 2025)
- OcMesher (Ma et al., 2023)
- Holistic 3D Human/Scene (Weng et al., 2020)
- Pixel2Mesh++ (Wen et al., 2022)
- MeshMVS (Shrestha et al., 2020)
- Smooth Mesh via Convex Optimization (Rosinol et al., 2021)
- Incremental VI Mesh (Rosinol et al., 2019)