Elevate3D: High-Quality 3D Mesh Refinement
- Elevate3D is a two-stage framework that iteratively refines 3D meshes by alternately enhancing textures with HFS-SDEdit and updating geometry using monocular normal predictions.
- The method employs a view-by-view pipeline where unrefined regions are targeted for diffusion-based texture enhancement and precise depth integration via Poisson reconstruction.
- Empirical evaluations demonstrate that Elevate3D significantly outperforms prior models on perceptual and full-reference quality metrics, highlighting its robust asset improvement capabilities.
Elevate3D is a two-stage, view-by-view refinement framework designed to transform low-quality textured 3D meshes into high-quality 3D assets. It addresses the scarcity of high-quality 3D models in computer graphics and 3D vision by alternately enhancing texture and geometry through a novel pipeline. At the core of Elevate3D is HFS-SDEdit, a frequency-aware diffusion-based texture enhancer that operates in tandem with a geometry refinement process driven by monocular normal (or depth) predictions. The result is a model that systematically enforces multi-view consistency and aligns geometry with enhanced texture, outperforming recent alternatives on both perceptual and full-reference quality metrics (Ryu et al., 15 Jul 2025).
1. Pipeline Overview
Elevate3D iterates over a set of virtual camera viewpoints. At each iteration, the pipeline processes the partially refined mesh through the following steps:
- Rendering: The current mesh is rendered from the current camera to obtain an image and a binary mask indicating “unrefined” pixels unseen in previous views.
- Texture Enhancement (HFS-SDEdit): Unrefined regions identified by the mask are refined using HFS-SDEdit, which synthesizes a refined image with improved texture quality by selectively updating low-frequency image components while preserving the original high-frequency detail.
- Normal Estimation: A monocular normal predictor estimates a normal map from the refined image.
- Depth Integration and Fusion: The predicted normals are used to reconstruct a small, reliable depth patch via regularized normal integration, which is then fused into the current mesh using Poisson surface reconstruction, resulting in an updated mesh.
- Texture Projection: The refined image is projected onto the updated mesh with occlusion-aware, normal-weighted blending, producing the textured mesh for the next view.
The pipeline proceeds until the camera views cover the object, that is, until the unrefined coverage falls below a predefined threshold.
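The control flow of the steps above can be sketched as a short loop. This is a minimal illustration only: the `Stages` container, its field names, and the `coverage_threshold` parameter are hypothetical placeholders, not names from the paper, and each stage is passed in as a callable rather than implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class Stages:
    """Hypothetical stage callables; none of these names come from the paper."""
    render: Callable[[Any, Any], tuple]              # (mesh, cam) -> (image, unrefined_mask)
    enhance_texture: Callable[[Any, Any], Any]       # (image, mask) -> refined image
    predict_normals: Callable[[Any], Any]            # refined image -> normal map
    fuse_geometry: Callable[[Any, Any, Any], Any]    # (mesh, normals, cam) -> updated mesh
    project_texture: Callable[[Any, Any, Any], Any]  # (mesh, refined image, cam) -> textured mesh
    unrefined_fraction: Callable[[Any], float]       # mesh -> fraction of unrefined surface


def refine_mesh(mesh: Any, cameras: Iterable[Any], stages: Stages,
                coverage_threshold: float) -> Any:
    """Visit views one by one, alternating texture and geometry updates."""
    for cam in cameras:
        # Stop once the remaining unrefined surface is below the threshold.
        if stages.unrefined_fraction(mesh) < coverage_threshold:
            break
        image, mask = stages.render(mesh, cam)
        refined = stages.enhance_texture(image, mask)      # texture step
        normals = stages.predict_normals(refined)          # normals from the refined image
        mesh = stages.fuse_geometry(mesh, normals, cam)    # geometry step
        mesh = stages.project_texture(mesh, refined, cam)  # write texture back
    return mesh
```

Passing the stages in as callables keeps the loop independent of any particular renderer, diffusion model, or normal predictor, which is convenient for testing the control flow with toy stand-ins.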
2. View-by-View Alternating Refinement Strategy
Elevate3D’s core loop alternates between two major operations for each viewpoint:
- Texture Refinement with HFS-SDEdit:
  - The current mesh is rendered to produce an RGB image and normals.
  - A mask identifies pixels not previously refined.
  - HFS-SDEdit refines the rendered image in unrefined regions by leveraging a high-frequency-swap mechanism during diffusion sampling, which retains detailed structure while permitting enhancement of global appearance.
  - Blending ensures that only eligible pixels from the refined image are incorporated into the mesh’s texture.
- Geometry Refinement:
  - Monocular normal estimation produces a normal map from the refined image.
  - An orthographically rasterized depth map is computed from the current mesh.
  - A regularized energy minimization yields a corrected depth field.
  - Unreliable regions are filtered using bilateral weights and morphological erosion.
  - Valid depth patches are fused into the mesh with Poisson reconstruction.
  - The refined image is then projected onto the updated mesh with occlusion- and normal-based weighting, ensuring stable, consistent texture–geometry alignment.
This interleaving design guarantees that fresh geometric updates only impact texture-stabilized regions, while texture refinement never overwrites previously updated views.
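To make the regularized normal integration step concrete, the sketch below solves a small quadratic version of the problem with NumPy. It is an illustration under stated assumptions, not the paper's formulation: it assumes the convention that the normal is proportional to (-dz/dx, -dz/dy, 1) under orthographic projection, uses plain least squares instead of bilateral weighting, and the function name `integrate_normals` and the weight `lam` are hypothetical.

```python
import numpy as np


def integrate_normals(normals, z_prior, lam=0.1):
    """Least-squares depth from normals, regularized toward a prior depth.

    normals: (H, W, 3) unit normals, assumed convention n ~ (-dz/dx, -dz/dy, 1).
    z_prior: (H, W) depth rendered from the current mesh.
    lam:     weight of the (z - z_prior)^2 term (illustrative value).
    """
    H, W = z_prior.shape
    nz = np.clip(normals[..., 2], 1e-3, None)  # avoid dividing by ~0 at grazing angles
    p = -normals[..., 0] / nz                  # target dz/dx
    q = -normals[..., 1] / nz                  # target dz/dy
    idx = np.arange(H * W).reshape(H, W)

    rows, rhs = [], []

    def add_diff(a, b, target):
        # One equation per neighbouring pixel pair: z[b] - z[a] = target.
        for ia, ib, t in zip(a.ravel(), b.ravel(), target.ravel()):
            r = np.zeros(H * W)
            r[ib], r[ia] = 1.0, -1.0
            rows.append(r)
            rhs.append(t)

    add_diff(idx[:, :-1], idx[:, 1:], p[:, :-1])  # forward differences along x
    add_diff(idx[:-1, :], idx[1:, :], q[:-1, :])  # forward differences along y

    # Stack gradient equations with the prior term sqrt(lam) * (z - z_prior) = 0.
    A = np.vstack(rows + [np.sqrt(lam) * np.eye(H * W)])
    b = np.concatenate([np.array(rhs), np.sqrt(lam) * z_prior.ravel()])
    z, *_ = np.linalg.lstsq(A, b, rcond=None)
    return z.reshape(H, W)
```

The dense system is only practical for small patches; a real implementation would use sparse matrices. The prior term is what keeps the solution anchored to the existing mesh, since normals alone determine depth only up to an additive constant.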
3. Mathematical and Algorithmic Foundations
Elevate3D builds on both diffusion-based image synthesis and geometry processing:
- HFS-SDEdit introduces a per-step high-frequency swap to a pretrained diffusion model (FLUX rectified flow), with no new trainable parameters or additional losses. The input is first noised to the initial step; then, during the sampling steps, the high-frequency component of the current latent is swapped for that of the original image. The frequency split uses a Gaussian low-pass filter applied to the noisy and denoised latents. This locks high frequencies to the original image while allowing lower frequencies to adapt to the learned distribution.
- Masked Blending in texture refinement: refined content is blended with the original content under a downsampled refinement mask.
- Regularized Normal Integration for geometry: The energy simultaneously aligns surface gradients from predicted normals with depth changes and regularizes toward the previous mesh. The global geometry is stably updated by bilaterally weighted, erosion-filtered patch selection and Poisson re-integration.
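The high-frequency swap and the masked blend described above can be illustrated with a few lines of NumPy. This is a sketch under assumptions rather than the paper's exact update: the function names are hypothetical, the Gaussian low-pass is applied in the Fourier domain (which implies circular boundaries), and the caller is assumed to supply a `reference` derived from the original image at a matching noise level.

```python
import numpy as np


def gaussian_lowpass(x, sigma):
    """Gaussian low-pass over the last two axes, applied in the Fourier domain."""
    H, W = x.shape[-2:]
    fy = np.fft.fftfreq(H)[:, None]  # cycles per pixel
    fx = np.fft.fftfreq(W)[None, :]
    # Fourier transform of a Gaussian with standard deviation `sigma` pixels.
    kernel = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * kernel))


def high_frequency_swap(latent, reference, sigma):
    """Keep the low frequencies of `latent`, take high frequencies from `reference`."""
    low = gaussian_lowpass(latent, sigma)
    high_ref = reference - gaussian_lowpass(reference, sigma)
    return low + high_ref


def masked_blend(refined, original, mask):
    """Use `refined` where mask is 1 (unrefined pixels) and `original` elsewhere."""
    return mask * refined + (1.0 - mask) * original
```

Because the swap is a fixed linear operation on the latent, it adds no trainable parameters, which is consistent with the training-free design described above. A larger `sigma` hands more of the spectrum to the reference and leaves less for the diffusion model to change.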
4. Implementation and Evaluation
- Texture Sampling Details: The backbone is the large FLUX rectified-flow diffusion model. The sampler is configured by the number of sampling steps, the initial noise step, the step at which the high-frequency swap stops, and the Gaussian smoothing width.
- Geometry Prediction: Employs off-the-shelf monocular normal predictors (e.g., Mari-E2E), with the bilateral normal integration (BiNI) scheme of Cao et al. for depth integration, combined with regularization.
- View Schedule: Initial views are sampled over a fixed set of elevations and azimuths; subsequent views are chosen to maximize the remaining unrefined texture using cosine-weighted selection. Iteration continues until the unrefined region falls below a threshold.
- Training: HFS-SDEdit and normal predictors are used without further training or fine-tuning. No extra augmentation is performed.
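The cosine-weighted view selection in the view schedule can be sketched as follows. This is a simplified illustration, not the paper's procedure: it ignores occlusion and face area, and the function name and argument layout are hypothetical.

```python
import numpy as np


def select_next_view(cam_dirs, face_normals, unrefined):
    """Pick the candidate view with the largest cosine-weighted unrefined coverage.

    cam_dirs:     (C, 3) unit vectors pointing from the object toward each camera.
    face_normals: (F, 3) unit face normals.
    unrefined:    (F,) array with 1 for faces not yet refined, 0 otherwise.
    """
    # Cosine between each face normal and each view direction; back-facing -> 0.
    cos = np.clip(face_normals @ cam_dirs.T, 0.0, None)  # (F, C)
    scores = (cos * unrefined[:, None]).sum(axis=0)      # (C,)
    return int(np.argmax(scores)), scores
```

Weighting by the cosine favors views that see unrefined faces head-on, where a rendered pixel covers the surface most faithfully, over views that see them at grazing angles.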
Quantitatively, on 59 degraded GSO scans (with face decimation and Gaussian blur), Elevate3D outperforms DreamGaussian, DiSR-NeRF, and MagicBoost on all four metrics. The table lists its gains relative to the baselines (marked Ref):
| Method | MUSIQ ↑ | LIQE ↑ | TOPIQ ↑ | Q-Align ↑ |
|---|---|---|---|---|
| DreamGaussian | Ref | Ref | Ref | Ref |
| DiSR-NeRF | Ref | Ref | Ref | Ref |
| MagicBoost | Ref | Ref | Ref | Ref |
| Elevate3D | +5–18 | +0.6–1.5 | +0.06–0.14 | +0.5–0.7 |
On LSDIR image restoration, HFS-SDEdit consistently outperforms SDEdit and NC-SDEdit on LPIPS, MUSIQ, LIQE, and TOPIQ.
Ablation studies reveal that omitting geometry refinement leads to high-quality textures on an unaltered coarse mesh, while omitting texture refinement impairs geometric improvement, and removing the normal-integration regularizer causes severe mesh distortion. Application to TRELLIS-generated models demonstrates substantial qualitative improvement in real-world scene sharpness.
5. Limitations and Future Directions
Elevate3D’s primary bottleneck lies in the necessity of processing each view sequentially with diffusion-based sampling, causing linear runtime scaling with the number of views (approximately 4 minutes for 5–6 views on an RTX A6000). Prospective advancements could incorporate fast samplers (e.g., SD3 Turbo) or multi-view amortized strategies to reduce computational burden.
Another limitation is the reliance on monocular normal prediction: highly specular or textureless areas can degrade prediction quality, although the energy-based regularization mitigates drastic artifacts. A plausible implication is that extending geometry refinement to optimize mesh topology (e.g., dynamic remeshing) or integrating neural implicit representations may further enhance detail and fidelity alignment.
6. Significance and Related Work
Elevate3D distinguishes itself by interleaving a high-fidelity texture updater (HFS-SDEdit) with a geometry updater grounded in monocular normal cues and strong regularization, using a view-by-view pipeline. This strategy ensures both multi-view consistency and alignment between texture and geometry, two aspects underaddressed by earlier methods.
Compared to prior workflows such as DreamGaussian (texture-only), DiSR-NeRF, MagicBoost, and TRELLIS outputs, which often neglect geometry refinement or rely solely on texture updating, Elevate3D’s joint refinement mechanism delivers production-level 3D assets from coarse scans or generative sources, without additional training or fine-tuning (Ryu et al., 15 Jul 2025).