
Dreamer XL: Towards High-Resolution Text-to-3D Generation via Trajectory Score Matching (2405.11252v1)

Published 18 May 2024 in cs.CV

Abstract: In this work, we propose a novel Trajectory Score Matching (TSM) method that aims to solve the pseudo ground truth inconsistency problem caused by the accumulated error in Interval Score Matching (ISM) when using the Denoising Diffusion Implicit Models (DDIM) inversion process. Unlike ISM which adopts the inversion process of DDIM to calculate on a single path, our TSM method leverages the inversion process of DDIM to generate two paths from the same starting point for calculation. Since both paths start from the same starting point, TSM can reduce the accumulated error compared to ISM, thus alleviating the problem of pseudo ground truth inconsistency. TSM enhances the stability and consistency of the model's generated paths during the distillation process. We demonstrate this experimentally and further show that ISM is a special case of TSM. Furthermore, to optimize the current multi-stage optimization process from high-resolution text to 3D generation, we adopt Stable Diffusion XL for guidance. In response to the issues of abnormal replication and splitting caused by unstable gradients during the 3D Gaussian splatting process when using Stable Diffusion XL, we propose a pixel-by-pixel gradient clipping method. Extensive experiments show that our model significantly surpasses the state-of-the-art models in terms of visual quality and performance. Code: \url{https://github.com/xingy038/Dreamer-XL}.

Authors (7)
  1. Xingyu Miao
  2. Haoran Duan
  3. Varun Ojha
  4. Jun Song
  5. Tejal Shah
  6. Yang Long
  7. Rajiv Ranjan

Summary

Understanding Trajectory Score Matching for Improved Text-to-3D Generation

Introduction

Creating high-quality 3D content from natural language descriptions has become a significant focus in various fields, from virtual and augmented reality to gaming and animation. The primary challenge has been generating accurate 3D models without extensive manual effort. This paper introduces a new method called Trajectory Score Matching (TSM), which addresses some limitations of existing methods in generating high-quality 3D models using text descriptions.

Key Ideas

Background on Existing Methods

Most text-to-3D generation methods leverage pre-trained text-to-image diffusion models to guide the optimization of 3D neural representations, such as Neural Radiance Fields (NeRF) and 3D Gaussian splatting, so that rendered views match the text description. A common approach is Score Distillation Sampling (SDS), which optimizes the 3D model by distilling knowledge from a pre-trained 2D model. However, SDS often produces over-smoothed, averaged outputs that lack the desired detail and clarity.
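The SDS gradient described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the original implementation: the weighting w_t and the noise tensors are placeholders, and the U-Net Jacobian is omitted, as in the original SDS derivation.

```python
import numpy as np

def sds_gradient(eps_pred, eps, w_t):
    """Simplified Score Distillation Sampling (SDS) update direction.

    SDS nudges a rendered view toward the diffusion prior using the
    residual between the diffusion model's predicted noise (eps_pred)
    and the noise actually added (eps), weighted by w_t.
    """
    return w_t * (eps_pred - eps)

# When the model's prediction matches the added noise exactly,
# the gradient vanishes and the 3D parameters stop changing.
residual_free = sds_gradient(np.ones((4, 4)), np.ones((4, 4)), 0.5)
```

Averaging this residual over many random timesteps and noise draws is what produces the over-smoothed look the paper criticizes.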

Interval Score Matching (ISM) to the Rescue

One noteworthy improvement over SDS is the Interval Score Matching (ISM) technique introduced in prior work. ISM reduces over-smoothing and maintains better consistency by generating pseudo-ground truths through the DDIM (Denoising Diffusion Implicit Models) inversion process. However, because each DDIM inversion step introduces a small error, these errors accumulate along the inversion path, leading to inconsistent pseudo-ground truths in some regions of the generated 3D model.

Trajectory Score Matching (TSM)

To mitigate these accumulated errors, the paper introduces Trajectory Score Matching (TSM). TSM generates two paths from the same starting point during the DDIM process. By following these dual paths, TSM reduces the inconsistencies that the single-path ISM method struggles with. Here are the key steps involved:

  1. Starting Point Generation: Like ISM, TSM uses DDIM inversion to predict noisy latents.
  2. Dual Path Creation: From the same initial point, TSM computes two paths: one with less noise (x_μ) and one with more noise (x_t).
  3. Minimizing Errors: By comparing these two paths, TSM aims to ensure that the generated 3D model remains consistent and detailed, avoiding the cumulative error issues seen in ISM.
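The steps above can be sketched as a toy NumPy example. The DDIM step follows the standard deterministic update; the TSM residual (a weighted difference of scores predicted on the two paths) is an illustrative simplification, not the paper's exact loss.

```python
import numpy as np

def ddim_invert_step(x, eps, alpha_cur, alpha_next):
    """One deterministic DDIM step moving a latent from noise level
    alpha_cur to alpha_next (alpha = cumulative signal coefficient).

    Inversion reuses the predicted noise eps; the small per-step error
    this introduces is what accumulates along a long inversion path.
    """
    x0_pred = (x - np.sqrt(1.0 - alpha_cur) * eps) / np.sqrt(alpha_cur)
    return np.sqrt(alpha_next) * x0_pred + np.sqrt(1.0 - alpha_next) * eps

def tsm_residual(eps_at_mu, eps_at_t, w):
    """Illustrative TSM update direction: the weighted difference
    between the scores predicted on the lower-noise path (x_mu) and
    the higher-noise path (x_t), which share a starting point."""
    return w * (eps_at_t - eps_at_mu)

# Invert a clean latent (alpha = 1.0) one step toward alpha = 0.5.
x_start = np.zeros(4)
eps_hat = np.ones(4)
x_noisier = ddim_invert_step(x_start, eps_hat, 1.0, 0.5)
```

Because both paths originate from the same latent, errors common to the shared prefix cancel in the residual, which is the intuition behind TSM's reduced accumulated error.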

Practical Enhancements

Stable Diffusion XL (SDXL) for Guidance: Another innovative aspect of this paper is leveraging the high-resolution capabilities of Stable Diffusion XL (SDXL). Unlike previous models that are limited to lower resolutions, SDXL can generate 1024x1024 image outputs. This significantly improves the quality of the 3D models.

Pixel-by-Pixel Gradient Clipping: The paper also addresses gradient instability during the 3D Gaussian splatting process when guided by SDXL. The authors propose a pixel-by-pixel gradient clipping method that keeps the high-resolution models clear and detailed while avoiding artifacts such as the abnormal replication and splitting of Gaussians.
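A per-pixel clipping step in this spirit can be sketched as follows. This is a hedged NumPy illustration: the threshold max_norm and the (H, W, C) layout are assumptions, not the paper's exact formulation.

```python
import numpy as np

def pixelwise_clip(grad, max_norm):
    """Clip the gradient of each pixel independently.

    Any pixel whose gradient vector (over channels) exceeds max_norm
    is rescaled onto the max_norm ball; smaller gradients pass through
    unchanged. This caps per-pixel spikes that can destabilize 3D
    Gaussian splatting, instead of rescaling the whole image at once.
    grad: array of shape (H, W, C).
    """
    norms = np.linalg.norm(grad, axis=-1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return grad * scale

# A single spiky pixel is rescaled; calm pixels are untouched.
g = np.zeros((1, 2, 3))
g[0, 0] = [3.0, 4.0, 0.0]   # norm 5.0: clipped down to norm 1.0
g[0, 1] = [0.1, 0.0, 0.0]   # norm 0.1: unchanged
clipped = pixelwise_clip(g, 1.0)
```

Clipping per pixel rather than globally means one unstable region cannot shrink the useful gradient signal everywhere else in the image.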

Results

The experimental results showcased in the paper are quite compelling, with TSM significantly outperforming other state-of-the-art methods. Here are some highlights:

  • Visual Quality: The results generated by Dreamer XL using TSM appear crisper and more detailed compared to previous methods. For instance, the intricate details in objects and character textures are rendered more accurately.
  • Consistency: TSM shows fewer inconsistencies and maintains better semantic alignment with the input text prompts compared to ISM, which often shows averaging effects.
  • Quantitative Metrics: The paper reports an improved CLIP-Score of up to 0.297 and a reduction in A-LPIPS, indicating better artifact mitigation in 3D models.

Implications and Future Work

The introduction of TSM and the novel gradient clipping technique opens new doors for more accurate and high-resolution 3D content generation from text. This can have profound impacts on fields like virtual reality, gaming, and digital content creation by enabling faster and more scalable 3D modeling processes.

Future Directions:

  1. Integration into Commercial Tools: Given its promising results, TSM could be incorporated into commercial 3D modeling tools to streamline the content creation pipeline.
  2. Further Optimization: While TSM is a significant step forward, further optimization and scaling could make it even more efficient.
  3. Exploration of Other High-Resolution Diffusion Models: Future research could explore other advanced high-resolution diffusion models to see if they can provide even better guidance for 3D generation.

Conclusion

The introduction of Trajectory Score Matching (TSM) addresses a critical limitation in existing text-to-3D generation methods, providing a more consistent and detailed output. By reducing accumulated errors through a dual-path approach and leveraging advanced techniques like pixel-by-pixel gradient clipping, this paper sets a new standard in the field of 3D content generation. While there is still room for improvement, TSM marks a substantial advancement towards more efficient and accurate 3D modeling from text.
