
Dreamer XL: Towards High-Resolution Text-to-3D Generation via Trajectory Score Matching (2405.11252v1)

Published 18 May 2024 in cs.CV

Abstract: In this work, we propose a novel Trajectory Score Matching (TSM) method that aims to solve the pseudo ground truth inconsistency problem caused by the accumulated error in Interval Score Matching (ISM) when using the Denoising Diffusion Implicit Models (DDIM) inversion process. Unlike ISM which adopts the inversion process of DDIM to calculate on a single path, our TSM method leverages the inversion process of DDIM to generate two paths from the same starting point for calculation. Since both paths start from the same starting point, TSM can reduce the accumulated error compared to ISM, thus alleviating the problem of pseudo ground truth inconsistency. TSM enhances the stability and consistency of the model's generated paths during the distillation process. We demonstrate this experimentally and further show that ISM is a special case of TSM. Furthermore, to optimize the current multi-stage optimization process from high-resolution text to 3D generation, we adopt Stable Diffusion XL for guidance. In response to the issues of abnormal replication and splitting caused by unstable gradients during the 3D Gaussian splatting process when using Stable Diffusion XL, we propose a pixel-by-pixel gradient clipping method. Extensive experiments show that our model significantly surpasses the state-of-the-art models in terms of visual quality and performance. Code: \url{https://github.com/xingy038/Dreamer-XL}.

Authors (7)
  1. Xingyu Miao
  2. Haoran Duan
  3. Varun Ojha
  4. Jun Song
  5. Tejal Shah
  6. Yang Long
  7. Rajiv Ranjan

Summary

Understanding Trajectory Score Matching for Improved Text-to-3D Generation

Introduction

Creating high-quality 3D content from natural language descriptions has become a significant focus in various fields, from virtual and augmented reality to gaming and animation. The primary challenge has been generating accurate 3D models without extensive manual effort. This paper introduces a new method called Trajectory Score Matching (TSM), which addresses some limitations of existing methods in generating high-quality 3D models using text descriptions.

Key Ideas

Background on Existing Methods

Most text-to-3D generation methods leverage pre-trained text-to-image diffusion models to guide the optimization of 3D neural representations, such as Neural Radiance Fields (NeRF) and 3D Gaussian splatting, so that rendered views match the text description. A common approach is Score Distillation Sampling (SDS), which optimizes the 3D model by distilling knowledge from a pre-trained 2D model. However, SDS often produces over-smoothed, averaged outputs that lack the desired detail and clarity.
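The SDS gradient described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the original implementation: the weighting w_t and the noise tensors are placeholders, and the U-Net Jacobian is omitted, as in the original SDS derivation.

```python
import numpy as np

def sds_gradient(eps_pred, eps, w_t):
    """Simplified Score Distillation Sampling (SDS) update direction.

    SDS nudges a rendered view toward the diffusion prior using the
    residual between the diffusion model's predicted noise (eps_pred)
    and the noise actually added (eps), weighted by w_t.
    """
    return w_t * (eps_pred - eps)

# When the model's prediction matches the added noise exactly,
# the gradient vanishes and the 3D parameters stop changing.
residual_free = sds_gradient(np.ones((4, 4)), np.ones((4, 4)), 0.5)
```

Averaging this residual over many random timesteps and noise draws is what produces the over-smoothed look the paper criticizes.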

Interval Score Matching (ISM) to the Rescue

One noteworthy improvement over SDS is the Interval Score Matching (ISM) technique introduced in prior work. ISM reduces over-smoothing and maintains better consistency by generating pseudo-ground truths through the DDIM (Denoising Diffusion Implicit Models) inversion process. However, because each DDIM inversion step introduces a small error, these errors accumulate along the inversion path, leading to inconsistent pseudo-ground truths in some regions of the generated 3D model.

Trajectory Score Matching (TSM)

To mitigate these accumulated errors, the paper introduces Trajectory Score Matching (TSM). TSM generates two paths from the same starting point during the DDIM process. By following these dual paths, TSM reduces the inconsistencies that the single-path ISM method struggles with. Here are the key steps involved:

  1. Starting Point Generation: Like ISM, TSM uses DDIM inversion to predict noisy latents.
  2. Dual Path Creation: From the same initial point, TSM computes two paths: one with less noise (x_μ) and one with more noise (x_t).
  3. Minimizing Errors: By comparing these two paths, TSM aims to ensure that the generated 3D model remains consistent and detailed, avoiding the cumulative error issues seen in ISM.
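The steps above can be sketched as a toy NumPy example. The DDIM step follows the standard deterministic update; the TSM residual (a weighted difference of scores predicted on the two paths) is an illustrative simplification, not the paper's exact loss.

```python
import numpy as np

def ddim_invert_step(x, eps, alpha_cur, alpha_next):
    """One deterministic DDIM step moving a latent from noise level
    alpha_cur to alpha_next (alpha = cumulative signal coefficient).

    Inversion reuses the predicted noise eps; the small per-step error
    this introduces is what accumulates along a long inversion path.
    """
    x0_pred = (x - np.sqrt(1.0 - alpha_cur) * eps) / np.sqrt(alpha_cur)
    return np.sqrt(alpha_next) * x0_pred + np.sqrt(1.0 - alpha_next) * eps

def tsm_residual(eps_at_mu, eps_at_t, w):
    """Illustrative TSM update direction: the weighted difference
    between the scores predicted on the lower-noise path (x_mu) and
    the higher-noise path (x_t), which share a starting point."""
    return w * (eps_at_t - eps_at_mu)

# Invert a clean latent (alpha = 1.0) one step toward alpha = 0.5.
x_start = np.zeros(4)
eps_hat = np.ones(4)
x_noisier = ddim_invert_step(x_start, eps_hat, 1.0, 0.5)
```

Because both paths originate from the same latent, errors common to the shared prefix cancel in the residual, which is the intuition behind TSM's reduced accumulated error.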

Practical Enhancements

Stable Diffusion XL (SDXL) for Guidance: Another innovative aspect of this paper is leveraging the high-resolution capabilities of Stable Diffusion XL (SDXL). Unlike previous models that are limited to lower resolutions, SDXL can generate 1024x1024 image outputs. This significantly improves the quality of the 3D models.

Pixel-by-Pixel Gradient Clipping: The paper also addresses gradient instability during the 3D Gaussian splatting process when guided by SDXL. The authors propose a pixel-by-pixel gradient clipping method that keeps the high-resolution models clear and detailed while avoiding artifacts such as the abnormal replication and splitting of Gaussians.
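A per-pixel clipping step in this spirit can be sketched as follows. This is a hedged NumPy illustration: the threshold max_norm and the (H, W, C) layout are assumptions, not the paper's exact formulation.

```python
import numpy as np

def pixelwise_clip(grad, max_norm):
    """Clip the gradient of each pixel independently.

    Any pixel whose gradient vector (over channels) exceeds max_norm
    is rescaled onto the max_norm ball; smaller gradients pass through
    unchanged. This caps per-pixel spikes that can destabilize 3D
    Gaussian splatting, instead of rescaling the whole image at once.
    grad: array of shape (H, W, C).
    """
    norms = np.linalg.norm(grad, axis=-1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return grad * scale

# A single spiky pixel is rescaled; calm pixels are untouched.
g = np.zeros((1, 2, 3))
g[0, 0] = [3.0, 4.0, 0.0]   # norm 5.0: clipped down to norm 1.0
g[0, 1] = [0.1, 0.0, 0.0]   # norm 0.1: unchanged
clipped = pixelwise_clip(g, 1.0)
```

Clipping per pixel rather than globally means one unstable region cannot shrink the useful gradient signal everywhere else in the image.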

Results

The experimental results showcased in the paper are quite compelling, with TSM significantly outperforming other state-of-the-art methods. Here are some highlights:

  • Visual Quality: The results generated by Dreamer XL using TSM appear crisper and more detailed compared to previous methods. For instance, the intricate details in objects and character textures are rendered more accurately.
  • Consistency: TSM shows fewer inconsistencies and maintains better semantic alignment with the input text prompts compared to ISM, which often shows averaging effects.
  • Quantitative Metrics: The paper reports an improved CLIP-Score of up to 0.297 and a reduction in A-LPIPS, indicating better artifact mitigation in 3D models.

Implications and Future Work

The introduction of TSM and the novel gradient clipping technique opens new doors for more accurate and high-resolution 3D content generation from text. This can have profound impacts on fields like virtual reality, gaming, and digital content creation by enabling faster and more scalable 3D modeling processes.

Future Directions:

  1. Integration into Commercial Tools: Given its promising results, TSM could be incorporated into commercial 3D modeling tools to streamline the content creation pipeline.
  2. Further Optimization: While TSM is a significant step forward, further optimization and scaling could make it even more efficient.
  3. Exploration of Other High-Resolution Diffusion Models: Future research could explore other advanced high-resolution diffusion models to see if they can provide even better guidance for 3D generation.

Conclusion

The introduction of Trajectory Score Matching (TSM) addresses a critical limitation in existing text-to-3D generation methods, providing a more consistent and detailed output. By reducing accumulated errors through a dual-path approach and leveraging advanced techniques like pixel-by-pixel gradient clipping, this paper sets a new standard in the field of 3D content generation. While there is still room for improvement, TSM marks a substantial advancement towards more efficient and accurate 3D modeling from text.
