
VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation (2403.17001v1)

Published 25 Mar 2024 in cs.CV and cs.MM

Abstract: Recent innovations in text-to-3D generation have featured Score Distillation Sampling (SDS), which enables zero-shot learning of implicit 3D models (NeRF) by directly distilling prior knowledge from 2D diffusion models. However, current SDS-based models still struggle with intricate text prompts and commonly produce distorted 3D models with unrealistic textures or cross-view inconsistencies. In this work, we introduce a novel Visual Prompt-guided text-to-3D diffusion model (VP3D) that explicitly unleashes the visual appearance knowledge in a 2D visual prompt to boost text-to-3D generation. Instead of supervising SDS solely with a text prompt, VP3D first capitalizes on a 2D diffusion model to generate a high-quality image from the input text, which subsequently acts as a visual prompt to strengthen SDS optimization with explicit visual appearance. Meanwhile, we couple the SDS optimization with an additional differentiable reward function that encourages the rendered images of the 3D model to visually align with the 2D visual prompt and semantically match the text prompt. Through extensive experiments, we show that the 2D visual prompt in our VP3D significantly eases the learning of the visual appearance of 3D models and thus leads to higher visual fidelity with more detailed textures. Notably, when the self-generated visual prompt is replaced with a given reference image, VP3D enables a new task of stylized text-to-3D generation. Our project page is available at https://vp3d-cvpr24.github.io.
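To make the abstract's training recipe concrete, below is a minimal, hypothetical sketch of one optimization step that couples an SDS-style gradient (conditioned on both the text and the generated visual prompt) with a differentiable reward term, as the abstract describes. All object names and methods (`nerf.render`, `diffusion.add_noise`, `diffusion.predict_noise`, `diffusion.sds_weight`, `reward_model`) are illustrative assumptions, not the authors' code or any specific library API.

```python
import torch

def vp3d_step(nerf, diffusion, reward_model, text_emb, visual_prompt,
              camera, optimizer, reward_weight=0.1):
    """One hypothetical VP3D-style update: SDS loss + differentiable reward."""
    optimizer.zero_grad()

    # Render the current 3D model from a sampled viewpoint.
    image = nerf.render(camera)  # (1, 3, H, W), values in [0, 1]

    # Score Distillation Sampling, conditioned on the text AND the 2D visual prompt.
    t = torch.randint(20, 980, (1,), device=image.device)      # diffusion timestep
    noise = torch.randn_like(image)
    noisy = diffusion.add_noise(image, noise, t)
    with torch.no_grad():
        pred_noise = diffusion.predict_noise(noisy, t, text_emb, visual_prompt)
    w = diffusion.sds_weight(t)                                 # e.g. 1 - alpha_bar_t
    grad = w * (pred_noise - noise)
    # Standard SDS surrogate: its gradient w.r.t. the rendered image equals `grad`.
    sds_loss = (grad.detach() * image).sum()

    # Differentiable reward: visual alignment with the prompt image and
    # semantic alignment with the text, both to be maximized.
    reward = reward_model(image, visual_prompt, text_emb)
    loss = sds_loss - reward_weight * reward

    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the reward term simply enters the loss with a negative sign so that gradient descent increases it; the actual weighting, reward model, and prompt conditioning used by VP3D are described in the paper itself.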

Authors (5)
  1. Yingwei Pan (77 papers)
  2. Haibo Yang (38 papers)
  3. Ting Yao (127 papers)
  4. Tao Mei (209 papers)
  5. Yang Chen (535 papers)
Citations (12)