REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment (2405.18525v1)

Published 28 May 2024 in cs.CV

Abstract: Traditional image-to-3D models often struggle with scenes containing multiple objects due to biases and occlusion complexities. To address this challenge, we present REPARO, a novel approach for compositional 3D asset generation from single images. REPARO employs a two-step process: first, it extracts individual objects from the scene and reconstructs their 3D meshes using off-the-shelf image-to-3D models; then, it optimizes the layout of these meshes through differentiable rendering techniques, ensuring coherent scene composition. By integrating an optimal transport-based long-range appearance loss term and a high-level semantic loss term in the differentiable rendering, REPARO can effectively recover the layout of 3D assets. The proposed method significantly enhances object independence, detail accuracy, and overall scene coherence. Extensive evaluation on multi-object scenes demonstrates that REPARO offers a comprehensive approach to the complexities of multi-object 3D scene generation from single images.


Summary

  • The paper presents a two-step method that first extracts individual objects from the scene for detailed 3D mesh reconstruction.
  • It then optimizes the layout of the reconstructed meshes through differentiable rendering, integrating an optimal transport-based appearance loss and a high-level semantic loss to achieve spatial and semantic coherence.
  • Evaluations demonstrate that REPARO significantly outperforms existing methods at handling occlusions and biases in multi-object 3D scene generation.

The paper "REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment," published in May 2024, introduces a novel method called REPARO to tackle the challenge of generating 3D assets from single images, specifically in scenarios containing multiple objects. Traditional image-to-3D reconstruction methods often fail in such settings due to biases and the complexities arising from occlusions.

REPARO addresses these issues through a two-step compositional process:

  1. Extraction and Mesh Reconstruction: First, individual objects are extracted (segmented) from the input image, and each is reconstructed into a 3D mesh using an existing, off-the-shelf image-to-3D model. Reconstructing objects individually captures each object's geometric details without interference from the others.
  2. Differentiable Layout Optimization: Second, REPARO uses differentiable rendering to optimize the spatial layout of the individual 3D meshes. It integrates an optimal transport-based long-range appearance loss and a high-level semantic loss within the differentiable rendering framework: the appearance loss corrects spatial misalignments between the rendered composition and the input image, while the semantic loss keeps the layout semantically consistent. A minimal sketch of this optimization loop follows the list.
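
To make the layout stage concrete, here is a minimal sketch of such an optimization loop in PyTorch. The per-object parameterization (translation, log-scale, yaw), the loss weighting, and the callbacks render_fn (a differentiable renderer), appearance_loss, and semantic_loss are all illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of differentiable 3D layout alignment (stage 2 of REPARO).
# render_fn, appearance_loss, and semantic_loss are caller-supplied stand-ins;
# the parameterization and loss weighting below are assumptions.
import torch

def optimize_layout(meshes, target, render_fn, appearance_loss, semantic_loss,
                    steps=500, lr=1e-2, sem_weight=0.1):
    n = len(meshes)
    # Per-object layout parameters: 3D translation, log-scale (exp keeps the
    # scale positive without explicit constraints), and a yaw angle.
    t = torch.zeros(n, 3, requires_grad=True)
    log_s = torch.zeros(n, 1, requires_grad=True)
    yaw = torch.zeros(n, 1, requires_grad=True)
    opt = torch.optim.Adam([t, log_s, yaw], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        # Differentiably render the composed scene so gradients flow from the
        # image-space losses back to the layout parameters.
        rendered = render_fn(meshes, t, log_s.exp(), yaw)
        loss = appearance_loss(rendered, target) \
               + sem_weight * semantic_loss(rendered, target)
        loss.backward()
        opt.step()
    return t.detach(), log_s.exp().detach(), yaw.detach()
```

Any differentiable rasterizer can serve as render_fn here; the key design point is that both loss terms are computed in image space, yet their gradients update only the 3D layout parameters.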

The core innovation lies in these two loss terms, which together recover the layout of the 3D assets effectively. The dual-loss integration preserves object independence, enhances detail accuracy, and maintains overall scene coherence.
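
For intuition about the appearance term, the following is a hedged sketch of an entropic optimal transport (Sinkhorn) loss between two images, treating each image as a point cloud of pixels in a joint position-color space. The feature choice, eps, and iteration count are illustrative assumptions, and the paper's exact cost and formulation may differ; note the cost matrix is quadratic in pixel count, so images should be downsampled in practice.

```python
import torch

def pixel_cloud(img):
    """Flatten an (H, W, 3) image in [0, 1] into an (H*W, 5) point cloud of
    normalized (x, y) positions concatenated with (r, g, b) colors."""
    h, w, _ = img.shape
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, h, device=img.device),
        torch.linspace(0, 1, w, device=img.device),
        indexing="ij",
    )
    return torch.cat(
        [xs.reshape(-1, 1), ys.reshape(-1, 1), img.reshape(-1, 3)], dim=1
    )

def sinkhorn_appearance_loss(rendered, target, eps=0.05, iters=50):
    """Entropic OT cost between two pixel clouds via Sinkhorn iterations."""
    a, b = pixel_cloud(rendered), pixel_cloud(target)
    cost = torch.cdist(a, b) ** 2                   # squared Euclidean costs
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    mu = torch.full((a.shape[0],), 1.0 / a.shape[0], device=a.device)
    nu = torch.full((b.shape[0],), 1.0 / b.shape[0], device=b.device)
    u, v = torch.ones_like(mu), torch.ones_like(nu)
    for _ in range(iters):                          # alternate marginal scaling
        u = mu / (K @ v + 1e-9)
        v = nu / (K.T @ u + 1e-9)
    plan = u[:, None] * K * v[None, :]              # approximate transport plan
    return (plan * cost).sum()                      # <plan, cost>
```

Because the transport plan matches pixels across the whole image rather than comparing them position by position, this loss provides the long-range gradient signal needed to move a badly misplaced object toward its correct location.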

The evaluations demonstrate that REPARO significantly improves upon existing methods on multi-object scenes, with notable gains in scene coherence and object detail accuracy, effectively overcoming the bias and occlusion issues common in multi-object 3D scene generation from single images. This comprehensive approach makes REPARO a promising solution for complex 3D asset generation tasks.
