Scalable 3D Captioning with Pretrained Models (2306.07279v2)

Published 12 Jun 2023 in cs.CV

Abstract: We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. The approach applies pretrained models for image captioning, image-text alignment, and LLMs to consolidate captions from multiple views of a 3D asset, completely side-stepping the time-consuming and costly process of manual annotation. We apply Cap3D to the recently introduced large-scale 3D dataset Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human-authored descriptions in terms of quality, cost, and speed. Through effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k collected annotations from the ABO dataset. Finally, we finetune text-to-3D models on Cap3D and human captions, show that the Cap3D-trained models perform better, and benchmark state-of-the-art methods including Point-E, Shap-E, and DreamFusion.
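The abstract describes a three-stage pipeline: caption each rendered view of a 3D asset, use image-text alignment to keep the best caption per view, then have an LLM consolidate the per-view captions into one description. The sketch below illustrates only the data flow of that idea; the `toy_score` and `consolidate_views` functions are simplified stand-ins for the CLIP-based scoring and GPT-4 consolidation used in the paper, not the authors' implementation.

```python
# Minimal sketch of the Cap3D consolidation flow (hypothetical stand-ins,
# not the paper's code): per-view candidate captions -> alignment-based
# selection -> merged object-level description.
from typing import Callable, Dict, List


def select_best_caption(candidates: List[str], score: Callable[[str], float]) -> str:
    """Keep the candidate with the highest alignment score.

    In Cap3D this role is played by a CLIP-style image-text similarity;
    here `score` is any caller-supplied scoring function.
    """
    return max(candidates, key=score)


def consolidate_views(per_view_captions: List[str]) -> str:
    """Merge per-view captions into one description.

    Cap3D prompts an LLM (GPT-4) for this step so that contradictions
    between views are resolved; this toy version simply joins them.
    """
    return "; ".join(per_view_captions)


# Toy example: two rendered views, two candidate captions each.
views: Dict[str, List[str]] = {
    "view_0": ["a chair", "a wooden chair with four legs"],
    "view_1": ["an object", "a chair seen from behind"],
}

toy_score = len  # placeholder scorer: prefers more descriptive captions
best_per_view = [select_best_caption(cands, toy_score) for cands in views.values()]
description = consolidate_views(best_per_view)
print(description)  # "a wooden chair with four legs; a chair seen from behind"
```

The key design point the sketch preserves is that selection happens per view before consolidation, so a single bad caption from one viewpoint cannot dominate the final description.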

Authors (4)
  1. Tiange Luo (13 papers)
  2. Chris Rockwell (9 papers)
  3. Honglak Lee (174 papers)
  4. Justin Johnson (56 papers)
Citations (119)