
Objaverse-XL: A Universe of 10M+ 3D Objects (2307.05663v1)

Published 11 Jul 2023 in cs.CV and cs.AI

Abstract: Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.

An Analysis of Objaverse-XL: A Landmark Dataset for 3D Vision

Introduction

The field of artificial intelligence has experienced significant advances, driven largely by large datasets that enabled breakthrough improvements in language and image models. 3D vision, however, has lagged behind due to the scarcity of comprehensive, high-quality 3D data. "Objaverse-XL: A Universe of 10M+ 3D Objects" addresses this gap with a dataset of over 10 million deduplicated 3D objects drawn from a diverse range of sources, offering unprecedented scale and diversity among 3D datasets and aiming to bring 3D vision research to the level of its 2D counterparts. This analysis covers the dataset's composition, its benefits for current 3D vision research, its applications, and its implications for future work.

Dataset Composition and Sources

Objaverse-XL aggregates 3D assets from a multitude of sources, including GitHub, Thingiverse, Sketchfab, Polycam, and the Smithsonian Institution, spanning manually designed objects as well as data acquired via photogrammetry. It represents a substantial expansion over previous datasets such as Objaverse 1.0 and ShapeNet, offering more than ten times the volume of the former. Each object ships with metadata such as file size, polygon count, and rendered views, giving a comprehensive picture of the dataset's scope.
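As a concrete illustration, per-object metadata of this kind makes it easy to subsample the corpus by source or mesh complexity. The sketch below is hypothetical: it assumes the metadata has been exported to a CSV with `source`, `fileSize`, and `polygonCount` columns (the actual release is distributed via the `objaverse` Python package, whose API and field names may differ).

```python
import pandas as pd

# Hypothetical metadata export; column names are illustrative, not the
# dataset's actual schema.
meta = pd.read_csv("objaverse_xl_metadata.csv")

# Keep moderately complex meshes from photogrammetry-heavy sources.
subset = meta[
    meta["source"].isin(["sketchfab", "polycam", "smithsonian"])
    & meta["polygonCount"].between(10_000, 500_000)
]

print(f"{len(subset):,} of {len(meta):,} objects selected")
```

A filtering pass like this is one plausible way to trade corpus size against per-object quality when pretraining on the full 10M objects is unnecessary.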

Methodology and Experiments

A primary focus of the paper is using Objaverse-XL to improve novel view synthesis, demonstrated by training models such as Zero123 and PixelNeRF on renders from the dataset. Experiments show pronounced gains in zero-shot generalization and scene-understanding tasks when Objaverse-XL serves as the pretraining corpus. For instance, Zero123-XL, the Zero123 architecture trained on Objaverse-XL renders, outperforms earlier versions by generating more accurate and diverse novel views, capitalizing on the dataset's rich variety. Such improvements underscore the potential of Objaverse-XL to enable more sophisticated training paradigms across 3D vision tasks.
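To make the view-synthesis setup concrete: Zero123-style models condition an image diffusion model on the relative camera transform between the input and target views, commonly parameterized in spherical coordinates as offsets in azimuth, elevation, and radius. The sketch below shows only that parameterization; the conditioning embedding and training loop are omitted, and the function name is ours.

```python
import numpy as np

def relative_viewpoint(src, tgt):
    """Offset (d_azimuth, d_elevation, d_radius) between two camera poses
    given as (azimuth_rad, elevation_rad, radius). Zero123-style models
    condition novel-view generation on an embedding of this offset."""
    d_az = (tgt[0] - src[0] + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    return np.array([d_az, tgt[1] - src[1], tgt[2] - src[2]])

# Input view at azimuth 30 deg, elevation 10 deg, radius 1.5;
# target view 90 deg further around the object at the same height.
src = np.array([np.deg2rad(30.0), np.deg2rad(10.0), 1.5])
tgt = np.array([np.deg2rad(120.0), np.deg2rad(10.0), 1.5])
print(relative_viewpoint(src, tgt))  # ~[1.5708, 0.0, 0.0]
```

Training on over 100 million rendered views means the model sees an enormous range of such offsets, which is what drives the zero-shot generalization the paper reports.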

Implications and Applications

The practical implications of Objaverse-XL are substantial, particularly for training and validating 3D models. In robotics, AR/VR, and graphics, access to a dataset of this scale can drive advances in applications requiring realistic 3D simulation. The dataset invites exploration into 3D object generation, reconstruction, and context-aware 3D scene understanding, helping AI systems integrate with real-world applications. Moreover, the improved generalization it confers on models presented with previously unseen visual styles, such as anime drawings or sketches, paves the way for more robust and versatile applications.

Future Directions

While Objaverse-XL sets a new benchmark, future research may push scale further, continuing the shift from handcrafted collections to diverse, web-crawled sources. Exploring selective data utilization, by estimating the quality or relevance of individual 3D objects, could also improve training efficiency. The paper likewise points to the need for continued development of automated deduplication and data-curation techniques given the dataset's scale. On a theoretical front, this work invites a rethinking of the architectural and algorithmic designs that can exploit such massive datasets effectively, potentially foreshadowing new learning paradigms in 3D AI.
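The abstract notes that the released objects are deduplicated across sources. A minimal baseline for that kind of curation is an exact content-hash pass, sketched below; the directory layout is hypothetical, and catching near-duplicates (e.g., re-exports or retextured copies) would need a separate stage, such as comparing embeddings of rendered views.

```python
import hashlib
from pathlib import Path

def dedup_by_content(paths):
    """Drop byte-identical files, keeping the first occurrence.
    Exact duplicates only; near-duplicate detection is out of scope."""
    seen, unique = set(), []
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique

meshes = sorted(Path("objects/").glob("**/*.glb"))  # hypothetical layout
print(f"kept {len(dedup_by_content(meshes))} of {len(meshes)}")
```

At 10M+ objects, even this simple pass would need to be sharded and parallelized, which is precisely the curation-at-scale problem the section highlights.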

Conclusion

"Objaverse-XL: A Universe of 10M+ 3D Objects" represents a significant leap forward for 3D vision research by providing a massive, diverse dataset, which empowers advanced AI models to perform complex 3D tasks with improved generalizability and versatility. The breadth of Objaverse-XL not only fuels progress in existing applications but opens avenues for new innovations in technology and AI. Given the dataset's potential to reshape 3D vision, its impact will likely reverberate across academia and industry, setting the stage for a new era in 3D understanding and applications.

References (67)
  1. URL https://commoncrawl.org/the-data/.
  2. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, pages 1–16, 2016.
  3. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  4. L. Biewald. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com.
  5. Blender Online Community. Blender - a 3d modelling and rendering package. https://www.blender.org, 2023.
  6. D3: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 2011.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  8. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer, 2020.
  9. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  10. Learning to predict 3d objects with an interpolation-based differentiable renderer. Advances in neural information processing systems, 32, 2019.
  11. Masked-attention mask transformer for universal image segmentation. 2022.
  12. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, pages 628–644. Springer, 2016.
  13. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21126–21136, 2022.
  14. Objaverse: A universe of annotated 3d objects. arXiv preprint arXiv:2212.08051, 2022.
15. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  16. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
  17. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.
  18. W. Falcon and The PyTorch Lightning team. PyTorch Lightning, Mar. 2019. URL https://github.com/Lightning-AI/lightning.
  19. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, 129:3313–3337, 2021.
  20. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
  21. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
  22. Mesh r-cnn. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9785–9795, 2019.
  23. Array programming with NumPy. Nature, 585(7825):357–362, Sept. 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.
  24. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  25. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  26. J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.
  27. Putting nerf on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5885–5894, 2021.
  28. H. Jun and A. Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  29. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  30. Neural 3d mesh renderer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3907–3916, 2018.
  31. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  32. Parsing ikea objects: Fine pose estimation. In Proceedings of the IEEE international conference on computer vision, pages 2992–2999, 2013.
  33. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.
  34. R. Liu and C. Vondrick. Humans as light bulbs: 3d human reconstruction from thermal reflection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12531–12542, 2023.
  35. Shadows shed light on 3d objects. arXiv preprint arXiv:2206.08990, 2022.
  36. Zero-1-to-3: Zero-shot one image to 3d object, 2023.
  37. Unified-io: A unified model for vision, language, and multi-modal tasks. ArXiv, abs/2206.08916, 2022.
  38. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4460–4470, 2019.
  39. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  40. G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
  41. Egad! an evolved grasping analysis dataset for diversity and reproducibility in robotic manipulation. IEEE Robotics and Automation Letters, 5(3):4368–4375, 2020.
  42. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  43. OpenAI. Gpt-4 technical report. arXiv, 2023.
  44. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
45. The pandas development team. pandas-dev/pandas: Pandas, Feb. 2020. URL https://doi.org/10.5281/zenodo.3509134.
  46. Photoshape: Photorealistic materials for large-scale shape collections. arXiv preprint arXiv:1809.09761, 2018.
  47. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  48. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  49. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  50. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  51. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  52. Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501, 2020.
  53. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
  54. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  55. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
  56. J. Tang. Stable-dreamfusion: Text-to-3d with stable-diffusion, 2022. https://github.com/ashawkey/stable-dreamfusion.
  57. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  58. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023.
  59. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proceedings of the European conference on computer vision (ECCV), pages 52–67, 2018.
  60. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
  61. M. L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. doi: 10.21105/joss.03021. URL https://doi.org/10.21105/joss.03021.
  62. Multiview compressive coding for 3d reconstruction. arXiv preprint arXiv:2301.08247, 2023a.
  63. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.
  64. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
  65. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  66. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.
  67. Q. Zhou and A. Jacobson. Thingi10k: A dataset of 10,000 3d-printing models. arXiv preprint arXiv:1605.04797, 2016.
Authors (17)
  1. Matt Deitke (11 papers)
  2. Ruoshi Liu (17 papers)
  3. Matthew Wallingford (13 papers)
  4. Huong Ngo (2 papers)
  5. Oscar Michel (8 papers)
  6. Aditya Kusupati (28 papers)
  7. Alan Fan (4 papers)
  8. Christian Laforte (3 papers)
  9. Vikram Voleti (25 papers)
  10. Samir Yitzhak Gadre (12 papers)
  11. Eli VanderBilt (10 papers)
  12. Aniruddha Kembhavi (79 papers)
  13. Carl Vondrick (93 papers)
  14. Georgia Gkioxari (39 papers)
  15. Kiana Ehsani (31 papers)
  16. Ludwig Schmidt (80 papers)
  17. Ali Farhadi (138 papers)
Citations (268)