DreamUp3D: Object-Centric Generative Models for Single-View 3D Scene Understanding and Real-to-Sim Transfer (2402.16308v1)

Published 26 Feb 2024 in cs.RO

Abstract: 3D scene understanding for robotic applications exhibits a unique set of requirements, including real-time inference, object-centric latent representation learning, accurate 6D pose estimation, and 3D reconstruction of objects. Current methods for scene understanding typically rely on a combination of trained models paired with either an explicit or learnt volumetric representation, all of which have their own drawbacks and limitations. We introduce DreamUp3D, a novel Object-Centric Generative Model (OCGM) designed explicitly to perform inference on a 3D scene informed only by a single RGB-D image. DreamUp3D is a self-supervised model, trained end-to-end, that is capable of segmenting objects, providing 3D object reconstructions, generating object-centric latent representations, and producing accurate per-object 6D pose estimates. We compare DreamUp3D to baselines including NeRFs, pre-trained CLIP features, ObSurf, and ObPose on a range of tasks including 3D scene reconstruction, object matching, and object pose estimation. Our experiments show that our model outperforms all baselines by a significant margin in real-world scenarios, demonstrating its applicability to 3D scene understanding tasks while meeting the strict demands of robotics applications.
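
To make the interface the abstract describes concrete, here is a minimal sketch of the single-view, object-centric input/output structure: one RGB-D frame in; per-object segmentation masks, latent codes, 6D poses, and 3D reconstructions out. All class and function names below are illustrative assumptions, not DreamUp3D's published code or API.

```python
# Illustrative sketch only: names and shapes are assumptions, not the
# authors' released interface. It pins down the I/O structure the abstract
# describes for a single-view, object-centric generative model (OCGM).
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class ObjectHypothesis:
    mask: np.ndarray    # (H, W) segmentation mask for one object
    latent: np.ndarray  # (D,) object-centric latent representation
    pose: np.ndarray    # (4, 4) homogeneous transform, the 6D pose estimate
    points: np.ndarray  # (N, 3) reconstructed 3D surface points


def infer_scene(rgb: np.ndarray, depth: np.ndarray) -> List[ObjectHypothesis]:
    """One RGB-D frame in, a set of per-object hypotheses out.

    A trained model would populate the fields above; this stub only
    documents the assumed interface and raises if called.
    """
    raise NotImplementedError("placeholder: requires a trained model")
```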

References (41)
  1. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European Conference on Computer Vision. Springer, 2020, pp. 405–421.
  2. K. Gao, Y. Gao, H. He, D. Lu, L. Xu, and J. Li, “Nerf: Neural radiance field in 3d vision, a comprehensive review,” arXiv preprint arXiv:2210.00379, 2022.
  3. J. Kerr, L. Fu, H. Huang, Y. Avigal, M. Tancik, J. Ichnowski, A. Kanazawa, and K. Goldberg, “Evo-nerf: Evolving nerf for sequential robot grasping of transparent objects,” in 6th Annual Conference on Robot Learning, 2022.
  4. J. Liu, Q. Nie, Y. Liu, and C. Wang, “Nerf-loc: Visual localization with conditional neural radiance field,” in IEEE International Conference on Robotics and Automation, ICRA 2023. IEEE, 2023, pp. 9385–9392.
  5. S. Zhong, A. Albini, O. P. Jones, P. Maiolino, and I. Posner, “Touching a nerf: Leveraging neural radiance fields for tactile sensory data generation,” in Proceedings of The 6th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, K. Liu, D. Kulic, and J. Ichnowski, Eds., vol. 205. PMLR, 14–18 Dec 2023, pp. 1618–1628. [Online]. Available: https://proceedings.mlr.press/v205/zhong23a.html
  6. J. Abou-Chakra, F. Dayoub, and N. Sünderhauf, “Implicit object mapping with noisy data,” arXiv preprint arXiv:2204.10516, 2022.
  7. T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics (ToG), vol. 41, no. 4, pp. 1–15, 2022.
  8. B. Yang, Y. Zhang, Y. Xu, Y. Li, H. Zhou, H. Bao, G. Zhang, and Z. Cui, “Learning object-compositional neural radiance field for editable scene rendering,” in International Conference on Computer Vision (ICCV), October 2021.
  9. F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf, “Object-centric learning with slot attention,” arXiv preprint arXiv:2006.15055, 2020.
  10. M. Engelcke, O. P. Jones, and I. Posner, “Genesis-v2: Inferring unordered object representations without iterative refinement,” in Neural Information Processing Systems, vol. 34, 2021, pp. 8085–8094.
  11. Z. Lin, Y. Wu, S. V. Peri, W. Sun, G. Singh, F. Deng, J. Jiang, and S. Ahn, “SPACE: unsupervised object-oriented scene representation via spatial attention and decomposition,” in 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020.
  12. H. Yu, L. J. Guibas, and J. Wu, “Unsupervised discovery of object radiance fields,” in The Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net, 2022.
  13. K. Stelzner, K. Kersting, and A. R. Kosiorek, “Decomposing 3d scenes into objects via unsupervised volume segmentation,” arXiv preprint arXiv:2104.01148, 2021.
  14. Y. Wu, O. P. Jones, and I. Posner, “Obpose: Leveraging canonical pose for object-centric scene inference in 3d,” arXiv preprint arXiv:2206.03591, 2022.
  15. K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger, “GRAF: Generative radiance fields for 3D-aware image synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 20154–20166, 2020.
  16. M. Niemeyer and A. Geiger, “Giraffe: Representing scenes as compositional generative neural feature fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11453–11464.
  17. A. Zhou, M. J. Kim, L. Wang, P. Florence, and C. Finn, “Nerf in the palm of your hand: Corrective augmentation for robotics via novel-view synthesis,” arXiv preprint arXiv:2301.08556, 2023.
  18. E. Sucar, S. Liu, J. Ortiz, and A. J. Davison, “imap: Implicit mapping and positioning in real-time,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6229–6238.
  19. J. Ichnowski, Y. Avigal, J. Kerr, and K. Goldberg, “Dex-nerf: Using a neural radiance field to grasp transparent objects,” in Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 164. PMLR, 2021, pp. 526–536.
  20. Q. Dai, Y. Zhu, Y. Geng, C. Ruan, J. Zhang, and H. Wang, “Graspnerf: Multiview-based 6-dof grasp detection for transparent and specular objects using generalizable nerf,” arXiv preprint arXiv:2210.06575, 2022.
  21. V. Blukis, T. Lee, J. Tremblay, B. Wen, I. S. Kweon, K.-J. Yoon, D. Fox, and S. Birchfield, “Neural fields for robotic object manipulation from a single image,” arXiv preprint arXiv:2210.12126, 2022.
  22. C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. M. Botvinick, and A. Lerchner, “Monet: Unsupervised scene decomposition and representation,” arXiv preprint arXiv:1901.11390, 2019.
  23. M. Engelcke, A. R. Kosiorek, O. P. Jones, and I. Posner, “GENESIS: generative scene inference and sampling with object-centric latent representations,” in 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020.
  24. Y. Wu, O. P. Jones, M. Engelcke, and I. Posner, “Apex: Unsupervised, object-centric scene segmentation and tracking for robot manipulation,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 3375–3382.
  25. J. Yamada, J. Collins, and I. Posner, “Efficient skill acquisition for complex manipulation tasks in obstructed environments,” arXiv preprint arXiv:2303.03365, 2023.
  26. B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar, “The YCB object and model set: Towards common benchmarks for manipulation research,” in 2015 International Conference on Advanced Robotics (ICAR). IEEE, 2015, pp. 510–517.
  27. J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2901–2910.
  28. M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, vol. 96, no. 34, 1996, pp. 226–231.
  29. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
  30. H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6411–6420.
  31. E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis, et al., “Efficient geometry-aware 3d generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16123–16133.
  32. M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, 1981.
  33. A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa, “Plenoctrees for real-time rendering of neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5752–5761.
  34. M. Tancik, E. Weber, E. Ng, R. Li, B. Yi, J. Kerr, T. Wang, A. Kristoffersen, J. Austin, K. Salahi, A. Ahuja, D. McAllister, and A. Kanazawa, “Nerfstudio: A modular framework for neural radiance field development,” in ACM SIGGRAPH 2023 Conference Proceedings, ser. SIGGRAPH ’23, 2023.
  35. W. Goodwin, S. Vaze, I. Havoutis, and I. Posner, “Semantically grounded object matching for robust robotic scene rearrangement,” in 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 11138–11144.
  36. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, 2021.
  37. Y. Xiang, C. Xie, A. Mousavian, and D. Fox, “Learning rgb-d feature embeddings for unseen object instance segmentation,” in Conference on Robot Learning. PMLR, 2021, pp. 461–470.
  38. D. Morrison, J. Leitner, and P. Corke, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” in Robotics: Science and Systems XIV, 2018.
  39. F. Tosi, F. Aleotti, P. Z. Ramirez, M. Poggi, S. Salti, L. D. Stefano, and S. Mattoccia, “Distilled semantics for comprehensive scene understanding from videos,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4654–4665.
  40. Y. You, Y. Lou, C. Li, Z. Cheng, L. Li, L. Ma, C. Lu, and W. Wang, “Keypointnet: A large-scale 3d keypoint dataset aggregated from numerous human annotations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13647–13656.
  41. M. Tatarchenko, S. R. Richter, R. Ranftl, Z. Li, V. Koltun, and T. Brox, “What do single-view 3d reconstruction networks learn?” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3405–3414.
Authors (5)
  1. Yizhe Wu (5 papers)
  2. Haitz Sáez de Ocáriz Borde (26 papers)
  3. Jack Collins (19 papers)
  4. Oiwi Parker Jones (24 papers)
  5. Ingmar Posner (77 papers)
