
OV9D: Open-Vocabulary Category-Level 9D Object Pose and Size Estimation (2403.12396v1)

Published 19 Mar 2024 in cs.CV and cs.RO

Abstract: This paper studies a new open-set problem, the open-vocabulary category-level object pose and size estimation. Given human text descriptions of arbitrary novel object categories, the robot agent seeks to predict the position, orientation, and size of the target object in the observed scene image. To enable such generalizability, we first introduce OO3D-9D, a large-scale photorealistic dataset for this task. Derived from OmniObject3D, OO3D-9D is the largest and most diverse dataset in the field of category-level object pose and size estimation. It includes additional annotations for the symmetry axis of each category, which help resolve symmetric ambiguity. Apart from the large-scale dataset, we find another key to enabling such generalizability is leveraging the strong prior knowledge in pre-trained visual-language foundation models. We then propose a framework built on pre-trained DinoV2 and text-to-image stable diffusion models to infer the normalized object coordinate space (NOCS) maps of the target instances. This framework fully leverages the visual semantic prior from DinoV2 and the aligned visual and language knowledge within the text-to-image diffusion model, which enables generalization to various text descriptions of novel categories. Comprehensive quantitative and qualitative experiments demonstrate that the proposed open-vocabulary method, trained on our large-scale synthesized data, significantly outperforms the baseline and can effectively generalize to real-world images of unseen categories. The project page is at https://ov9d.github.io.
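The abstract does not say how position, orientation, and size are read off a predicted NOCS map. A common approach in NOCS-based pipelines is to fit a similarity transform between the predicted NOCS coordinates and the corresponding camera-space 3D points using Umeyama's least-squares alignment. The following is a minimal sketch of that step, not the authors' implementation. It assumes NOCS coordinates centered at the origin (roughly in [-0.5, 0.5]) and camera-space points already back-projected from depth; the function names `umeyama_similarity` and `pose_and_size_from_nocs` are hypothetical.

```python
import numpy as np


def umeyama_similarity(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ≈ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between destination and source.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard so that R is a proper rotation (det(R) = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    scale = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - scale * R @ mu_src
    return scale, R, t


def pose_and_size_from_nocs(nocs_points, camera_points):
    """Recover rotation (3), translation (3), and metric size (3) of one object.

    nocs_points:   (N, 3) predicted NOCS coordinates for the object's pixels
                   (assumed centered at the origin).
    camera_points: (N, 3) matching 3D points in the camera frame.
    """
    nocs_points = np.asarray(nocs_points, dtype=float)
    scale, R, t = umeyama_similarity(nocs_points, camera_points)
    # Assumption: metric size is the NOCS-space extent scaled to camera units.
    extents = nocs_points.max(axis=0) - nocs_points.min(axis=0)
    size = scale * extents
    return R, t, size
```

In practice the predicted NOCS values are noisy, so the alignment is often wrapped in a RANSAC loop over random point subsets; that is omitted here to keep the sketch short.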
