Learning to Estimate 6DoF Pose from Limited Data: A Few-Shot, Generalizable Approach using RGB Images (2306.07598v1)
Abstract: The accurate estimation of six degrees-of-freedom (6DoF) object poses is essential for many applications in robotics and augmented reality. However, existing methods for 6DoF pose estimation often depend on CAD templates or dense support views, restricting their usefulness in realworld situations. In this study, we present a new cascade framework named Cas6D for few-shot 6DoF pose estimation that is generalizable and uses only RGB images. To address the false positives of target object detection in the extreme few-shot setting, our framework utilizes a selfsupervised pre-trained ViT to learn robust feature representations. Then, we initialize the nearest top-K pose candidates based on similarity score and refine the initial poses using feature pyramids to formulate and update the cascade warped feature volume, which encodes context at increasingly finer scales. By discretizing the pose search range using multiple pose bins and progressively narrowing the pose search range in each stage using predictions from the previous stage, Cas6D can overcome the large gap between pose candidates and ground truth poses, which is a common failure mode in sparse-view scenarios. Experimental results on the LINEMOD and GenMOP datasets demonstrate that Cas6D outperforms state-of-the-art methods by 9.2% and 3.8% accuracy (Proj-5) under the 32-shot setting compared to OnePose++ and Gen6D.
- Self-supervised learning for domain adaptation on point clouds. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 123–133, 2021.
- Learning representations and generative models for 3d point clouds. In International conference on machine learning, pages 40–49. PMLR, 2018.
- Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7822–7831, 2021.
- Deep vit features as dense visual descriptors. arXiv preprint arXiv:2112.05814, 2(3):4, 2021.
- D3feat: Joint learning of dense detection and description of 3d local features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6359–6367, 2020.
- Pose guided rgbd feature learning for 3d object pose estimation. In Proceedings of the IEEE international conference on computer vision, pages 3856–3864, 2017.
- Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), 2021.
- Unsupervised learning by predicting noise. In International Conference on Machine Learning, pages 517–526. PMLR, 2017.
- I like to move it: 6d pose estimation as an action decision process. arXiv preprint arXiv:2009.12678, 2020.
- Mvsformer: Learning robust image representations via transformers and temperature-based depth for multi-view stereo. arXiv preprint arXiv:2208.02541, 2022.
- Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV), pages 132–149, 2018.
- Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
- Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
- Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2781–2790, 2022.
- Point-based multi-view stereo network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1538–1547, 2019.
- Fs-net: Fast shape-based network for category-level 6d object pose estimation with decoupled rotation mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1581–1590, 2021.
- Category level object pose estimation via neural analysis-by-synthesis. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, pages 139–156. Springer, 2020.
- Deep stereo using adaptive thin volume representation with uncertainty awareness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2524–2534, 2020.
- Fully convolutional geometric features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8958–8966, 2019.
- 3dposelite: a compact 3d pose estimation using node embeddings. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1878–1887, 2021.
- Ppfnet: Global context aware local features for robust 3d point matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 195–205, 2018.
- icaps: Iterative category-level object pose and shape estimation. IEEE Robotics and Automation Letters, 7(2):1784–1791, 2022.
- So-pose: Exploiting self-occlusion for direct 6d pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12396–12405, 2021.
- Gpv-pose: Category-level object pose estimation via geometry-guided point-wise voting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6781–6791, 2022.
- Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pages 1422–1430, 2015.
- Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
- Multiresolution tree networks for 3d point cloud processing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 103–118, 2018.
- Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
- The perfect match: 3d point cloud matching with smoothed densities. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5545–5554, 2019.
- Zero-shot category-level object pose estimation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIX, pages 516–532. Springer, 2022.
- A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 216–224, 2018.
- Cascade cost volume for high-resolution multi-view stereo and stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Unsupervised multi-task feature learning on point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8160–8171, 2019.
- Onepose++: Keypoint-free one-shot object pose estimation without cad models. arXiv preprint arXiv:2301.07673, 2023.
- Fs6d: Few-shot 6d pose estimation of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6814–6824, 2022.
- Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In 2011 international conference on computer vision, pages 858–865. IEEE, 2011.
- Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Computer Vision–ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I 11, pages 548–562. Springer, 2013.
- Going further with point pair features. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 834–848. Springer, 2016.
- Epos: Estimating 6d pose of objects with symmetries. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11703–11712, 2020.
- Bop: Benchmark for 6d object pose estimation. In Proceedings of the European conference on computer vision (ECCV), pages 19–34, 2018.
- Bop challenge 2020 on 6d object localization. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 577–594. Springer, 2020.
- Single-stage 6d object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2930–2939, 2020.
- Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9865–9874, 2019.
- Self-supervised modal and view invariant feature learning. arXiv preprint arXiv:2005.14169, 2020.
- Ssd-6d: Making rgb-based 3d detection and 6d pose estimation great again. In Proceedings of the IEEE international conference on computer vision, pages 1521–1529, 2017.
- Cosypose: Consistent multi-view multi-object 6d pose estimation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 574–591. Springer, 2020.
- So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9397–9406, 2018.
- Usip: Unsupervised stable interest point detection from 3d point clouds. In Proceedings of the IEEE/CVF international conference on computer vision, pages 361–370, 2019.
- Dual-resolution correspondence networks. Advances in Neural Information Processing Systems, 33:17346–17357, 2020.
- Deepim: Deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 683–698, 2018.
- Cdpn: Coordinates-based disentangled pose network for real-time rgb-based 6-dof object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7678–7687, 2019.
- Sparse steerable convolutions: An efficient learning of se (3)-equivariant features for estimation and tracking of object poses in 3d space. Advances in Neural Information Processing Systems, 34:16779–16790, 2021.
- Dualposenet: Category-level 6d object pose and size estimation using dual pose network with refined learning of pose consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3560–3569, 2021.
- Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
- Parallel inversion of neural radiance fields for robust pose estimation. arXiv preprint arXiv:2210.10108, 2022.
- Sift flow: Dense correspondence across scenes and its applications. IEEE transactions on pattern analysis and machine intelligence, 33(5):978–994, 2010.
- Keypose: Multi-view 3d labeling and keypoint estimation for transparent objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11602–11610, 2020.
- Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXII, pages 298–315. Springer, 2022.
- David G Lowe. Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on computer vision, volume 2, pages 1150–1157. Ieee, 1999.
- David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004.
- Geodesc: Learning local descriptors by integrating geometry constraints. In Proceedings of the European conference on computer vision (ECCV), pages 168–183, 2018.
- Aslfeat: Learning local features of accurate shape and localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6589–6598, 2020.
- Stacked convolutional auto-encoders for hierarchical feature extraction. In Artificial Neural Networks and Machine Learning–ICANN 2011: 21st International Conference on Artificial Neural Networks, Espoo, Finland, June 14-17, 2011, Proceedings, Part I 21, pages 52–59. Springer, 2011.
- Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In International conference on machine learning, pages 2391–2400. PMLR, 2017.
- Nerf: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pages 405–421. Springer, 2020.
- Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
- Meta networks. In International conference on machine learning, pages 2554–2563. PMLR, 2017.
- Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5480–5490, 2022.
- Unsupervised learning of visual representations by solving jigsaw puzzles. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI, pages 69–84. Springer, 2016.
- Zephyr: Zero-shot pose hypothesis rating. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 14141–14148. IEEE, 2021.
- Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
- Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10710–10719, 2020.
- Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
- Pvnet: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4561–4570, 2019.
- 3d object detection and pose estimation of unseen objects in color images with local surface embeddings. In Proceedings of the Asian Conference on Computer Vision, 2020.
- Distinctive 3d local deep descriptors. In 2020 25th International conference on pattern recognition (ICPR), pages 5720–5727. IEEE, 2021.
- A review of point cloud registration algorithms for mobile robotics. Foundations and Trends® in Robotics, 4(1):1–104, 2015.
- Focal length and object pose estimation via render and compare. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3825–3834, 2022.
- Unsupervised learning of invariant feature hierarchies with applications to object recognition. In 2007 IEEE conference on computer vision and pattern recognition, pages 1–8. IEEE, 2007.
- Optimization as a model for few-shot learning. In International conference on learning representations, 2017.
- Neighbourhood consensus networks. Advances in neural information processing systems, 31, 2018.
- Orb: An efficient alternative to sift or surf. In 2011 International conference on computer vision, pages 2564–2571. Ieee, 2011.
- Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020.
- Self-supervised deep learning on point clouds by reconstructing space. Advances in Neural Information Processing Systems, 32, 2019.
- Jianbo Shi et al. Good features to track. In 1994 Proceedings of IEEE conference on computer vision and pattern recognition, pages 593–600. IEEE, 1994.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- James Christopher Sims. An implementation of Deep Belief Networks using restricted Boltzmann machines in Clojure. University of Rhode Island, 2016.
- Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
- Hybridpose: 6d object pose estimation under hybrid representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 431–440, 2020.
- Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6738–6748, 2022.
- Loftr: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8922–8931, 2021.
- Onepose: One-shot object pose estimation without cad models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6825–6834, 2022.
- Multi-path learning for object pose estimation across domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13916–13925, 2020.
- Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the european conference on computer vision (ECCV), pages 699–715, 2018.
- Real-time seamless single shot 6d object pose prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 292–301, 2018.
- Shape prior deformation for categorical 6d object pose and size estimation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 530–546. Springer, 2020.
- Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.
- Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2642–2651, 2019.
- Ibrnet: Learning multi-view image-based rendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- Unsupervised learning of visual representations using videos. In Proceedings of the IEEE international conference on computer vision, pages 2794–2802, 2015.
- Disentangled implicit shape and pose learning for scalable 6d pose estimation. arXiv preprint arXiv:2107.12549, 2021.
- Edge enhanced implicit orientation learning with geometric prior for 6d pose estimation. IEEE Robotics and Automation Letters, 5(3):4931–4938, 2020.
- Learning descriptors for object recognition and 3d pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3109–3118, 2015.
- Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
- Pose from shape: Deep pose estimation for arbitrary 3d objects. arXiv preprint arXiv:1906.05105, 2019.
- Cost volume pyramid based depth inference for multi-view stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4877–4886, 2020.
- inerf: Inverting neural radiance fields for pose estimation. In IEEE International Conference on Intelligent Robots and Systems (IROS), 2020.
- inerf: Inverting neural radiance fields for pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1323–1330. IEEE, 2021.
- Dpod: 6d pose object detector and refiner. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1941–1950, 2019.