Diffusion-Driven Self-Supervised Learning for Shape Reconstruction and Pose Estimation (2403.12728v1)
Abstract: Fully-supervised category-level pose estimation aims to determine the 6-DoF poses of unseen instances from known categories, but it requires expensive manual labeling. Recently, various self-supervised category-level pose estimation methods have been proposed to reduce the dependence on annotated datasets. However, most of these methods rely on synthetic data or 3D CAD models for self-supervised training, and they are typically limited to single-object pose problems, without addressing multi-object scenarios or shape reconstruction. To overcome these limitations, we introduce a diffusion-driven self-supervised network for multi-object shape reconstruction and categorical pose estimation, leveraging only shape priors. Specifically, to capture SE(3)-equivariant pose features and 3D scale-invariant shape information, we present a Prior-Aware Pyramid 3D Point Transformer. This module adopts a point convolutional layer with radial kernels for pose-aware learning and a 3D scale-invariant graph convolution layer for object-level shape representation. Furthermore, we introduce a pretrain-to-refine self-supervised training paradigm. It enables the proposed network to capture the associations between shape priors and observations, addressing intra-class shape variations through the diffusion mechanism. Extensive experiments conducted on four public datasets and a self-built dataset demonstrate that our method significantly outperforms state-of-the-art self-supervised category-level baselines and even surpasses some fully-supervised instance-level and category-level methods.
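The abstract gives no implementation details, but the "diffusion mechanism" it names follows the standard DDPM pattern (Ho et al., 2020): a fixed forward process that progressively noises a point cloud, paired with a learned denoiser that is conditioned, in this setting, on the categorical shape prior. The sketch below illustrates only that generic pattern on point clouds; every name in it (`PriorConditionedDenoiser`, `q_sample`, the pooled prior code) is a hypothetical stand-in, not the paper's Prior-Aware Pyramid 3D Point Transformer or released code.

```python
# Minimal sketch of a prior-conditioned point-cloud diffusion step,
# assuming a standard DDPM epsilon-prediction objective.
import torch
import torch.nn as nn

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule (Ho et al., 2020)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bar[t].view(-1, 1, 1)       # (B,1,1), broadcasts over (B,N,3)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

class PriorConditionedDenoiser(nn.Module):
    """Toy epsilon-predictor conditioned on a categorical shape prior.

    A stand-in for the paper's transformer: the prior is max-pooled into a
    global code and concatenated to every point together with a timestep
    embedding.
    """
    def __init__(self, hidden=128):
        super().__init__()
        self.prior_enc = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))
        self.time_emb = nn.Embedding(T, hidden)
        self.eps_head = nn.Sequential(nn.Linear(3 + 2 * hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 3))

    def forward(self, x_t, t, prior):
        g = self.prior_enc(prior).max(dim=1).values   # (B,H) global prior code
        te = self.time_emb(t)                         # (B,H) timestep embedding
        cond = torch.cat([g, te], dim=-1)             # (B,2H)
        cond = cond.unsqueeze(1).expand(-1, x_t.shape[1], -1)
        return self.eps_head(torch.cat([x_t, cond], dim=-1))  # predicted noise

# One self-supervised training step: denoise a noised observation given the prior.
model = PriorConditionedDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

obs = torch.randn(4, 1024, 3)      # observed instance points (placeholder data)
prior = torch.randn(4, 1024, 3)    # categorical mean-shape prior (placeholder data)
t = torch.randint(0, T, (4,))
noise = torch.randn_like(obs)

opt.zero_grad()
pred = model(q_sample(obs, t, noise), t, prior)
loss = nn.functional.mse_loss(pred, noise)  # standard eps-prediction objective
loss.backward()
opt.step()
```

At inference, such a denoiser would be applied iteratively, starting from Gaussian noise and conditioned on the prior, to recover instance geometry despite intra-class shape variation; the paper's actual pretrain-to-refine pipeline may differ in both conditioning and architecture.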
Authors: Jingtao Sun, Yaonan Wang, Mingtao Feng, Chao Ding, Mike Zheng Shou, Ajmal Saeed Mian