
Learning a Category-level Object Pose Estimator without Pose Annotations (2404.05626v1)

Published 8 Apr 2024 in cs.CV

Abstract: 3D object pose estimation is a challenging task. Previous works always require thousands of object images with annotated poses to learn the 3D pose correspondence, and labeling them is laborious and time-consuming. In this paper, we propose to learn a category-level 3D object pose estimator without pose annotations. Instead of using manually annotated images, we leverage diffusion models (e.g., Zero-1-to-3) to generate a set of images under controlled pose differences and learn our object pose estimator from those images. Directly using the original diffusion model leads to images with noisy poses and artifacts. To tackle this issue, we first exploit an image encoder, trained with a specially designed contrastive pose learning scheme, to filter out unreasonable details and extract image feature maps. Additionally, we propose a novel learning strategy that allows the model to learn object poses from those generated image sets without knowing the alignment of their canonical poses. Experimental results show that our method is capable of category-level object pose estimation in a single-shot setting (as the pose definition), while significantly outperforming other state-of-the-art methods on few-shot category-level object pose estimation benchmarks.
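
The abstract's point about learning from generated image sets "without knowing the alignment of their canonical poses" rests on a simple geometric fact: if every view in a set shares the same unknown canonical offset, the relative rotation between two views does not depend on that offset. The sketch below only illustrates that fact with NumPy; it is not the paper's training procedure, and the helper names (`rot_x`, `rot_y`, `relative_rotation`, `geodesic_deg`) and the specific angles are assumptions made for illustration.

```python
import numpy as np

def rot_y(deg):
    """Rotation about the y-axis by `deg` degrees (an azimuth change)."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_x(deg):
    """Rotation about the x-axis by `deg` degrees (an elevation change)."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def relative_rotation(R_a, R_b):
    """Rotation taking view b's pose to view a's pose."""
    return R_a @ R_b.T

def geodesic_deg(R_a, R_b):
    """Angle in degrees of the rotation separating R_a and R_b."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return np.rad2deg(np.arccos(np.clip(cos, -1.0, 1.0)))

# Controlled pose differences applied to a seed image (hypothetical values).
deltas = [rot_y(0), rot_y(30), rot_y(60)]

# Two image sets whose canonical offsets are unknown and different.
offset_a = rot_x(20) @ rot_y(75)
offset_b = rot_y(-40)
views_a = [d @ offset_a for d in deltas]
views_b = [d @ offset_b for d in deltas]

# The relative rotation between views 2 and 1 is the same in both sets,
# because the shared offset cancels: (d2 R0)(d1 R0)^T = d2 d1^T.
rel_a = relative_rotation(views_a[2], views_a[1])
rel_b = relative_rotation(views_b[2], views_b[1])
print(np.allclose(rel_a, rel_b))
print(round(geodesic_deg(rel_a, np.eye(3)), 3))
```

Because the offset cancels, supervision expressed through controlled pose differences remains well defined even when each generated set starts from an arbitrary, unlabeled seed pose.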

