PViT-6D: Overclocking Vision Transformers for 6D Pose Estimation with Confidence-Level Prediction and Pose Tokens (2311.17504v1)

Published 29 Nov 2023 in cs.CV

Abstract: In 6D pose estimation, the top-performing techniques currently depend on complex intermediate correspondences, specialized architectures, and non-end-to-end algorithms. In contrast, our research reframes the problem as a straightforward regression task by exploring the capabilities of Vision Transformers for direct 6D pose estimation through a tailored use of classification tokens. We also introduce a simple method for determining pose confidence, which can be readily integrated into most 6D pose estimation frameworks. This involves modifying the transformer architecture by decreasing the number of query elements based on the network's assessment of the scene complexity. Our method, which we call Pose Vision Transformer (PViT-6D), is simple to implement and end-to-end learnable, and it outperforms current state-of-the-art methods by +0.3% ADD(-S) on Linemod-Occlusion and +2.7% ADD(-S) on the YCB-V dataset. Moreover, our method enhances both the model's interpretability and the reliability of its performance during inference.
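
The abstract reports gains in ADD(-S) without defining the metric. As a rough illustration, the sketch below follows the commonly used definition: ADD averages the distance between corresponding model points under the ground-truth and predicted poses, while ADD-S (used for symmetric objects) averages the distance from each ground-truth point to its nearest predicted point. The function names and the 10% diameter threshold are assumptions for illustration and are not taken from the paper.

```python
import math

def transform(R, t, points):
    """Apply rotation R (3x3 nested list) and translation t (length 3) to 3D points."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_error(R_gt, t_gt, R_pred, t_pred, points):
    """ADD: mean distance between corresponding transformed model points."""
    gt = transform(R_gt, t_gt, points)
    pred = transform(R_pred, t_pred, points)
    return sum(math.dist(a, b) for a, b in zip(gt, pred)) / len(points)

def adds_error(R_gt, t_gt, R_pred, t_pred, points):
    """ADD-S: mean distance from each ground-truth point to the closest predicted point."""
    gt = transform(R_gt, t_gt, points)
    pred = transform(R_pred, t_pred, points)
    return sum(min(math.dist(a, b) for b in pred) for a in gt) / len(points)

def is_correct(error, diameter, fraction=0.1):
    """Assumed convention: a pose counts as correct if the error is below 10% of the object diameter."""
    return error < fraction * diameter

# Hypothetical toy object: four points that are symmetric under a 180-degree turn about z.
MODEL_POINTS = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_180 = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
ZERO_T = [0.0, 0.0, 0.0]

add_value = add_error(IDENTITY, ZERO_T, ROT_Z_180, ZERO_T, MODEL_POINTS)
adds_value = adds_error(IDENTITY, ZERO_T, ROT_Z_180, ZERO_T, MODEL_POINTS)
```

In this toy case the predicted pose differs from the ground truth by a rotation that maps the object onto itself, so ADD penalizes it heavily while ADD-S treats it as a perfect match. This is the reason the symmetric variant is used for symmetric objects.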
