A Graph-Based Approach for Category-Agnostic Pose Estimation (2311.17891v2)

Published 29 Nov 2023 in cs.CV

Abstract: Traditional 2D pose estimation models are limited by their category-specific design, making them suitable only for predefined object categories. This restriction becomes particularly challenging when dealing with novel objects due to the lack of relevant training data. To address this limitation, category-agnostic pose estimation (CAPE) was introduced. CAPE aims to enable keypoint localization for arbitrary object categories with a single few-shot model, requiring only a few support images with annotated keypoints. We present a significant departure from conventional CAPE techniques, which treat keypoints as isolated entities, by treating the input pose data as a graph. We leverage the inherent geometrical relations between keypoints through a graph-based network to break symmetry, preserve structure, and better handle occlusions. We validate our approach on the MP-100 benchmark, a comprehensive dataset comprising over 20,000 images spanning over 100 categories. Our solution boosts performance by 0.98% under a 1-shot setting, achieving a new state-of-the-art for CAPE. Additionally, we enhance the dataset with skeleton annotations. Our code and data are publicly available.
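
To make the idea of "treating the input pose data as a graph" concrete, the following is a minimal sketch, not the authors' network. It assumes keypoints are nodes, skeleton annotations are undirected edges, and a single neighbor-averaging step stands in for a graph layer. The function names (`build_adjacency`, `row_normalize`, `propagate`) and the three-keypoint chain are hypothetical and chosen only for illustration; the abstract does not specify the actual architecture.

```python
def build_adjacency(num_keypoints, edges):
    """Build a symmetric adjacency matrix with self-loops from skeleton edges.

    `edges` is a list of (i, j) keypoint index pairs (hypothetical format).
    """
    adj = [[0.0] * num_keypoints for _ in range(num_keypoints)]
    for i in range(num_keypoints):
        adj[i][i] = 1.0  # self-loop so each keypoint keeps its own feature
    for a, b in edges:
        adj[a][b] = 1.0
        adj[b][a] = 1.0
    return adj


def row_normalize(adj):
    """Divide each row by its sum so propagation averages over neighbors."""
    return [[v / sum(row) for v in row] for row in adj]


def propagate(adj_norm, features):
    """One message-passing step: each keypoint's feature becomes the mean
    of its own feature and those of its skeleton neighbors."""
    dim = len(features[0])
    return [
        [sum(w * features[j][d] for j, w in enumerate(row)) for d in range(dim)]
        for row in adj_norm
    ]


# Toy example: a 3-keypoint chain 0 - 1 - 2 with 2D coordinates as features.
skeleton = [(0, 1), (1, 2)]
coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
adj_norm = row_normalize(build_adjacency(3, skeleton))
smoothed = propagate(adj_norm, coords)
```

In this toy setting, a keypoint with no reliable feature of its own (for example, an occluded one) would still receive information from its skeleton neighbors, which is the intuition behind using structure to handle occlusions. A learned graph layer would additionally apply trainable weights, which this sketch omits.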
