APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond (2312.15612v1)

Published 25 Dec 2023 in cs.CV

Abstract: Animal Pose Estimation and Tracking (APT) is a critical task in detecting and monitoring the keypoints of animals across a series of video frames, which is essential for understanding animal behavior. Past works relating to animals have primarily focused on either animal tracking or single-frame animal pose estimation only, neglecting the integration of both aspects. The absence of comprehensive APT datasets inhibits the progression and evaluation of video-based animal pose estimation and tracking methods, thereby constraining their real-world applications. To fill this gap, we introduce APTv2, the pioneering large-scale benchmark for animal pose estimation and tracking. APTv2 comprises 2,749 video clips filtered and collected from 30 distinct animal species. Each video clip includes 15 frames, culminating in a total of 41,235 frames. Following meticulous manual annotation and stringent verification, we provide high-quality keypoint and tracking annotations for a total of 84,611 animal instances, split into easy and hard subsets based on the number of instances that exist in each frame. With APTv2 as the foundation, we establish a simple baseline method named \posetrackmethodname and provide benchmarks for representative models across three tracks: (1) a single-frame animal pose estimation track to evaluate both intra- and inter-domain transfer learning performance, (2) a low-data transfer and generalization track to evaluate inter-species domain generalization performance, and (3) an animal pose tracking track. Our experimental results deliver key empirical insights, demonstrating that APTv2 serves as a valuable benchmark for animal pose estimation and tracking. It also presents new challenges and opportunities for future research. The code and dataset are released at https://github.com/ViTAE-Transformer/APTv2.
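The dataset statistics quoted in the abstract are internally consistent, which a quick arithmetic sketch confirms (the variable names below are illustrative, not from the paper):

```python
# Consistency check of the APTv2 statistics quoted in the abstract.
num_clips = 2749        # filtered video clips spanning 30 animal species
frames_per_clip = 15    # every clip contains exactly 15 frames
total_frames = num_clips * frames_per_clip

assert total_frames == 41_235  # matches the abstract's frame count

# Average density of annotated animal instances per frame
# (84,611 instances spread over all frames).
avg_instances = 84_611 / total_frames
print(f"{total_frames} frames, {avg_instances:.2f} instances/frame")
```

The roughly two instances per frame on average also explains why the benchmark splits frames into easy and hard subsets by per-frame instance count: multi-animal frames are where tracking association becomes nontrivial.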

