BPJDet: Extended Object Representation for Generic Body-Part Joint Detection (2304.10765v2)
Abstract: Detection of the human body and its parts has been intensively studied. However, most CNN-based detectors are trained independently, making it difficult to associate detected parts with their corresponding bodies. In this paper, we focus on the joint detection of the human body and its parts. Specifically, we propose a novel extended object representation that integrates center offsets of body parts, and construct an end-to-end generic Body-Part Joint Detector (BPJDet). In this way, body-part associations are neatly embedded in a unified representation containing both semantic and geometric content. We can therefore optimize multiple losses to tackle multiple tasks synergistically. Moreover, this representation is suitable for both anchor-based and anchor-free detectors. BPJDet does not suffer from error-prone post-hoc matching and keeps a better trade-off between speed and accuracy. Furthermore, BPJDet generalizes to detecting one or several body parts of either humans or quadruped animals. To verify the superiority of BPJDet, we conduct experiments on body-part datasets (CityPersons, CrowdHuman, and BodyHands) and body-parts datasets (COCOHumanParts and Animals5C). While maintaining high detection accuracy, BPJDet achieves state-of-the-art association performance on all datasets. In addition, we demonstrate the benefits of advanced body-part association capability by improving the performance of two representative downstream applications: accurate crowd head detection and hand contact estimation. The project is available at https://hnuzhy.github.io/projects/BPJDet.
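The association mechanism sketched in the abstract can be illustrated with a minimal, hypothetical decoding step (this is an illustrative sketch, not the authors' implementation): each detected body carries a learned center offset pointing to where its part should be, and a detected part is matched to the body whose predicted part center lies closest to the part box's actual center.

```python
import math

def decode_association(bodies, parts):
    """Illustrative body-part association by center-offset decoding.

    bodies: list of dicts with 'box' = (cx, cy, w, h) and
            'part_offset' = (dx, dy), both in image pixels.
    parts:  list of (cx, cy, w, h) part boxes.
    Returns one part index (or None) per body.
    """
    assigned = []
    for body in bodies:
        bcx, bcy, _, _ = body["box"]
        dx, dy = body["part_offset"]
        # Predicted part center = body center + regressed offset.
        px, py = bcx + dx, bcy + dy
        best, best_dist = None, float("inf")
        for i, (pcx, pcy, _, _) in enumerate(parts):
            dist = math.hypot(px - pcx, py - pcy)
            if dist < best_dist:
                best, best_dist = i, dist
        assigned.append(best)
    return assigned

# Toy example: one body whose offset points to a head above its center.
bodies = [{"box": (100, 100, 50, 120), "part_offset": (0, -50)}]
parts = [(101, 49, 20, 20), (300, 300, 20, 20)]
print(decode_association(bodies, parts))  # → [0]
```

Because the offsets are regressed jointly with the boxes in a single representation, no separate post-hoc matching network is needed; association reduces to this nearest-center lookup at decode time.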
Authors: Huayi Zhou, Fei Jiang, Jiaxin Si, Yue Ding, Hongtao Lu