UniHCP: A Unified Model for Human-Centric Perceptions (2303.02936v4)

Published 6 Mar 2023 in cs.CV

Abstract: Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, person re-identification, etc.) play a key role in industrial applications of visual models. While specific human-centric tasks have their own relevant semantic aspect to focus on, they also share the same underlying semantic structure of the human body. However, few works have attempted to exploit such homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task.
