SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation Paradigm (2312.01640v1)

Published 4 Dec 2023 in cs.CV and cs.MM

Abstract: Current pedestrian attribute recognition (PAR) algorithms are built on multi-label or multi-task learning frameworks, which discriminate among attributes using task-specific classification heads. However, these discriminative models are easily influenced by imbalanced data and noisy samples. Inspired by the success of generative models, we rethink the pedestrian attribute recognition scheme and argue that generative models may better capture the dependencies and complexity among human attributes. In this paper, we propose a novel sequence generation paradigm for pedestrian attribute recognition, termed SequencePAR. It extracts pedestrian features using a pre-trained CLIP model and embeds the attribute set into query tokens under the guidance of text prompts. A Transformer decoder then generates the human attributes by incorporating the visual features and attribute query tokens. A masked multi-head attention layer is introduced into the decoder module to prevent the model from seeing the next attribute while making attribute predictions during training. Extensive experiments on multiple widely used pedestrian attribute recognition datasets fully validate the effectiveness of our proposed SequencePAR. The source code and pre-trained models will be released at https://github.com/Event-AHU/OpenPAR.
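To make the sequence-generation paradigm concrete, the sketch below shows one plausible way to wire it up in PyTorch: attribute tokens are fed to a Transformer decoder that cross-attends to CLIP visual features, while a causal (masked) self-attention keeps each position from seeing later attributes during training. This is a minimal illustration under stated assumptions, not the authors' released code (see the OpenPAR repository above); `SequencePARSketch` and all hyperparameters are hypothetical, and positional encodings plus the text-prompt guidance for the query tokens are omitted for brevity.

```python
# Minimal PyTorch sketch of the SequencePAR idea (illustrative only, not the
# released implementation): attributes are decoded autoregressively from
# frozen CLIP image features using masked (causal) self-attention.
import torch
import torch.nn as nn

class SequencePARSketch(nn.Module):
    def __init__(self, num_attributes, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        # One embedding per attribute token, plus a start-of-sequence token.
        # (The paper derives query tokens under text-prompt guidance from
        # CLIP; a plain learned embedding stands in for that here.)
        self.attr_embed = nn.Embedding(num_attributes + 1, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_attributes)  # next-attribute logits

    def forward(self, visual_tokens, attr_seq):
        # visual_tokens: (B, N, d_model) patch features from the image encoder.
        # attr_seq: (B, T) indices of the shifted ground-truth attribute sequence.
        tgt = self.attr_embed(attr_seq)  # positional encodings omitted for brevity
        # Causal additive mask: -inf above the diagonal stops position t from
        # attending to (i.e. "remembering") attributes at positions > t.
        T = attr_seq.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, visual_tokens, tgt_mask=causal)
        return self.head(hidden)  # (B, T, num_attributes)

# Toy usage with random tensors standing in for projected CLIP patch tokens.
model = SequencePARSketch(num_attributes=26)
feats = torch.randn(2, 197, 512)    # e.g. 196 patches + [CLS], projected to 512-d
seq = torch.randint(0, 27, (2, 5))  # start token followed by 4 attribute tokens
logits = model(feats, seq)          # -> (2, 5, 26)
```

At inference time, the same module would be run step by step: start from the start-of-sequence token, take the argmax of the last position's logits, append it to the sequence, and repeat until all attributes are emitted.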

Authors (5)
  1. Jiandong Jin (11 papers)
  2. Xiao Wang (507 papers)
  3. Chenglong Li (94 papers)
  4. Lili Huang (8 papers)
  5. Jin Tang (139 papers)
Citations (5)
