Prototype-Guided Text-based Person Search based on Rich Chinese Descriptions (2312.14834v1)
Abstract: Text-based person search aims to simultaneously localize and identify a target person in uncropped scene images from a text query, and can be regarded as the unified task of person detection and text-based person retrieval. In this work, we propose a large-scale benchmark dataset named PRW-TPS-CN, built on the widely used person search dataset PRW. Our dataset contains 47,102 sentences, providing substantially more textual information than existing datasets. The texts precisely describe each person image from top to bottom, in line with the natural order of description. We also provide both Chinese and English descriptions for more comprehensive evaluation. These characteristics make our dataset more broadly applicable. To alleviate the inconsistency between person detection and text-based person retrieval, we take advantage of the rich texts in the PRW-TPS-CN dataset: we aggregate the multiple texts of each person into a text prototype that preserves the person's prominent textual features and better reflects the person as a whole. The prototypes are then used to generate image attention maps that suppress the detection misalignment which would otherwise degrade text-based person retrieval, so the inconsistency between person detection and text-based person retrieval is largely alleviated. Extensive experiments on the PRW-TPS-CN dataset demonstrate the dataset's value and the state-of-the-art performance of our approach.
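To make the mechanism in the abstract concrete, the sketch below aggregates several description embeddings of one identity into a text prototype and uses it to attend over the feature map of a detected person box. This is a minimal sketch only, assuming PyTorch-style encoder outputs; the tensor shapes, the mean-pooling aggregation, and the helper names `text_prototype` and `prototype_attention` are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of prototype-guided attention; all names, shapes,
# and the mean-pooling choice are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def text_prototype(text_feats: torch.Tensor) -> torch.Tensor:
    """Aggregate the K description embeddings of one person (K x D)
    into a single prototype that keeps the shared, prominent features."""
    return F.normalize(text_feats.mean(dim=0), dim=-1)  # (D,)

def prototype_attention(img_feats: torch.Tensor, proto: torch.Tensor) -> torch.Tensor:
    """Build a spatial attention map over a detected box's feature map
    (D x H x W) from its similarity to the text prototype, then reweight
    the features to down-weight misaligned background regions."""
    d, h, w = img_feats.shape
    flat = F.normalize(img_feats.view(d, -1), dim=0)  # (D, H*W), unit columns
    sim = proto @ flat                                # (H*W,) cosine scores
    attn = torch.softmax(sim, dim=0).view(1, h, w)    # (1, H, W) attention map
    return img_feats * attn                           # attended feature map

# Toy usage with random tensors standing in for encoder outputs.
texts = torch.randn(5, 256)        # 5 descriptions of one identity
feats = torch.randn(256, 24, 8)    # feature map of one detected box
attended = prototype_attention(feats, text_prototype(texts))
print(attended.shape)              # torch.Size([256, 24, 8])
```

Mean pooling is the simplest aggregation that keeps features common to all descriptions while averaging out phrasing noise; the paper's actual prototype construction and attention design may differ.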
Authors: Ziqiang Wu, Bingpeng Ma