Image Re-Identification: Where Self-supervision Meets Vision-Language Learning (2407.20647v1)
Abstract: Recently, large-scale vision-language pre-trained models like CLIP have shown impressive performance in image re-identification (ReID). In this work, we explore whether self-supervision can aid the use of CLIP for image ReID tasks. Specifically, we propose SVLL-ReID, the first attempt to integrate self-supervision and pre-trained CLIP via two training stages to facilitate image ReID. We observe that: 1) incorporating language self-supervision in the first training stage makes the learnable text prompts more distinguishable, and 2) incorporating vision self-supervision in the second training stage makes the image features learned by the image encoder more discriminative. These observations imply that: 1) text prompt learning in the first stage benefits from language self-supervision, and 2) image feature learning in the second stage benefits from vision self-supervision. Together, these benefits drive the performance gain of the proposed SVLL-ReID. Experiments on six image ReID benchmark datasets without any concrete text labels show that SVLL-ReID achieves the overall best performance compared with state-of-the-art methods. Code will be publicly available at https://github.com/BinWangGzhu/SVLL-ReID.
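As a rough illustration of the two-stage scheme described in the abstract, the sketch below shows how a CLIP-style contrastive objective could be combined with a self-supervision term in each stage, with the non-trained side frozen. This is not the authors' code: the function names, the toy cosine-softmax loss, and the placeholder `lang_ssl`/`vis_ssl` terms are all assumptions for exposition.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(a * a for a in v)) or 1.0
    return dot / (nu * nv)

def contrastive_loss(text_feats, image_feats):
    # Toy image-to-text contrastive loss: each image should match
    # the text prompt with the same index (same identity).
    loss = 0.0
    for i, img in enumerate(image_feats):
        logits = [math.exp(cosine(img, t)) for t in text_feats]
        loss += -math.log(logits[i] / sum(logits))
    return loss / len(image_feats)

def stage1_loss(prompt_feats, frozen_image_feats, lang_ssl):
    # Stage 1: the image encoder is frozen; only the learnable text
    # prompts are optimized, with a language self-supervision term added.
    return contrastive_loss(prompt_feats, frozen_image_feats) + lang_ssl(prompt_feats)

def stage2_loss(frozen_prompt_feats, image_feats, vis_ssl):
    # Stage 2: the learned prompts are frozen; only the image encoder
    # is optimized, with a vision self-supervision term added.
    return contrastive_loss(frozen_prompt_feats, image_feats) + vis_ssl(image_feats)
```

In practice each stage would run gradient descent on its loss with the other component's parameters detached; here the `lang_ssl` and `vis_ssl` callables stand in for whatever self-supervised objectives (e.g. masked-prediction or augmentation-consistency losses) the method applies in each stage.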
Authors: Bin Wang, Yuying Liang, Lei Cai, Huakun Huang, Huanqiang Zeng