Noisy-Correspondence Learning for Text-to-Image Person Re-identification (2308.09911v3)
Abstract: Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community that aims to retrieve a target person based on a textual query. Although numerous TIReID methods have been proposed and have achieved promising performance, they implicitly assume that the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, some image-text pairs are inevitably under-correlated or even falsely correlated, a.k.a. noisy correspondence (NC), due to low image quality and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet Alignment Loss (TAL) that relaxes the conventional Triplet Ranking Loss with the hardest negative samples to a log-exponential upper bound over all negative ones, thus preventing model collapse under NC while still focusing on hard-negative samples for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets. Code is available at https://github.com/QinYang79/RDE.
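The relaxation described for TAL can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that); it only shows the general idea of replacing the maximum over negatives in a hinge-style triplet loss with a temperature-scaled log-sum-exp, which is an upper bound on that maximum. The function names, the margin of 0.2, and the temperature `tau` of 0.02 are assumptions chosen for illustration.

```python
import math


def hardest_negative_triplet(pos_sim, neg_sims, margin=0.2):
    """Conventional triplet ranking loss that uses only the hardest negative."""
    return max(0.0, margin - pos_sim + max(neg_sims))


def log_exp_triplet(pos_sim, neg_sims, margin=0.2, tau=0.02):
    """Relaxed loss: max over negatives replaced by tau * log(sum(exp(s / tau))).

    margin and tau are illustrative values, not taken from the paper.
    """
    m = max(neg_sims)
    # Numerically stable log-sum-exp: subtract the maximum before exponentiating.
    lse = m + tau * math.log(sum(math.exp((s - m) / tau) for s in neg_sims))
    return max(0.0, margin - pos_sim + lse)


# Similarity of one matched pair and of three mismatched (negative) pairs.
pos = 0.6
negs = [0.5, 0.3, 0.1]

hard = hardest_negative_triplet(pos, negs)
soft = log_exp_triplet(pos, negs)

# The relaxed loss is never below the hardest-negative loss and exceeds it
# by at most tau * log(number of negatives).
print(hard, soft, soft - hard <= 0.02 * math.log(len(negs)))
```

Because every negative contributes to the log-sum-exp, the gradient is spread over all negatives with softmax-like weights rather than resting on a single (possibly mislabeled) hardest one; with a small `tau` those weights still concentrate on the hardest negatives.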
Authors:
- Yang Qin
- Yingke Chen
- Dezhong Peng
- Xi Peng
- Joey Tianyi Zhou
- Peng Hu