Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions (2306.07520v4)
Abstract: Humans can retrieve any person from either a visual or a language description. The computer vision community, however, studies specific person re-identification (ReID) tasks separately for each scenario, which limits real-world applications. This paper addresses the problem by proposing a new instruct-ReID task that requires the model to retrieve images according to given image or language instructions. Instruct-ReID is a more general ReID setting in which six existing ReID tasks can be viewed as special cases by designing different instructions. We propose a large-scale OmniReID benchmark and an adaptive triplet loss as a baseline method to facilitate research in this new setting. Experimental results show that the proposed multi-purpose ReID model, trained on our OmniReID benchmark without fine-tuning, improves mAP by +0.5%, +0.6%, and +7.7% on Market1501, MSMT17, and CUHK03 for traditional ReID; by +6.4%, +7.1%, and +11.2% on PRCC, VC-Clothes, and LTCC for clothes-changing ReID; by +11.7% on COCAS+ real2 for clothes-template-based clothes-changing ReID when using only RGB images; and by +24.9% on COCAS+ real2 for our newly defined language-instructed ReID. It also improves results by +4.3% on LLCM for visible-infrared ReID and by +2.6% on CUHK-PEDES for text-to-image ReID. The datasets, the model, and code will be available at https://github.com/hwz-zju/Instruct-ReID.
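Most of the gains above are reported in mAP (mean average precision), the standard retrieval metric in ReID. As a reading aid, the sketch below shows how mAP can be computed from ranked gallery lists. It is a minimal illustration, not the authors' evaluation code: the function names and the toy identity labels are assumptions, and the same-camera filtering that ReID benchmarks commonly apply is omitted.

```python
def average_precision(ranked_ids, query_id):
    """AP for one query: mean of precision@k over the ranks k
    at which a gallery item with the query's identity appears."""
    hits = 0
    precisions = []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    # A query with no correct match in the gallery contributes 0 here
    # (an illustrative choice; benchmarks often skip such queries).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """mAP over (query_id, ranked_gallery_ids) pairs."""
    aps = [average_precision(ranked, qid) for qid, ranked in rankings]
    return sum(aps) / len(aps)


# Toy example with hypothetical identity labels.
rankings = [
    (1, [1, 2, 1, 3]),  # correct matches at ranks 1 and 3
    (1, [2, 1]),        # correct match at rank 2
]
print(round(mean_average_precision(rankings), 4))
```

In the toy example the first query has AP (1/1 + 2/3) / 2 and the second has AP 1/2, so the reported mAP is their mean.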
Authors:
- Weizhen He
- Yiheng Deng
- Shixiang Tang
- Qihao Chen
- Qingsong Xie
- Yizhou Wang
- Lei Bai
- Feng Zhu
- Rui Zhao
- Wanli Ouyang
- Donglian Qi
- Yunfeng Yan