
Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions (2306.07520v5)

Published 13 Jun 2023 in cs.CV

Abstract: Human intelligence can retrieve any person according to both visual and language descriptions. However, the current computer vision community studies specific person re-identification (ReID) tasks in different scenarios separately, which limits real-world applications. This paper strives to resolve this problem by proposing a new instruct-ReID task that requires the model to retrieve images according to given image or language instructions. Instruct-ReID is a more general ReID setting in which six existing ReID tasks can be viewed as special cases obtained by designing different instructions. We propose a large-scale OmniReID benchmark and an adaptive triplet loss as a baseline method to facilitate research in this new setting. Experimental results show that the proposed multi-purpose ReID model, trained on our OmniReID benchmark without fine-tuning, improves mAP by +0.5%, +0.6%, and +7.7% on Market1501, MSMT17, and CUHK03 for traditional ReID; by +6.4%, +7.1%, and +11.2% on PRCC, VC-Clothes, and LTCC for clothes-changing ReID; by +11.7% on COCAS+ real2 for clothes-template-based clothes-changing ReID when using only RGB images; and by +24.9% on COCAS+ real2 for our newly defined language-instructed ReID. It also improves results by +4.3% on LLCM for visible-infrared ReID and by +2.6% on CUHK-PEDES for text-to-image ReID. The datasets, the model, and code will be available at https://github.com/hwz-zju/Instruct-ReID.
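
Most of the gains above are reported in mAP (mean average precision). As a rough illustration of how that retrieval metric is commonly computed, here is a minimal sketch in plain Python. It is not taken from the paper's code: the function names and the toy identity labels are hypothetical, and it assumes each query comes with a gallery already ranked by descending similarity. Benchmark protocols often add steps such as discarding gallery images from the same camera as the query, which this sketch omits.

```python
def average_precision(query_id, ranked_gallery_ids):
    """Average precision for one query.

    ranked_gallery_ids: identity labels of gallery images, sorted from
    most to least similar to the query (assumed precomputed).
    """
    hits = 0
    precision_sum = 0.0
    for rank, gallery_id in enumerate(ranked_gallery_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            # Precision at this rank, counted only at correct matches.
            precision_sum += hits / rank
    # A query with no correct match in the gallery contributes 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of per-query average precision.

    rankings: list of (query_id, ranked_gallery_ids) pairs.
    """
    if not rankings:
        return 0.0
    aps = [average_precision(q, ranked) for q, ranked in rankings]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    # Toy data with hypothetical identity labels.
    toy_rankings = [
        ("a", ["a", "b", "a"]),  # matches at ranks 1 and 3
        ("a", ["b", "a"]),       # single match at rank 2
    ]
    print(mean_average_precision(toy_rankings))
```

A "+7.7% mAP" improvement in this sense means the mean of these per-query scores, expressed in percent, rose by 7.7 points relative to the compared method.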
