
Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification

Published 28 Dec 2023 in cs.CV | (2312.16797v1)

Abstract: Fine-grained attribute descriptions can significantly supplement the semantic information available for a person image, which is vital to the success of the person re-identification (ReID) task. However, current ReID algorithms typically fail to leverage the rich contextual information available, primarily because they use image attributes in a simplistic and coarse way. Recent advances in artificial-intelligence-generated content have made it possible to automatically generate plentiful fine-grained attribute descriptions and make full use of them. Accordingly, this paper explores the potential of using multiple generated person attributes as prompts in ReID tasks with off-the-shelf (large) models for more accurate retrieval results. To this end, we present a new framework called Multi-Prompts ReID (MP-ReID), based on prompt learning and large language models (LLMs), to fully exploit fine-grained attributes in support of the ReID task. Specifically, MP-ReID first learns to hallucinate diverse, informative, and promptable sentences for describing the query images. This procedure includes (i) explicit prompts stating which attributes a person has and (ii) implicit learnable prompts for adjusting or conditioning the criteria used for matching this person's identity. Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models. Moreover, an alignment module is designed to fuse the multiple prompts (i.e., explicit and implicit ones) progressively and mitigate the cross-modal gap. Extensive experiments on the existing attribute-involved ReID datasets, namely Market1501 and DukeMTMC-reID, demonstrate the effectiveness and rationality of the proposed MP-ReID solution.
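
To make the distinction between explicit and implicit prompts concrete, the following is a toy Python sketch and not the paper's implementation. Everything in it is an assumption made for illustration: the function names, the sentence template, the hash-seeded `toy_text_embedding` that stands in for a real text encoder, and the weighted average in `fuse_prompts`, which is only a placeholder for the alignment module the abstract describes.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_explicit_prompt(attributes):
    """Explicit prompt: a sentence that states which attributes a person has.

    The template is hypothetical; the abstract only says such sentences are
    generated by models such as ChatGPT and VQA models.
    """
    parts = [f"{value} {key}" for key, value in sorted(attributes.items())]
    return "a photo of a person with " + ", ".join(parts)


def toy_text_embedding(text, dim=8):
    """Stand-in for a text encoder: a deterministic pseudo-random vector per string."""
    rng = random.Random(text)  # string seeds are reproducible across runs
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def fuse_prompts(explicit_vec, implicit_vec, weight=0.5):
    """Placeholder fusion: convex combination of the two embeddings, re-normalized.

    This is NOT the paper's alignment module, only an assumed stand-in.
    """
    mixed = [weight * e + (1.0 - weight) * i
             for e, i in zip(explicit_vec, implicit_vec)]
    return l2_normalize(mixed)


# Explicit prompt built from (hypothetical) generated attributes.
attrs = {"backpack": "black", "hair": "short", "upper clothing": "red"}
prompt = build_explicit_prompt(attrs)
explicit_vec = toy_text_embedding(prompt)

# Implicit prompt: a free vector that would be learned during training;
# here it is only randomly initialized.
implicit_vec = [random.Random(0).uniform(-1.0, 1.0) for _ in range(8)]

fused = fuse_prompts(explicit_vec, implicit_vec, weight=0.5)
```

In a real system the implicit vector would be updated by gradient descent against a ReID objective, and the fused text representation would be compared with image features from a vision encoder; neither step is shown here.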

