See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval (2208.08608v2)

Published 18 Aug 2022 in cs.CV

Abstract: Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a common latent space mapping between visual-textual modalities. To achieve this goal, existing works employ segmentation to obtain explicit cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments is time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To relieve these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Different from previous models, IVT utilizes a single network to learn representations for both modalities, which contributes to the visual-textual interaction. To explore fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores finer matching at sentence, phrase, and word levels, while the BMM module aims to mine more semantic alignments between the visual and textual modalities. Extensive experiments are carried out to evaluate the proposed IVT on public datasets, i.e., CUHK-PEDES, RSTPReid, and ICFG-PEDES. Even without explicit body part alignment, our approach still achieves state-of-the-art performance. Code is available at: https://github.com/TencentYoutuResearch/PersonRetrieval-IVT.

Implicit Modality Alignment for Text-based Person Retrieval

The paper "See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval" presents a novel approach to address the challenges in text-based person retrieval (TPR) by introducing an Implicit Visual-Textual (IVT) framework. This framework is designed to overcome limitations observed in existing methods, such as the explicit cross-modal alignments that require time-consuming labeling and potential neglect of subtle but valuable cross-modal pairs by attention mechanisms. The IVT framework utilizes a unified network to learn representations for both visual and textual modalities, thereby enhancing visual-textual interactions and enabling the discovery of fine-grained alignments through implicit means.

Crucially, the paper introduces two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). MLA encourages finer alignment by matching at the sentence, phrase, and word levels, so retrieval can operate at varying descriptive granularities. BMM, inspired by recent advances in masked autoencoders, mines non-obvious semantic alignments by masking portions of the visual and textual inputs, forcing the model to attend not only to salient attributes but also to the subtle descriptors that strengthen cross-modal matching.
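As a rough illustration of the BMM idea, the sketch below randomly masks a fraction of each modality's embedded tokens and scores the resulting global embeddings with a symmetric contrastive loss. The masking ratio and the InfoNCE-style objective are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def random_mask(tokens: torch.Tensor, ratio: float = 0.3) -> torch.Tensor:
    """Zero out a random subset of token embeddings. tokens: (B, N, D)."""
    keep = torch.rand(tokens.shape[:2], device=tokens.device) > ratio
    return tokens * keep.unsqueeze(-1)

def symmetric_matching_loss(img_emb: torch.Tensor,
                            txt_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE in both directions over global embeddings of shape (B, D)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In a full training loop, the masked patch and token sequences would be passed through the shared encoder sketched above before the loss is computed, so each modality must recover its match from incomplete evidence rather than from a few salient cues alone.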

Comprehensive experiments were conducted on well-established datasets such as CUHK-PEDES, RSTPReid, and ICFG-PEDES. The results demonstrate that the IVT framework achieves state-of-the-art performance, even surpassing methodologies that employ explicit part-based alignments, indicating the robustness and applicability of implicit alignment strategies in TPR. The IVT framework's superior performance highlights its capability to see "finer" and "more" than traditional methods; it leverages the shared learning tasks across modalities without reliance on costly cross-modal interaction layers such as cross-attention, thus maintaining inference efficiency.

The theoretical implications of this work suggest that unified networks, alongside sophisticated yet minimalistic implicit alignment strategies, can effectively bridge the semantic gap between visual and textual modalities in retrieval tasks. Practically, the deployment of such a framework could significantly enhance surveillance systems where prompt, accurate retrieval of individuals based on textual descriptions is often required.

Future research directions may involve the exploration of additional implicit alignment strategies and the adaptation of such frameworks to other bi-modal or multi-modal retrieval tasks. Moreover, extending the scope of unified networks to encompass more complex or dynamically changing real-world scenarios remains a promising area for further innovation and refinement in AI-driven solutions for person retrieval.

Authors (8)
  1. Xiujun Shu (16 papers)
  2. Wei Wen (49 papers)
  3. Haoqian Wu (14 papers)
  4. Keyu Chen (76 papers)
  5. Yiran Song (7 papers)
  6. Ruizhi Qiao (18 papers)
  7. Bo Ren (60 papers)
  8. Xiao Wang (507 papers)
Citations (70)