Contextual Non-Local Alignment for Text-Based Person Search
The paper "Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search" addresses the complex task of aligning textual and visual data to allow effective person search using textual descriptions. Unlike the classical person re-identification approaches that depend on image queries, this research focuses on retrieving target person images from a gallery based on descriptive sentences. This is inherently more challenging due to the modality gap and small inter-class variance in descriptions and images.
The authors propose NAFS (Non-local Alignment over Full-Scale representations), a multi-scale solution to this task. The key idea is to perform image-text alignment across all scales, enriching feature diversity and capturing comprehensive contextual information. The architecture consists of four components:
- Staircase Network: a novel backbone structure that generates image features at every scale while preserving locality (a sketch follows this list).
- Locality-Constrained Attention with BERT: a modified BERT in which self-attention is locally constrained, so that multi-scale text features, from words and phrases up to the full sentence, can be extracted (second sketch below).
- Contextual Non-Local Attention Mechanism: aligns image and text features from all scales simultaneously, discovering latent correspondences without restricting them to pre-defined scale pairings (third sketch below).
- Re-Ranking by Visual Neighbors (RVN): refines the initial ranking by exploiting visual similarity among the top-ranked gallery images (final sketch below).
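To make the staircase idea concrete, here is a minimal PyTorch sketch in which each stage both keeps a feature map at its own scale and hands a downsampled copy to the next stage. The class names, channel sizes, and stage count are illustrative assumptions; the paper's version is built on a full CNN backbone rather than the toy blocks shown here.

```python
import torch
import torch.nn as nn

class StaircaseBlock(nn.Module):
    """One step of a staircase-style backbone: refine features at the
    current scale, then pass a downsampled copy to the next stage so
    every scale stays locally grounded. (Illustrative sketch only.)"""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.down = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, x):
        feat = self.refine(x)          # feature map at this scale
        return feat, self.down(feat)   # keep it, pass a coarser copy on

class StaircaseNetwork(nn.Module):
    """Stacks blocks so the model emits a pyramid of full-scale features."""
    def __init__(self, channels=(3, 64, 128, 256)):
        super().__init__()
        self.blocks = nn.ModuleList(
            StaircaseBlock(c_in, c_out)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, x):
        feats = []
        for block in self.blocks:
            feat, x = block(x)
            feats.append(feat)         # one feature map per scale
        return feats

# Example: a 3x256x128 pedestrian crop yields three feature maps
# at decreasing spatial resolution.
maps = StaircaseNetwork()(torch.randn(1, 3, 256, 128))
print([m.shape for m in maps])
```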
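The locality constraint on BERT can be pictured as a band mask on the self-attention scores: a tight window yields word- and phrase-level features, while a wide window recovers sentence-level ones. The sketch below is a single-head simplification with assumed shapes and function names, not the authors' exact layer.

```python
import torch

def locality_mask(seq_len, window):
    """Boolean mask letting each token attend only to tokens within
    `window` positions of itself; a simple stand-in for the locality
    constraint imposed on BERT's self-attention."""
    pos = torch.arange(seq_len)
    return (pos[None, :] - pos[:, None]).abs() <= window

def local_self_attention(q, k, v, window):
    """Scaled dot-product attention with a locality constraint.
    q, k, v: (batch, seq_len, dim). Single-head for brevity."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (B, L, L)
    mask = locality_mask(q.size(1), window).to(scores.device)
    scores = scores.masked_fill(~mask, float("-inf"))  # block distant tokens
    return torch.softmax(scores, dim=-1) @ v

# A tight window approximates word/phrase-level features;
# a window covering the whole sequence recovers sentence-level ones.
x = torch.randn(2, 16, 64)
word_feats = local_self_attention(x, x, x, window=2)
sent_feats = local_self_attention(x, x, x, window=16)
```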
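The contextual non-local attention can be approximated by a generic cross-attention in which textual parts pooled from every scale attend over visual parts pooled from every scale, so each phrase is free to match a region at whichever scale fits best. The temperature value and the similarity aggregation below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def nonlocal_alignment(img_feats, txt_feats):
    """Cross-modal non-local attention over features from ALL scales
    at once. img_feats: (N_img, D) visual parts from every scale;
    txt_feats: (N_txt, D) textual parts from every scale.
    Returns a scalar image-text similarity (sketch only)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    affinity = txt @ img.T                        # (N_txt, N_img) phrase-to-region
    attn = torch.softmax(affinity / 0.1, dim=-1)  # each phrase picks its regions
    attended = attn @ img                         # visual context per phrase
    # Similarity: how well each phrase matches its attended visual context.
    return F.cosine_similarity(txt, attended, dim=-1).mean()

# Features from all scales are flattened into one pool per modality,
# so alignment is never restricted to a pre-defined scale pairing.
img_all_scales = torch.cat([torch.randn(24, 256), torch.randn(6, 256)])
txt_all_scales = torch.cat([torch.randn(12, 256), torch.randn(1, 256)])
print(nonlocal_alignment(img_all_scales, txt_all_scales).item())
```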
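Finally, RVN can be sketched as a score-blending step: each gallery image's text-to-image score is mixed with the query's affinity to that image's k nearest visual neighbors. The function name, the choice of k, and the linear blend are hypothetical; the paper's re-ranking rule may differ in detail.

```python
import torch

def rerank_by_visual_neighbors(t2i_sim, i2i_sim, k=5, alpha=0.5):
    """Refine a text-to-image ranking with image-to-image similarity.
    t2i_sim: (N_img,) initial query-to-gallery scores;
    i2i_sim: (N_img, N_img) visual similarities between gallery images.
    Blends each image's score with the query's affinity to its k nearest
    visual neighbors (illustrative blending rule, not the paper's)."""
    # For each image, find its k visual neighbors (index 0 is itself).
    neighbor_idx = i2i_sim.topk(k + 1, dim=-1).indices[:, 1:]  # (N, k)
    neighbor_score = t2i_sim[neighbor_idx].mean(dim=-1)        # (N,)
    return alpha * t2i_sim + (1 - alpha) * neighbor_score

# Usage: refine an initial ranking of 100 gallery images.
t2i = torch.rand(100)
feats = torch.nn.functional.normalize(torch.randn(100, 256), dim=-1)
refined = rerank_by_visual_neighbors(t2i, feats @ feats.T)
ranking = refined.argsort(descending=True)
```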
The experimental results show significant improvements over existing state-of-the-art methods on the CUHK-PEDES dataset, with a 5.53% gain in top-1 accuracy and a 5.35% gain in top-5 accuracy. This indicates that the model effectively bridges the modality gap and extracts fine-grained alignments between visual and textual inputs.
Crucially, the paper argues that text-based person search benefits from full-scale representation processing. This extends beyond typical dual-path models by introducing a contextual non-local attention mechanism that establishes correspondences dynamically across scales. Practically, this could make systems that locate individuals in image collections from textual descriptions noticeably more reliable.
Theoretically, the non-local alignment approach may generalize to other multimodal tasks, pointing to a promising direction for cross-modal retrieval systems. The results make a compelling case for full-scale feature alignment in domains beyond person search, such as natural-language-based image retrieval and automated surveillance.
Future work could build on these findings by further improving the alignment of disparate modal features, extending the textual inputs beyond static descriptions, and refining the attention mechanisms to capture context at even finer granularity, thereby strengthening cross-modal perceptual understanding of complex and varied data.