Person Search with Natural Language Description (1702.05729v2)

Published 19 Feb 2017 in cs.CV

Abstract: Searching persons in large-scale image databases with the query of natural language description has important applications in video surveillance. Existing methods mainly focused on searching persons with image-based or attribute-based queries, which have major limitations for practical usage. In this paper, we study the problem of person search with natural language description. Given the textual description of a person, the person search algorithm is required to rank all the samples in the person database and then retrieve the most relevant sample corresponding to the queried description. Since there is no person dataset or benchmark with textual description available, we collect a large-scale person description dataset with detailed natural language annotations and person samples from various sources, termed the CUHK Person Description Dataset (CUHK-PEDES). A wide range of possible models and baselines have been evaluated and compared on the person search benchmark. A Recurrent Neural Network with Gated Neural Attention mechanism (GNA-RNN) is proposed to establish the state-of-the-art performance on person search.

Authors (6)

Shuang Li (203 papers)
Tong Xiao (119 papers)
Hongsheng Li (340 papers)
Bolei Zhou (134 papers)
Dayu Yue (1 paper)
Xiaogang Wang (230 papers)

Citations (350)


Summary

Person Search with Natural Language Description: An Overview

This paper addresses the problem of person search in large-scale image databases using natural language descriptions as queries. Earlier person search methods rely mainly on image-based or attribute-based queries, which limits their use in practical settings such as video surveillance. The authors contribute a dataset, a benchmark, and a model for matching free-form text descriptions against person images, so that a written description can serve as the query where no query image or suitable attribute list is available.

Core Contributions

Dataset Creation: The authors introduce the CUHK Person Description Dataset (CUHK-PEDES), which they present as the first large-scale dataset annotated with natural language descriptions for person search. It comprises 40,206 images of 13,003 individuals, with the images drawn from multiple existing person re-identification datasets, along with 80,412 textual descriptions (two per image on average). The dataset is notable for its diverse vocabulary and detailed sentence structures.
Model Development: A Recurrent Neural Network (RNN) with a Gated Neural Attention mechanism (GNA-RNN) is proposed. The model estimates the affinity between a person image and a text description. For each word, a unit-level attention weights the visual units by their relevance to that word, and a word-level gate weights how much that word contributes to the final score, so that informative words count more than uninformative ones.
Comparative Evaluation: The paper evaluates a range of baselines, including image captioning, visual question answering (visual QA), and visual-semantic embedding models, alongside the proposed method. The GNA-RNN outperforms these alternatives, achieving the best retrieval accuracy among the compared methods on the CUHK-PEDES benchmark.
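
The interaction between unit-level attention and word-level gates can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `gated_affinity` and its inputs are hypothetical, and the per-word attention logits and gate logits are passed in directly, whereas in a trained model they would be produced by the recurrent network from the word sequence and the image features.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function, used here to squash a gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one word's attention logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_affinity(
    visual_units: Sequence[float],
    attention_logits: Sequence[Sequence[float]],
    gate_logits: Sequence[float],
) -> float:
    """Hypothetical helper: sentence-image affinity as a gated sum over words.

    visual_units     -- activations of the image's visual units
    attention_logits -- one row per word, one logit per visual unit (assumed given)
    gate_logits      -- one logit per word (assumed given)
    """
    if len(attention_logits) != len(gate_logits):
        raise ValueError("need one gate logit per word")
    affinity = 0.0
    for logits, gate_logit in zip(attention_logits, gate_logits):
        if len(logits) != len(visual_units):
            raise ValueError("need one attention logit per visual unit")
        # Unit-level attention: how relevant each visual unit is to this word.
        weights = softmax(logits)
        word_score = sum(w * v for w, v in zip(weights, visual_units))
        # Word-level gate: how much this word counts toward the final score.
        affinity += sigmoid(gate_logit) * word_score
    return affinity


if __name__ == "__main__":
    units = [0.9, 0.1]               # toy activations of two visual units
    attn = [[2.0, 0.0], [0.0, 0.0]]  # word 1 focuses on unit 0, word 2 is diffuse
    gates = [3.0, -3.0]              # word 1 is informative, word 2 is not
    print(gated_affinity(units, attn, gates))
```

In this toy setup the second word contributes little because its gate is nearly closed, which is the behaviour a word-level gate is meant to provide for words such as articles or prepositions.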

Numerical Results

On the CUHK-PEDES benchmark, the GNA-RNN reaches a top-1 accuracy of 19.05% and a top-10 accuracy of 53.64%. The improvement over the baselines is attributed to the gated attention mechanism, which relates individual words to the visual units they describe.
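
Top-k accuracy for this kind of retrieval task can be computed as in the following sketch. It assumes that a query counts as a hit when at least one image of the described person appears among the k highest-scoring gallery images; the function name `top_k_accuracy`, the score-matrix layout, and the toy numbers are illustrative assumptions, not taken from the paper's evaluation code.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    gallery_ids: Sequence[int],
    query_ids: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose person appears in the top-k ranked images.

    scores[q][i] -- affinity between text query q and gallery image i
    gallery_ids  -- person identity of each gallery image
    query_ids    -- person identity described by each text query
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not query_ids or len(scores) != len(query_ids):
        raise ValueError("need one non-empty score row per query")
    hits = 0
    for row, person in zip(scores, query_ids):
        # Rank gallery images from highest to lowest affinity.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        # The query is a hit if any top-k image shows the described person.
        if any(gallery_ids[i] == person for i in ranked[:k]):
            hits += 1
    return hits / len(query_ids)


if __name__ == "__main__":
    gallery = [1, 2, 3]        # toy gallery: one image per person
    queries = [1, 2, 3]        # each query describes one person
    toy_scores = [
        [0.9, 0.2, 0.1],       # query for person 1: correct image ranked first
        [0.6, 0.5, 0.1],       # query for person 2: correct image ranked second
        [0.3, 0.2, 0.1],       # query for person 3: correct image ranked last
    ]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(toy_scores, gallery, queries, k))
```

Because a hit only requires one matching image in the top k, the metric is non-decreasing in k, which is consistent with the top-10 figure being much higher than the top-1 figure.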

Implications

Practical Applications: The method is relevant to video surveillance systems. Image-based queries require an existing photo of the person being sought, and attribute-based queries are restricted to a predefined set of attributes; a natural language query needs neither, so a free-form description such as a witness account can be used directly.
Theoretical Impact: The GNA-RNN contributes to multimodal learning by showing that attention over visual units, combined with gating over words, improves the matching of images to text descriptions relative to the evaluated baselines.

Future Directions

Cross-domain Applications: The synergy between natural language processing and computer vision showcased in this work paves the way for applications beyond surveillance, such as autonomous systems and human-computer interaction.
Expanded Dataset Annotation: Further enriching the dataset with diverse languages and even more descriptive variability could yield models that generalize across broader linguistic contexts.
Enhanced Model Architectures: Future work could explore more sophisticated neural architectures or integration with other AI paradigms to further increase precision and reduce computational demands.

In conclusion, this research contributes a dataset, a benchmark, and a model for person search with natural language queries. Together they provide a starting point for building person retrieval systems that can be queried with free-form descriptions.
