Essay on "Referring to Any Person"
Overview
The paper "Referring to Any Person" examines the task of localizing specific individuals in images from natural language descriptions alone. The task, which the authors call "referring to any person," exposes significant deficiencies in existing models, which falter in real-world settings because they cannot handle multi-instance scenarios or generalized object detection. The authors respond with a comprehensive framework: a formal task definition, a new dataset called HumanRef, and a model architecture called RexSeek.
HumanRef Dataset
The HumanRef dataset is designed to overcome the limitations of existing datasets, which focus mainly on one-to-one referential contexts (one expression, one person). It is organized around five central referential aspects:
- Attributes: Encompassing intrinsic characteristics like age, gender, and clothing.
- Position: Defining the individual's spatial relationships within the scene.
- Interaction: Capturing human interactions with objects or other people.
- Reasoning: Requiring inference from multi-step logical sequences.
- Celebrity Recognition: Identifying well-known personalities or characters.
Crucially, the dataset supports multi-instance referring, where a single expression corresponds to multiple individuals, so a model must discriminate among candidates rather than return a single box.
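To make the multi-instance setting concrete, here is a minimal sketch of what such an annotation might look like. The field names (`expression`, `boxes`, `aspect`) and the sample values are illustrative assumptions, not HumanRef's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ReferringSample:
    """One referring annotation; hypothetical schema, not HumanRef's actual format."""
    expression: str                          # natural-language description
    boxes: list[tuple[int, int, int, int]]   # one (x1, y1, x2, y2) box per referred person
    aspect: str                              # e.g. "attribute", "position", "interaction"

sample = ReferringSample(
    expression="the two children playing on the left",
    boxes=[(40, 120, 110, 300), (115, 130, 180, 310)],
    aspect="position",
)

# Unlike one-to-one benchmarks, a correct prediction must recover every box.
print(len(sample.boxes))  # → 2
```

The key difference from RefCOCO-style data is that `boxes` is a list: an expression can legitimately refer to zero, one, or many people.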
RexSeek Model
Central to the paper is the RexSeek model, a multimodal LLM that couples advanced perception with language comprehension. The model's key move is to frame referring as a retrieval problem: rather than regressing a single box from text, it integrates object detection with language processing, scoring detected candidates against the expression. RexSeek is trained in multiple stages, beginning with general object perception and refining on referring-specific data, which lets it handle both human-centric and generalized object detection tasks.
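The retrieval framing can be illustrated with a toy sketch: embed the text query and each detected person box in a shared space, then return every candidate whose similarity clears a threshold. This is a simplified stand-in, not RexSeek's actual architecture; the function names and threshold are assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_referred(query_emb, candidate_embs, threshold=0.5):
    """Retrieval-style referring: score every detected person against the
    text query and keep all candidates above the threshold, so one
    expression can select zero, one, or many people."""
    return [i for i, emb in enumerate(candidate_embs)
            if cosine(query_emb, emb) >= threshold]

# Toy embeddings: two candidates aligned with the query, one opposed.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [1.0, 0.2], [-1.0, 0.0]]
print(select_referred(query, candidates))  # → [0, 1]
```

Because selection is a per-candidate decision rather than a single argmax, the multi-instance case falls out naturally.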
Experimental Results
Empirical evaluations underscore the inadequacies of current state-of-the-art models when applied to HumanRef. While these models perform satisfactorily on traditional one-to-one benchmarks (e.g., RefCOCO, RefCOCO+, RefCOCOg), their performance drops sharply under HumanRef's multi-instance conditions, primarily due to training biases inherited from existing datasets. RexSeek, by contrast, demonstrates higher recall and precision across the referential tasks, significantly outperforming these baselines both in handling multi-instance recognition and in reducing false positives.
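To show why multi-instance evaluation stresses single-box models, here is a generic recall/precision computation with greedy IoU matching. The exact HumanRef protocol may differ; this is an illustrative sketch of the standard detection-style metric:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def recall_precision(preds, gts, thr=0.5):
    """Greedy one-to-one matching at an IoU threshold; any prediction
    that matches no ground-truth box counts as a false positive."""
    matched, tp = set(), 0
    for p in preds:
        best, best_iou = None, thr
        for j, g in enumerate(gts):
            if j not in matched and iou(p, g) >= best_iou:
                best, best_iou = j, iou(p, g)
        if best is not None:
            matched.add(best)
            tp += 1
    recall = tp / len(gts) if gts else 1.0
    precision = tp / len(preds) if preds else 1.0
    return recall, precision

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]      # two referred people
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]    # one hit, one false positive
print(recall_precision(preds, gts))  # → (0.5, 0.5)
```

A model trained to always output exactly one box is capped at 50% recall on a two-person expression, which is precisely the bias the essay describes.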
Implications and Future Directions
The introduction of HumanRef and RexSeek marks substantial progress in developing models capable of intricate human-object interactions and multiplicity in detection tasks. Practically, this research paves the way for advancements in multiple applications, such as robotics, autonomous vehicles, and surveillance systems that require precise human recognition in complex environments.
Theoretically, it challenges the community to refine pre-existing benchmarks and model architectures to accommodate the multifaceted nature of real-world scenarios. Future research could explore extensions of RexSeek's architecture to incorporate temporal and contextual data, thereby enhancing its applicability to dynamic environments like video streams.
In conclusion, by bridging the gap between natural language descriptions and precise human identification, this work takes a foundational step toward more versatile and broadly applicable AI systems. The methods and insights it presents matter as we work toward machines that understand human-centric scenarios with the same nuance that people bring to them.