- The paper introduces the DOD task that integrates open-vocabulary detection and referring expression comprehension to handle intricate language expressions.
- It establishes the D³ dataset with comprehensive annotations covering both presence and absence descriptions to rigorously evaluate detection models.
- The proposed baseline, built on the OFA model, demonstrates improved false-positive rejection and multi-target detection capabilities.
Described Object Detection: Liberating Object Detection with Flexible Expressions
The paper "Described Object Detection: Liberating Object Detection with Flexible Expressions" presents a significant progression in the area of object detection by formalizing a new task: Described Object Detection (DOD). This task encapsulates both Open-Vocabulary object Detection (OVD) and Referring Expression Comprehension (REC), aiming to overcome the inherent limitations of each by introducing a more nuanced framework that accommodates flexible language expressions.
Introduction to Described Object Detection
Traditional OVD approaches detect objects named by short categorical labels, which limits their capability to identify objects described by longer, more intricate language expressions. REC methods, meanwhile, focus on grounding a single object referenced by a language phrase and assume such an object exists in the image, which produces false positives when the targeted object is absent. The proposed DOD framework addresses both gaps: a description may be short or long, and it may match zero, one, or many objects in an image, so models must handle language of any complexity without assuming a referent exists.
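To make the distinction concrete, the following Python sketch contrasts the three output contracts. The function names and signatures are hypothetical illustrations, not an API from the paper:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def rec_predict(image, phrase: str) -> Box:
    """REC contract: grounds exactly one region, even when nothing
    in the image actually matches the phrase."""
    ...


def ovd_predict(image, vocabulary: List[str]) -> List[Tuple[Box, str, float]]:
    """OVD contract: detects instances of short category names drawn
    from an open vocabulary; long, intricate descriptions are out of scope."""
    ...


def dod_predict(image, description: str) -> List[Tuple[Box, float]]:
    """DOD contract: accepts a free-form description of any length and
    returns zero, one, or many scored boxes -- an empty list is a valid
    answer when the description matches nothing in the image."""
    ...
```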
The Description Detection Dataset (D³)
A foundational contribution of the paper is the Description Detection Dataset (D³), which serves as the benchmark for DOD. Its language expressions are diverse, ranging from single-word category names to elaborate sentences. Notably, the dataset includes both presence descriptions and absence descriptions (ones that match nothing in the image), a layer of complexity that existing datasets do not address. Annotation is exhaustive: all objects matching a description within an image are labeled, so an unmatched description is a deliberate signal rather than a gap, enabling a more rigorous evaluation of DOD capabilities.
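The sketch below shows a hypothetical record layout reflecting these properties; it is illustrative only and not the D³ toolkit's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class DescriptionAnnotation:
    description: str  # from a one-word category name to a full sentence
    boxes: List[Box] = field(default_factory=list)

    @property
    def is_absence(self) -> bool:
        # An absence description matches nothing in this image.
        return len(self.boxes) == 0


@dataclass
class ImageRecord:
    image_path: str
    # Exhaustive annotation: every object matching each description is boxed,
    # so an empty box list is a deliberate "absent" label, not missing data.
    annotations: List[DescriptionAnnotation] = field(default_factory=list)
```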
Evaluation of Existing Methods
The paper evaluates several state-of-the-art OVD, REC, and bi-functional models on the D³ dataset, identifying key challenges and performance bottlenecks when these models are applied to the DOD task. REC models struggle to reject negative instances and to handle descriptions that target multiple objects. Conversely, OVD models falter on long and complex descriptions, exposing their limited ability to adapt to flexible language inputs without additional training on such data.
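The REC failure mode is easy to see in a simplified matching routine (a sketch, not the paper's actual COCO-style evaluation protocol): with an absence description the ground-truth list is empty, so every predicted box counts as a false positive.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_errors(preds: List[Box], gts: List[Box], thr: float = 0.5):
    """Greedy matching: returns (true positives, false positives, misses).
    For an absence description gts is empty, so every prediction is a false
    positive -- a REC model that always outputs one box is always penalized.
    A multi-target description with few predictions shows up as misses."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        hit = next((g for g in unmatched if iou(p, g) >= thr), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp = len(preds) - tp
    return tp, fp, len(unmatched)
```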
Proposed Baseline Method
To address the deficiencies observed in existing techniques, the authors propose a baseline built by modifying the OFA model. Key enhancements include reconstructing the training data to better capture multi-target scenarios and introducing a binary classification sub-task that improves the model's ability to reject false positives. The method shows a marked improvement over existing models across these nuanced aspects of DOD, though the authors acknowledge it as a step toward a robust solution rather than a complete one.
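A minimal sketch of how such a binary sub-task could be wired in, assuming a PyTorch setup with pooled image-text features; `PresenceHead`, `total_loss`, and `alpha` are illustrative names, not the authors' implementation:

```python
import torch
import torch.nn as nn


class PresenceHead(nn.Module):
    """Hypothetical binary sub-task head: predicts whether the description
    matches anything in the image, giving the model an explicit training
    signal for rejecting absence descriptions."""

    def __init__(self, dim: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, dim) pooled image-text features -> (batch,) logits
        return self.classifier(fused).squeeze(-1)


def total_loss(det_loss: torch.Tensor, presence_logits: torch.Tensor,
               has_target: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Combine the detection loss with the binary sub-task; alpha is an
    assumed weighting hyperparameter."""
    bce = nn.functional.binary_cross_entropy_with_logits(
        presence_logits, has_target.float())
    return det_loss + alpha * bce
```

At inference, thresholding the presence logit lets the model output an empty result instead of forcing a box, which is precisely the behavior REC-style models lack.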
Implications and Future Directions
The implications of this research are twofold. Practically, supporting flexible expressions in object detection opens up applications in fields like urban security, network security, and autonomous driving, where detections conditioned on complex human descriptions are crucial. Theoretically, the work bridges a gap in vision-language understanding, showing how nuanced interpretation of language can improve model robustness and accuracy in complex scenarios.
Future research should focus on scaling this approach, reducing the annotation cost of dataset creation, and integrating deeper language understanding components to further refine DOD methodologies. The repositories the paper points to track ongoing advances in this domain, keeping the field responsive to emerging methods and datasets.
This paper lays the groundwork for detecting objects through descriptive language, challenging existing paradigms and offering pathways to richer object detection technologies. Its impact lies in turning object detection into a task that comprehends and responds to the diverse ways humans naturally describe their world.