- The paper presents a comprehensive review of open vocabulary learning methods that extend neural networks beyond closed-set limitations.
- It details how vision-language pre-training boosts performance in segmentation, detection, and tracking of unseen object classes.
- It highlights challenges such as model overfitting, high training costs, and benchmarking needs to guide future research directions.
Towards Open Vocabulary Learning: A Survey
The paper "Towards Open Vocabulary Learning: A Survey" provides a comprehensive review of open vocabulary learning within the context of visual scene understanding, a paradigm that extends deep neural networks to handle segmentation, detection, and tracking beyond a fixed label set. The paper contrasts open vocabulary learning with zero-shot learning, open-set recognition, and out-of-distribution detection, emphasizing its broader applicability and practicality.
Overview
Visual scene understanding tasks traditionally rely on a closed-set assumption, limiting models to pre-defined categories present in the training set. Open vocabulary learning, however, seeks to transcend these constraints by identifying and categorizing entities beyond the annotated label space. This method leverages the rapid advancements in vision-language pre-training.
Fundamental Concepts
The survey starts by defining core concepts and comparing open vocabulary learning with existing methodologies like zero-shot learning and open-set recognition. It frames open vocabulary learning as a method that utilizes expansive language vocabularies combined with visual inputs, enabling models to generalize more effectively across unseen categories.
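The core idea of pairing an expansive language vocabulary with visual inputs can be illustrated with a minimal sketch. The embeddings below are toy stand-ins (a real system would use pretrained vision-language encoders such as CLIP); the point is only that classification reduces to similarity against class-name embeddings, so the label set can grow without retraining:

```python
import numpy as np

def normalize(v):
    """L2-normalize along the last axis so dot products are cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy text embeddings for an open vocabulary of class names (hypothetical
# values standing in for a real text encoder's output).
vocabulary = ["cat", "dog", "zebra"]  # "zebra" plays the role of an unseen class
text_embeds = normalize(np.array([
    [1.0, 0.1, 0.0],   # "cat"
    [0.1, 1.0, 0.0],   # "dog"
    [0.0, 0.1, 1.0],   # "zebra"
]))

# A toy image embedding from the (hypothetical) visual encoder.
image_embed = normalize(np.array([0.05, 0.1, 0.9]))  # closest to "zebra"

# Open-vocabulary classification: cosine similarity against every class-name
# embedding; extending the vocabulary just means adding more text embeddings.
logits = text_embeds @ image_embed
pred = vocabulary[int(np.argmax(logits))]
print(pred)  # -> zebra
```

Because the classifier is the set of text embeddings rather than a learned weight matrix, "unseen" categories only require new prompts, which is the generalization mechanism the survey attributes to vision-language pre-training.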
Recent Developments
Several key areas within open vocabulary learning are extensively reviewed:
- Object Detection: The paper highlights methods such as OVR-CNN and ViLD that make use of vision-language models like CLIP to enhance object detection capabilities. These approaches often exploit large collections of image-text pairs, such as captions, to improve generalization to novel object classes.
- Segmentation: Approaches like OpenSeg and MaskCLIP harness large pre-trained vision-language models to improve segmentation. The survey details how these models align text embeddings with dense visual features to achieve better performance on unseen-class segmentation.
- Video and 3D Understanding: The paper explores open vocabulary learning's application to video segmentation, tracking, and 3D scene interpretation. Video understanding methods integrate temporal information with visual-linguistic features, while methods for 3D data leverage 2D vision-language models to extend recognition to point clouds and scenes.
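The text-to-dense-feature alignment used in segmentation methods above can be sketched as per-pixel retrieval. The features and class embeddings here are toy values (orthogonal one-hot embeddings plus noise, not real encoder outputs), chosen only to show the mechanics of scoring every pixel against every class name:

```python
import numpy as np

# Toy class-name embeddings (C x D) -- stand-ins for text-encoder outputs.
# Orthogonal vectors keep the demo deterministic.
class_names = ["sky", "road", "giraffe"]   # "giraffe" plays an unseen class
class_embeds = np.eye(3)

# Toy dense feature map (H x W x D): each pixel feature leans toward one
# class, with a little noise, standing in for a segmentation backbone.
rng = np.random.default_rng(42)
target = np.array([[0, 0], [1, 2]])        # intended class index per pixel
features = class_embeds[target] + 0.1 * rng.normal(size=(2, 2, 3))

# Alignment step: similarity of every pixel feature against every
# class-name embedding, then argmax over the class axis -> label map.
logits = np.einsum("hwd,cd->hwc", features, class_embeds)
pred_map = logits.argmax(axis=-1)
print(pred_map)
```

Replacing a fixed per-class output layer with this similarity against text embeddings is what lets such segmenters accept new category names at inference time.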
Challenges and Future Directions
Despite the progress, several challenges remain:
- Base Class Over-fitting: Models trained on base categories tend to overfit to them, which degrades performance on novel classes; mitigating this reliance remains a key issue.
- Training Cost: The resource-intensive nature of pre-training large models is a barrier to entry for many research groups.
- Cross-domain Generalization: Developing models that consistently perform across diverse datasets is crucial.
- Benchmarking: There is a need for better datasets and metrics to fully capture the nuances of novel category recognition.
Future research directions include leveraging temporal information in video data, exploring 3D scene understanding more deeply, and designing task-specific adapters for foundation models. Furthermore, the integration of LLMs presents opportunities to boost generalization and classification capabilities.
Conclusion
This survey acts as a crucial resource for researchers aiming to extend the boundaries of open vocabulary learning. By systematically reviewing recent literature and synthesizing current knowledge, it lays the groundwork for future advancements in the field. The diversification of datasets and the evolution of benchmarks will continue to drive innovation, making open vocabulary learning a pivotal area of study within AI and machine learning.