- The paper presents UNINEXT, a unified framework that reformulates instance perception as a prompt-guided object discovery and retrieval task.
- It integrates a Transformer-based encoder-decoder with a prompt generation module and early fusion of image-prompt features for robust, context-rich predictions.
- Empirical results on 20 datasets, including a box AP of 60.6 with a ViT-Huge backbone, validate its superior performance and efficiency across vision tasks.
Review of "Universal Instance Perception as Object Discovery and Retrieval"
The paper "Universal Instance Perception as Object Discovery and Retrieval" presents a sophisticated approach to unifying various instance perception tasks within a single framework, termed UNINEXT. The primary innovation lies in reformulating the fragmented landscape of instance perception into a cohesive paradigm of object discovery and retrieval guided by diverse input prompts. This unification allows one model with one set of parameters to be trained and deployed across multiple computer vision sub-tasks, a substantial gain in efficiency.
Unified Instance Perception Architecture
The authors categorize instance perception tasks by their input prompts into three types: category names, language expressions, and reference annotations. Under this categorization, every task reduces to the same prompt-guided object discovery and retrieval problem, which UNINEXT solves with a single model architecture notable for its parameter efficiency and its ability to serve multiple tasks simultaneously.
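As a concrete illustration of this taxonomy, the sketch below groups classic instance perception tasks under the three prompt types. The exact task lists are the reviewer's reading of the paper's grouping, included here for orientation rather than as a verbatim reproduction:

```python
# Which classic instance perception tasks each prompt type subsumes under
# the discovery-and-retrieval reformulation. The groupings below are the
# reviewer's summary of the paper's taxonomy (an assumption, not a quote).
PROMPT_TYPES = {
    "category names": [
        "object detection",
        "instance segmentation",
        "multi-object tracking",
        "video instance segmentation",
    ],
    "language expressions": [
        "referring expression comprehension",
        "referring expression segmentation",
        "referring video object segmentation",
    ],
    "reference annotations": [
        "single object tracking",
        "video object segmentation",
    ],
}

def tasks_for(prompt_type: str) -> list:
    """Return the classic tasks a given prompt type unifies."""
    return PROMPT_TYPES[prompt_type]
```

Seen this way, the "unification" is simply that all of these tasks differ only in what the prompt is, not in what the model must do with it.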
The model's design includes an encoder-decoder configuration that leverages the strengths of Transformer-based architectures. By incorporating a prompt generation module alongside a fusion mechanism, UNINEXT strengthens its ability to perceive complex visual contexts while remaining highly flexible. The early fusion of image and prompt features yields a context-rich representation, crucial for robust instance prediction.
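To make the discovery-and-retrieval framing concrete, here is a minimal stdlib-only Python sketch. All names, shapes, and the toy fusion and scoring rules are assumptions for illustration only; the actual model performs early fusion with attention inside a Transformer encoder-decoder and predicts boxes and masks for each retrieved instance:

```python
def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def early_fuse(image_feats, prompt_emb):
    """Toy stand-in for early image-prompt fusion: bias every image
    feature vector toward the prompt embedding. (The paper uses
    attention-based fusion; this additive rule is an assumption.)"""
    return [[f + p for f, p in zip(feat, prompt_emb)] for feat in image_feats]

def retrieve(instance_embs, prompt_emb, top_k=1):
    """The 'retrieval' half of the reformulation: score each discovered
    instance against the prompt and keep the best-matching indices."""
    scored = sorted(
        enumerate(dot(e, prompt_emb) for e in instance_embs),
        key=lambda t: t[1],
        reverse=True,
    )
    return [idx for idx, _ in scored[:top_k]]
```

Here `early_fuse` stands in for the early image-prompt fusion the review describes, and `retrieve` for matching the discovered instances against the prompt embedding; for example, `retrieve([[0.1, 0.9], [0.8, 0.2]], [1.0, 0.0])` selects the instance whose embedding best aligns with the prompt.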
Strong Numerical Performance
A significant aspect of UNINEXT is its experimentally validated excellence across numerous benchmarks. The paper reports superior performance on 20 challenging datasets spanning tasks from traditional object detection to video-level object segmentation and tracking. For instance, with a ViT-Huge backbone UNINEXT reaches a box AP of 60.6, surpassing state-of-the-art detectors such as DN-Deformable-DETR by a substantial margin.
Such quantitative results underscore the effectiveness of unified training over task-specific counterparts. The implementation demonstrates remarkable scalability across diverse vision tasks, achieving high mAP scores in detection and segmentation benchmarks while excelling in video object tracking and segmentation tasks.
Implications and Future Directions
UNINEXT's approach offers several implications. Practically, it reduces the computational footprint and complexity associated with maintaining separate models for disparate tasks, valuable for development environments with limited resources. Theoretically, it suggests a promising direction for further integration of vision tasks, proposing a paradigm that could extend to zero-shot or open-vocabulary object detection.
Future exploration might include enhancing the model's adaptability to unseen classes, potentially leveraging frameworks like contrastive learning within the prompt-guided mechanism. The design could be extended to accommodate more intricate multi-modal tasks or to explore efficient fine-tuning strategies for domain-specific applications.
Conclusion
Overall, the paper represents a comprehensive attempt at building a universal model for instance perception. While the authors stop short of claiming that every task fits flawlessly under a single umbrella, the results clearly position UNINEXT as a substantial step forward. Its success across multiple benchmarks reaffirms the viability of a unified approach and sets a precedent for future research in the domain.
By amalgamating diverse task domains under a unified architecture, UNINEXT not only represents an impressive technical achievement but also sparks discussion about how to achieve broader generalization in machine learning models.