- The paper introduces APE, a unified visual system that reformulates object detection and visual grounding into an instance-level sentence-object matching framework.
- It employs a gated cross-modality interaction and a thing-stuff equalizing mechanism to enhance efficiency and achieve state-of-the-art performance across benchmarks.
- APE demonstrates robust results on datasets like LVIS, MSCOCO, and ADE20K, streamlining multi-task visual perception with a single set of weights.
Aligning and Prompting Everything All at Once for Universal Visual Perception: A Summary
The paper "Aligning and Prompting Everything All at Once for Universal Visual Perception" by Yunhang Shen et al. describes the development of a versatile vision system called APE (Align and Prompt Everything), which is designed to handle multiple visual perception tasks such as detection, segmentation, and grounding. APE challenges current methodologies by introducing an instance-level sentence-object matching paradigm, thereby achieving state-of-the-art or competitive performance across a variety of datasets with a single suite of weights.
Key Innovations and Methodology
Instance-level Sentence-Object Matching
APE reformulates conventional object detection and visual grounding as a unified instance-level sentence-object matching problem. In contrast to existing models, which often process instance-level and pixel-level tasks separately, this formulation integrates tasks more seamlessly and reduces the complexity of cross-modality interaction.
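A minimal sketch of the matching idea follows, assuming per-instance object-query embeddings and per-sentence prompt embeddings are already available; the function name, shapes, and temperature are illustrative stand-ins rather than APE's actual interface.

```python
# Minimal sketch of instance-level sentence-object matching, assuming
# precomputed object-query embeddings and sentence (prompt) embeddings.
# Names and shapes are illustrative, not the paper's actual interface.
import torch
import torch.nn.functional as F

def sentence_object_matching(query_embeds, sentence_embeds, temperature=0.07):
    """Score every object query against every sentence prompt.

    query_embeds:    (num_queries, dim)   per-instance visual embeddings
    sentence_embeds: (num_sentences, dim) one embedding per category name
                     or grounding description
    Returns a (num_queries, num_sentences) matching-logit matrix.
    """
    q = F.normalize(query_embeds, dim=-1)
    s = F.normalize(sentence_embeds, dim=-1)
    return q @ s.t() / temperature

# Toy usage: 100 object queries matched against 3 prompts.
queries = torch.randn(100, 256)
prompts = torch.randn(3, 256)   # e.g. "a red car", "grass", "person"
logits = sentence_object_matching(queries, prompts)
pred_prompt = logits.argmax(dim=-1)  # best-matching sentence per instance
```

Because classification reduces to a similarity lookup in this view, adding a new category or grounding description only means adding one more sentence embedding rather than changing the detector head.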
Gated Cross-modality Interaction
To address inefficient cross-modality interactions in prior work, APE introduces a gated interaction mechanism that applies selective attention to visual and linguistic components. This allows the model to scale to vocabularies with thousands of categories and their descriptions without a significant increase in computational overhead.
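One way to picture such a mechanism is a cross-attention layer whose output is modulated by a learned gate before being added back to the visual stream. The layer below is a generic, hedged illustration of this pattern under that assumption, not APE's exact architecture.

```python
# Hedged sketch of a gated cross-modality interaction layer: a sigmoid gate
# modulates how much attended text information is injected into the visual
# tokens. This is a generic illustration, not APE's actual design.
import torch
import torch.nn as nn

class GatedCrossModalLayer(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N_vis, dim); text_tokens: (B, N_txt, dim)
        attended, _ = self.cross_attn(visual_tokens, text_tokens, text_tokens)
        # The gate decides, per token and channel, how much language
        # information flows in, even when text_tokens covers a very
        # large prompt vocabulary.
        g = self.gate(attended)
        return visual_tokens + g * attended

# Toy usage: fuse 900 visual tokens with embeddings of 1,000 prompts.
layer = GatedCrossModalLayer()
vis = torch.randn(2, 900, 256)
txt = torch.randn(2, 1000, 256)
fused = layer(vis, txt)
```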
Thing-Stuff Equalizing
APE equalizes the segmentation granularity of foreground objects (things) and background regions (stuff) by converting category-level segmentation into instance-level objectives. This removes the need to manually distinguish things from stuff and yields a more uniform learning process.
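As a rough illustration, a category-level stuff annotation can be turned into instance-level targets by treating each stuff class present in the image as a single instance mask. The helper below assumes a semantic mask of integer class IDs and is a simplification for clarity, not the paper's exact procedure.

```python
# Sketch of converting category-level "stuff" annotations into
# instance-level targets, assuming a semantic mask of integer class IDs.
# Treating each stuff class as one instance gives things and stuff the
# same granularity; details may differ from the paper.
import torch

def stuff_to_instances(semantic_mask, stuff_class_ids):
    """semantic_mask: (H, W) integer tensor of class IDs.
    Returns a list of (class_id, binary_mask) instance targets for stuff."""
    instances = []
    for cls in stuff_class_ids:
        mask = semantic_mask == cls
        if mask.any():
            instances.append((cls, mask))
    return instances

# Toy usage: classes 10 ("sky") and 11 ("road") are treated as stuff.
sem = torch.randint(0, 20, (4, 4))
stuff_instances = stuff_to_instances(sem, stuff_class_ids=[10, 11])
```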
Performance Analysis
Object Detection
APE was evaluated on several benchmarks, including large-scale and long-tailed datasets like LVIS and common object detection datasets such as MSCOCO and Objects365. Results show that APE achieves state-of-the-art performance, outperforming other models like GLIP, OWL, and UNINEXT. Notably, APE's superior performance on ODinW and Roboflow, which feature varied and real-world scenarios, demonstrates its robustness and generalizability.
Image Segmentation
For segmentation, APE outperforms existing models on benchmarks with large category vocabularies such as ADE20K, PC-459, and SegInW. Treating things and stuff at the same instance-level granularity markedly improves panoptic performance even though the model is trained with instance-level data, and APE remains competitive on datasets with fewer categories such as VOC and Cityscapes.
Visual Grounding
By reformulating grounding as open-vocabulary detection, APE improves efficiency and sets new state-of-the-art results on visual grounding tasks, most notably on the D3 dataset. This lets APE handle numerous descriptions in a single pass, with significant gains in both intra- and inter-scenario evaluations.
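The sketch below illustrates the single-pass idea: every referring description is encoded as one prompt, and all region proposals are scored against all descriptions at once. The `detector`, `encode_text`, and threshold used here are hypothetical stand-ins, not APE's real API.

```python
# Illustrative sketch of treating grounding descriptions as detection
# "categories": each description becomes one prompt, so one forward pass
# scores all regions against all descriptions. All callables are placeholders.
import torch

def ground_descriptions(image, descriptions, detector, encode_text, thr=0.3):
    prompt_embeds = encode_text(descriptions)          # (num_desc, dim)
    boxes, query_embeds = detector(image)              # (Q, 4), (Q, dim)
    logits = query_embeds @ prompt_embeds.t()          # (Q, num_desc)
    scores = logits.sigmoid()
    results = {}
    for j, desc in enumerate(descriptions):
        keep = scores[:, j] > thr                      # regions matching desc
        results[desc] = boxes[keep]
    return results

# Toy usage with random stand-ins for the text encoder and detector.
enc = lambda descs: torch.randn(len(descs), 256)
det = lambda img: (torch.rand(100, 4), torch.randn(100, 256))
out = ground_descriptions(None, ["the dog on the left", "a blue umbrella"], det, enc)
```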
Implications and Future Directions
Overall, APE supports a bold claim: an effective, universal perception model is feasible when diverse tasks are aligned under a unified framework. Its ability to perform multiple tasks with a single set of weights is a significant advance that simplifies the pipeline for real-world applications.
Looking forward, the principles underlying APE's design open avenues for further research into more generalized and scalable vision systems. Future enhancements could include finer-grained text-visual interaction and extending the model to broader sets of tasks and domains. The release of code and trained models provides a foundation for continued exploration and adaptation within the research community.
In conclusion, APE represents a substantial step towards building a universal visual perception system by efficiently integrating detection, segmentation, and grounding tasks. Its innovations in instance-level sentence-object matching and cross-modality interaction promise to streamline complex AI workflows, setting a new standard for vision foundation models.